Old 08-17-2019, 04:12 AM   #931
capink
Wizard
Posts: 1,095
Karma: 1954136
Join Date: Aug 2015
Device: Kindle
One of the good predictors of epub quality is the size of the CSS file. Epubs with less than 1 KB of CSS usually turn out to be of bad quality. I checked my library that way outside calibre by extracting the size of the CSS for each epub. My coding skills are not really up to the task of adding such functionality to this plugin. If someone is still maintaining the plugin, they might consider adding this.
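For anyone wanting to repeat this check outside calibre, here is a minimal sketch in Python. It assumes an EPUB is an ordinary ZIP archive and that stylesheets can be recognised by a .css extension. The names `css_sizes` and `has_small_css` and the 1024-byte default are illustrative only and are not part of the Quality Check plugin.

```python
import io
import zipfile


def css_sizes(epub):
    """Map each .css entry in the EPUB to its uncompressed size in bytes.

    `epub` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as zf:
        return {
            info.filename: info.file_size
            for info in zf.infolist()
            if info.filename.lower().endswith(".css")
        }


def has_small_css(epub, threshold=1024):
    """True when the total CSS in the book is below `threshold` bytes.

    A book with no CSS at all also counts as "small" under this rule.
    """
    return sum(css_sizes(epub).values()) < threshold
```

Summing all stylesheets rather than testing each one avoids flagging books that split their styles over several small files.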
Old 08-17-2019, 08:56 AM   #932
theducks
Well trained by Cats
Posts: 29,820
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by capink View Post
One of the good predictors of epub quality is the size of the CSS file. Epubs with less than 1 KB of CSS usually turn out to be of bad quality. I checked my library that way outside calibre by extracting the size of the CSS for each epub. My coding skills are not really up to the task of adding such functionality to this plugin. If someone is still maintaining the plugin, they might consider adding this.
Boy is that a wide shot.

IMHO a large CSS file is a possible sign of bloat.
Here is my basic stylesheet (modeled after Webscriptions of old): 609 bytes, and it includes some stuff some would have kittens over.

Code:
body {
    display: block;
    font-size: 1.2em;
    margin-bottom: 0;
    margin-left: 2pt;
    margin-right: 2pt;
    margin-top: 0;
    padding-left: 0;
    padding-right: 0;
    text-align: justify;
}

.indented {
    display: block;
    margin: 0.5em 0 0 0;
    text-indent: 1.5em;
}

.nonindented {
    display: block;
    margin: 0.5em 0 0 0;
    text-indent: 0;
}

.chapno {
    display: block;
    font-size: 1.5em;
    margin: 1em 0;
    border: 0;
    padding: 0;
    text-indent: 0;
    text-align: center;
}

.scene {
    display: block;
    margin: 1em 0;
    text-align: center;
}
Old 08-17-2019, 03:38 PM   #933
DNSB
Bibliophagist
Posts: 35,513
Karma: 145557716
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Quote:
Originally Posted by theducks View Post
Boy is that a wide shot.

IMHO a large CSS file is a possible sign of bloat.
Here is my basic stylesheet (modeled after Webscriptions of old): 609 bytes, and it includes some stuff some would have kittens over.
I just checked my collection and found ~120 books that had a CSS file under 1 KB. After removing those that had multiple CSS files, about 50 remained. Several did look as if the CSS had been trimmed to remove unused styles, but overall none of them looked bad. Perhaps a somewhat simple layout, but eminently readable.
Old 08-17-2019, 04:32 PM   #934
JSWolf
Resident Curmudgeon
Posts: 74,049
Karma: 129333562
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
One thing I really dislike is the margin shorthand. It doesn't need to exist. Just use the individual margin properties (margin-top, margin-bottom and so on), as they are easier to read.

What I do is load the eBook into Calibre. I then delete every useless HTML file, and if there are multiple CSS files, I delete the ones I no longer need and merge the rest. I also have Calibre remove unused CSS in the CSS file and HTML. Then I add in my own body and p classes. Then I make any other changes needed.

The problem is that if I drop in my own full CSS, I end up having to figure out what from the publisher CSS goes with my CSS and what I cannot dump because I don't have a version of it. It's too much hassle. It's a lot easier to modify what's there.
Old 09-15-2019, 09:52 AM   #935
theducks
Well trained by Cats
Posts: 29,820
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Request for a new check (triggered by a user question in Library Management):

Check (structure?) for EPUB version: 2 or 3 (user-selectable)
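One way such a check could work, sketched in Python outside the plugin: an EPUB's META-INF/container.xml names the OPF package file, and the `version` attribute on the OPF's root `package` element is normally "2.0" or "3.0". The function name `epub_version` is hypothetical, and this sketch does no validation beyond reading that attribute.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def epub_version(epub):
    """Return the OPF `version` attribute (e.g. "2.0" or "3.0"), or None.

    `epub` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            return None
        opf = ET.fromstring(zf.read(rootfile.get("full-path")))
        return opf.get("version")
```

A library-wide check would then compare the returned string against whichever version the user selected.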
Old 09-18-2019, 06:04 PM   #936
icallaci
Guru
Posts: 769
Karma: 6528026
Join Date: Sep 2012
Device: Kobo Elipsa
Is there any possibility of adding a check for invalid CSS properties to this wonderful plugin? It does everything else I need, so a check for invalid CSS properties would make my life complete. Thank you for this very useful tool.

Last edited by icallaci; 09-18-2019 at 07:44 PM.
Old 12-23-2019, 12:37 PM   #937
kboogie222
Junior Member
Posts: 8
Karma: 10
Join Date: Aug 2019
Location: New Jersey
Device: Kindle Oasis 2
[Enhancement] Bad Breaks Search

Sadly I am not an engineer, but if anyone finds the challenge interesting, I'm happy to help with testing, etc. Thanks everyone for making this such a great application and plugin.

USER STATEMENT
As a Calibre user, I would like to be able to detect books with "Bad Breaks" so that I can repair or replace them with more readable versions.

BACKGROUND
As a result of poor conversion, there are often books that have "Bad Breaks", where a line break is inserted mid-sentence. This results in a new line that typically begins with a lowercase letter. This is very common and likely one of the biggest quality and readability issues in many Calibre users' libraries. Unfortunately, there is no easy way to search an entire library and identify books that have Bad Breaks.

ACCEPTANCE CRITERIA
* User is able to search an entire library for books that have bad breaks
* User can search mobi, epub and azw formats
* User is able to set a threshold for number of bad breaks identified
* Results are displayed in filtered view

EXTRA CREDIT
* User is able to sample book and set page size or word count for sample size
* User can search additional document formats
Attached image: bad breaks example and notes.png (184.9 KB)
Old 12-23-2019, 06:31 PM   #938
snarkophilus
Wannabe Connoisseur
Posts: 425
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
Quote:
Originally Posted by kboogie222 View Post
USER STATEMENT
As a Calibre user, I would like to be able to detect books with "Bad Breaks" so that I can repair or replace them with more readable versions.
Great idea, but I can see this getting hairy quickly! Some books use lower case chapter names, so an algorithm smart enough to pick out lower case letters at the start of a paragraph style, rather than a chapter name style, would be nice. Maybe something that counts the number of times a style was used? I've also seen cases where a paragraph finishes with a lower case letter or a comma and the next starts with an upper case character, and they are still "bad breaks".

I'm sure there's someone in the Sigil world who has built up a fancy regex to find many of these. (Quick search...) There are some examples here, here, here and here.

Definitely a handy one if it could be implemented.
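The "count how often a style is used" idea could look something like the sketch below. It assumes paragraph styles are expressed as a `class` attribute on `<p>` tags and treats the most frequent class as the body-text style; both assumptions, and the function names, are illustrative rather than anything the plugin does.

```python
import re
from collections import Counter

# Captures the class attribute of <p> tags (and not <pre>, <param>, ...).
P_CLASS = re.compile(r'<p\b[^>]*\bclass="([^"]+)"')


def paragraph_class_counts(html):
    """Count how many <p> tags use each class value."""
    return Counter(P_CLASS.findall(html))


def body_text_class(html):
    """Guess the body-text style: the most frequently used <p> class."""
    counts = paragraph_class_counts(html)
    return counts.most_common(1)[0][0] if counts else None
```

A lower case first letter would then only be reported for paragraphs using that class, which should skip most chapter-name styles.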
Old 12-23-2019, 09:18 PM   #939
theducks
Well trained by Cats
Posts: 29,820
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by snarkophilus View Post
Great idea, but I can see this getting hairy quickly! Some books use lower case chapter names, so an algorithm smart enough to pick out lower case letters at the start of a paragraph style, rather than a chapter name style, would be nice. Maybe something that counts the number of times a style was used? I've also seen cases where a paragraph finishes with a lower case letter or a comma and the next starts with an upper case character, and they are still "bad breaks".

I'm sure there's someone in the Sigil world who has built up a fancy regex to find many of these. (Quick search...) There are some examples here, here, here and here.

Definitely a handy one if it could be implemented.
I have about 5 or 6 that I use in Sigil. Only 2 do I ever run using the 'All' button. The rest I step through (it is still a very fast operation); sometimes I skip the replace.
There are still EXCEPTIONS (lots of publishers' boilerplate should not be touched).
I am currently reading a book that has an acronym that starts with a lower case letter.
A.M. or P.M. will fail. (One of my searches does fix Mr. Mrs. ... splits.)
Still, nothing beats the human eyeball for spotting errors.
Old 12-24-2019, 12:09 AM   #940
kboogie222
Junior Member
Posts: 8
Karma: 10
Join Date: Aug 2019
Location: New Jersey
Device: Kindle Oasis 2
Quote:
Originally Posted by snarkophilus View Post
Great idea, but I can see this getting hairy quickly! Some books use lower case chapter names, so an algorithm that was smart enough to pick lower case letters at the start of a paragraph style instead of a chapter name style would be nice.
Thanks so much for the direction. The links are super helpful and contain some useful regex and thinking around identifying and fixing the problems. It seems like this has been a big challenge in the community, dating back a decade or longer.

Judging from the conversation, fixing the problem would take some finesse, and likely some human judgement. I'm a little nervous to even take that on, hah. But clearly you all have been thinking about some improved approaches over the "Line un-wrap factor" that exists in Calibre.

From an identification perspective it sounds like we have two challenges: 1) accurately identifying the Bad Breaks via regex, and 2) implementing a regex search across an entire library.

Strictly from an identification vantage, do you think the regex posted here would do a decent job of identifying the breaks for the purpose of a quality check? Would it ignore the title edge case? Are there other edge cases that you would consider for the purposes of quality check and finding books with this problem?
Quote:
Find: </p>\s+<p class="calibre2">([a-z])
Replace: \1 (a space followed by \1)
I wish I knew more about the Quality Check plugin architecture. Once we have a tuned-up regex fingerprint, is the Quality Check plugin capable of searching across a library? Would it be straightforward to implement the search with an adjustable threshold and sample size?

This is really interesting, thanks so much for the direction on this!
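To see what the quoted Find/Replace pair does, here is a small Python sketch using the standard `re` module. The pattern is copied from the quote; note that `calibre2` is whatever class that particular book uses for body paragraphs, so the pattern is book-specific. The function names are made up for the illustration.

```python
import re

# Pattern from the quoted Sigil search: a paragraph end, whitespace, then a
# new "calibre2" paragraph whose first character is a lower case letter.
BAD_BREAK = re.compile(r'</p>\s+<p class="calibre2">([a-z])')


def count_bad_breaks(html):
    """Number of places where the pattern finds a suspected mid-sentence break."""
    return len(BAD_BREAK.findall(html))


def join_bad_breaks(html):
    """Apply the quoted replacement: a space followed by the captured letter."""
    return BAD_BREAK.sub(r' \1', html)
```

Because the pattern only matches a lower case first letter inside a `calibre2` paragraph, a chapter title would be skipped only if it uses a different class or starts with a capital, so it does not reliably ignore the lower case title edge case.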
Old 12-24-2019, 12:56 AM   #941
snarkophilus
Wannabe Connoisseur
Posts: 425
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
Quote:
Originally Posted by kboogie222 View Post
Strictly from an identification vantage, do you think the regex posted here would do a decent job of identifying the breaks for the purpose of a quality check? Would it ignore the title edge case? Are there other edge cases that you would consider for the purposes of quality check and finding books with this problem?
I don't think the regex needs to be especially complex. We might need to think about cases where there's a span at the end of the paragraph too, so looking for things like paragraphs that finish with [letter]</span></p>, for example. As mentioned in a few of the threads I linked to, songs and verses often finish a line without any punctuation, so they are likely to throw out false positives. That said, a regex like this

Code:
[\w",](</span>)?</p>
might be a good place to start?? Of course, one could always make the detection regex configurable in the plugin too!
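A quick way to get a feel for what that pattern flags is to run it over a few hand-written paragraphs. This is only a sketch with made-up sample strings and a hypothetical helper name; in Python 3, `\w` on a `str` matches Unicode letters, digits and the underscore.

```python
import re

# Paragraphs whose last character is a word character, a straight double
# quote or a comma, optionally followed by a closing span.
ENDS_UNFINISHED = re.compile(r'[\w",](</span>)?</p>')


def is_suspect(paragraph_html):
    """True when the paragraph ending matches the detection regex."""
    return ENDS_UNFINISHED.search(paragraph_html) is not None


samples = [
    '<p>He walked to the</p>',          # flagged: ends in a letter
    '<p>He walked to the store.</p>',   # not flagged: ends in a period
    '<p><span>and then,</span></p>',    # flagged: comma before </span></p>
    '<p>"Stop!"</p>',                   # flagged: the straight quote is in the class
]
flags = [is_suspect(s) for s in samples]
```

The last sample shows that dialogue ending in a straight closing quote is also reported, so the character class may need tuning.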

Quote:
I wish I knew more about the Quality Check plugin architecture. Once we had a tuned up regex fingerprint, is the Quality Check plugin capable of searching across a library? Would it be straight forward to implement the search with an adjustable threshold and sample size?
The "run across the library" thing looks quite easy at a glance. Picking a Quality Check check that looks in HTML files, there's not that much code involved:

Code:
    def check_epub_address(self):
        RE_ADDRESS = re.compile(r'</address>', re.UNICODE)

        def evaluate_book(book_id, db):
            path_to_book = db.format_abspath(book_id, 'EPUB', index_is_id=True)
            if not path_to_book:
                self.log.error('ERROR: EPUB format is missing: ', get_title_authors_text(db, book_id))
                return False
            try:
                with ZipFile(path_to_book, 'r') as zf:
                    for resource_name in self._manifest_worthy_names(zf):
                        extension = resource_name[resource_name.rfind('.'):].lower()
                        if extension in NON_HTML_FILES:
                            continue
                        else:
                            data = zf.read(resource_name).lower()
                            if RE_ADDRESS.search(data):
                                return True
                    return False

            except InvalidEpub as e:
                self.log.error('Invalid epub:', e)
                return False
            except:
                self.log.error('ERROR parsing book: ', path_to_book)
                self.log(traceback.format_exc())
                return False

        self.check_all_files(evaluate_book,
                             no_match_msg='No searched ePub books have \<address\> smart tags',
                             marked_text='epub_address_tags',
                             status_msg_type='ePub books for <address> smart tags')


Instead of just looking for at least one match for the regex, you could count the number of times the broken sentence regex appears and return True if the count exceeds a certain (configurable?) threshold.
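A standalone sketch of that counting idea, written against plain `zipfile` rather than the plugin's own helpers (so none of `check_all_files`, `_manifest_worthy_names` or the calibre database API is used here). The pattern, the extension list, the function names and the default threshold of 10 are all assumptions for illustration.

```python
import io
import re
import zipfile

# Assumed detection pattern: paragraph ends in a lower case letter or comma.
SUSPECT_END = re.compile(r'[a-z,](</span>)?</p>')
HTML_EXTENSIONS = (".html", ".htm", ".xhtml")


def count_suspect_breaks(epub, pattern=SUSPECT_END):
    """Total number of pattern matches over all HTML files in the EPUB."""
    total = 0
    with zipfile.ZipFile(epub) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(HTML_EXTENSIONS):
                continue
            # Decode to str so the str pattern works under Python 3.
            text = zf.read(name).decode("utf-8", errors="replace")
            total += sum(1 for _ in pattern.finditer(text))
    return total


def exceeds_threshold(epub, threshold=10, pattern=SUSPECT_END):
    """True when the book has more suspect paragraph endings than `threshold`."""
    return count_suspect_breaks(epub, pattern) > threshold
```

Inside the plugin, the body of `evaluate_book` would presumably return the equivalent of `exceeds_threshold(...)` instead of stopping at the first match.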

Your original goal of detecting all epubs in a library that have possible broken sentences doesn't seem that hard (he says!). Fixing those automatically? No thanks.

I'm still very new to Calibre plugins, so I may be leading you down the wrong path. Take all that I said above with a grain of salt, especially if someone more knowledgeable says something that contradicts me.
Old 12-24-2019, 02:40 AM   #942
snarkophilus
Wannabe Connoisseur
Posts: 425
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
Quote:
Originally Posted by snarkophilus View Post
As mentioned in a few of the threads I linked to, songs and verses often finish a line without any punctuation, so they are likely to throw out false positives.
Indeed, Stephen King's Christine has 145 matches for just [a-z]</p> and 245 matches for [a-z,]</p>. Almost all of these were in song verses at the start of each chapter, but there were three missing periods at the end of sentences, one comma that should have been a period and one actual occurrence of a break mid-sentence.
Old 12-24-2019, 03:42 AM   #943
JSWolf
Resident Curmudgeon
Posts: 74,049
Karma: 129333562
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Is anyone working on making this plugin work with Python 3?
Old 12-24-2019, 07:55 AM   #944
davidfor
Grand Sorcerer
Posts: 24,907
Karma: 47303748
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by JSWolf View Post
Is anyone working on making this plugin work with Python 3?
Can you report what errors you are seeing?
Old 12-24-2019, 08:32 AM   #945
JSWolf
Resident Curmudgeon
Posts: 74,049
Karma: 129333562
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by davidfor View Post
Can you report what errors you are seeing?
I'm not seeing any errors as I'm not using the 4.99+ beta. I'm just keeping track of what I use that's been updated.