MobileRead Forums - View Single Post

kboogie222 · 12-24-2019, 12:09 AM

Quote:

Originally Posted by snarkophilus

Great idea, but I can see this getting hairy quickly! Some books use lower case chapter names, so an algorithm that was smart enough to pick lower case letters at the start of a paragraph style instead of a chapter name style would be nice.

Thanks so much for the direction. The links are super helpful and contain some useful regex, plus good thinking on how to identify and fix the problem. It seems like this has been a big challenge in the community for a decade or longer.

Judging from the conversation, fixing the problem would take some finesse, and likely some human judgement. I'm a little nervous to even take that on, hah. But clearly you all have been thinking about improved approaches over the "Line un-wrap factor" that exists in Calibre.

From an identification perspective, it sounds like we have two challenges: 1) accurately identifying the bad breaks via regex, and 2) implementing a regex search across an entire library.

Strictly from an identification vantage, do you think the regex posted here would do a decent job of identifying the breaks for the purpose of a quality check? Would it ignore the title edge case? Are there other edge cases you would consider when using a quality check to find books with this problem?

Quote:

Find: </p>\s+<p class="calibre2">([a-z])
Replace: \1 (a space followed by \1)
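
For what it's worth, here is a minimal Python sketch of how I read the quoted Find/Replace. The sample HTML is made up, and I'm assuming the Replace field is meant to be a space followed by \1, as the note in the quote says.

```python
import re

# The quoted Find pattern: a paragraph break followed by a
# "calibre2" paragraph that starts with a lower-case letter.
BAD_BREAK = re.compile(r'</p>\s+<p class="calibre2">([a-z])')

# Made-up sample: the first break splits a sentence, the second is legitimate.
sample = (
    '<p class="calibre2">The rain had stopped and the</p>\n'
    '<p class="calibre2">street was quiet.</p>\n'
    '<p class="calibre2">Nobody spoke.</p>'
)

# Replace with a space followed by group 1, joining the split sentence.
fixed = BAD_BREAK.sub(r' \1', sample)
print(fixed)
```

Only the first break is joined, since the third paragraph starts with a capital letter. Note that the pattern is tied to class="calibre2", so a lower-case chapter title would only be skipped if it uses a different class than the body paragraphs. That is just my reading of the pattern, not something tested on real books.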

I wish I knew more about the Quality Check plugin architecture. Once we have a tuned-up regex fingerprint, is the Quality Check plugin capable of searching across a library? Would it be straightforward to implement the search with an adjustable threshold and sample size?
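
To make the threshold and sample size idea concrete, here is a rough sketch in plain Python. It does not use the Quality Check plugin or any Calibre API; the function names, the default values, and the idea of passing each book in as an HTML string are all my own assumptions for illustration.

```python
import re

# The quoted fingerprint, plus a looser pattern that matches any paragraph break.
BAD_BREAK = re.compile(r'</p>\s+<p class="calibre2">([a-z])')
ANY_BREAK = re.compile(r'</p>\s+<p\b[^>]*>')

def bad_break_ratio(html, sample_size=500):
    """Fraction of the first `sample_size` paragraph breaks that look bad."""
    breaks = 0
    bad = 0
    for m in ANY_BREAK.finditer(html):
        if breaks >= sample_size:
            break
        breaks += 1
        # Does the bad-break fingerprint match at this same position?
        if BAD_BREAK.match(html, m.start()):
            bad += 1
    return bad / breaks if breaks else 0.0

def flag_books(books, threshold=0.2, sample_size=500):
    """Return titles whose bad-break ratio exceeds `threshold`.

    `books` is a hypothetical dict mapping title -> HTML text.
    """
    return [
        title
        for title, html in books.items()
        if bad_break_ratio(html, sample_size) > threshold
    ]
```

Using a ratio instead of a raw count means a long book with a handful of stray matches would not be flagged, while a book that was hard-wrapped throughout would be. The right threshold is a guess and would need tuning against a real library.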

This is really interesting, thanks so much for the direction on this!