MobileRead Forums - View Single Post

snarkophilus · 12-24-2019, 12:56 AM

Quote:

Originally Posted by kboogie222

Strictly from an identification vantage, do you think the regex posted here would do a decent job of identifying the breaks for the purpose of a quality check? Would it ignore the title edge case? Are there other edge cases that you would consider for the purposes of quality check and finding books with this problem?

I don't think the regex needs to be especially complex. We might need to think about cases where there's a span at the end of the paragraph too, so we'd also look for paragraphs that finish with [letter]</span></p>, for example. As mentioned in a few of the threads I linked to, songs and verses often finish a line without any punctuation, so they are likely to produce false positives. That said, a regex like this

Code:

[\w",](</span>)?</p>

might be a good place to start? Of course, one could always make the detection regex configurable in the plugin too!
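To get a feel for what that pattern does and doesn't catch, here's a quick sketch in plain Python (nothing to do with the plugin itself). The sample paragraphs are made up by me purely for illustration:

```python
import re

# The detection pattern from above: a word character, a double quote, or a
# comma directly before </p>, with an optional closing </span> in between.
BROKEN_SENTENCE_RE = re.compile(r'[\w",](</span>)?</p>')

# Made-up sample paragraphs (not taken from any real book).
samples = [
    '<p>He walked to the</p>',            # ends in a letter: flagged
    '<p><span>She said,</span></p>',      # comma, then </span></p>: flagged
    '<p>He walked to the door.</p>',      # ends in a full stop: not flagged
    '<p>Row, row, row your boat</p>',     # a verse line: flagged, but a false positive
]

flags = [bool(BROKEN_SENTENCE_RE.search(s)) for s in samples]

for sample, flagged in zip(samples, flags):
    print(flagged, sample)
```

The last sample is the song/verse problem: the line is fine, but it still gets flagged because it ends without punctuation.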

Quote:

I wish I knew more about the Quality Check plugin architecture. Once we had a tuned up regex fingerprint, is the Quality Check plugin capable of searching across a library? Would it be straightforward to implement the search with an adjustable threshold and sample size?

The "run across the library" thing looks quite easy at a glance. Picking one of the Quality Check checks that looks in HTML files, there's not that much code involved:

Spoiler:

Instead of just looking for at least one match for the regex, you could count the number of times the broken sentence regex matches and return "true" if the count exceeds a certain (configurable?) threshold.
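A rough sketch of that counting idea in standalone Python. This is not the Quality Check plugin's actual code or API; the function names, the `html_files` argument (assumed to be the text of each HTML file in the book) and the default threshold of 5 are all my own inventions for illustration:

```python
import re

# Same detection pattern as earlier in this post.
BROKEN_SENTENCE_RE = re.compile(r'[\w",](</span>)?</p>')


def count_broken_paragraphs(html):
    """Count paragraph endings in one HTML string that match the pattern."""
    return sum(1 for _ in BROKEN_SENTENCE_RE.finditer(html))


def book_looks_broken(html_files, threshold=5):
    """Return True once the total match count goes above the threshold.

    html_files: iterable of HTML strings, one per file in the book
                (hypothetical input, not how the plugin hands over files).
    threshold:  arbitrary default; meant to be user-configurable.
    """
    total = 0
    for html in html_files:
        total += count_broken_paragraphs(html)
        if total > threshold:
            return True  # stop early, no need to scan the rest of the book
    return False


# Made-up two-file "book" with three suspicious paragraph endings in total.
book = [
    '<p>He walked to the</p><p>door and opened it.</p>',
    '<p>She said,</p><p><span>and then</span></p><p>The end.</p>',
]
print(book_looks_broken(book, threshold=2))
print(book_looks_broken(book, threshold=3))
```

A threshold above zero should help with books that only have a handful of verse lines or unpunctuated headings, at the cost of missing books with just a few genuinely broken sentences.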

Your original goal of detecting all epubs in a library that have possible broken sentences doesn't seem that hard (he says!). Fixing those automatically? No thanks.

I'm still very new to Calibre plugins, so I may be leading you down the wrong path. Take everything I said above with a grain of salt, especially if someone more knowledgeable says something that contradicts me.