Quote:
Originally Posted by kboogie222
Strictly from an identification vantage, do you think the regex posted here would do a decent job of identifying the breaks for the purpose of a quality check? Would it ignore the title edge case? Are there other edge cases that you would consider for the purposes of quality check and finding books with this problem?
|
I don't think the regex needs to be especially complex. We might also need to handle cases where a span closes at the end of the paragraph, so looking for paragraphs that finish with [letter]</span></p>, for example. As mentioned in a few of the threads I linked to, songs and verses often end a line without any punctuation, so they are likely to throw up false positives. That said, a regex like this
Code:
[\w",](</span>)?</p>
might be a good place to start? Of course, one could always make the detection regex configurable in the plugin too!
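To make that concrete, here is a minimal sketch of the regex run against some sample markup. The sample HTML and the helper name find_suspect_endings are made up for illustration; only the pattern itself comes from this post.

```python
import re

# The detection regex from the post: a word character, double quote, or comma
# directly before </p>, optionally with a closing </span> in between.
BROKEN_PARA = re.compile(r'[\w",](</span>)?</p>')

# Hypothetical sample markup, invented for this illustration.
SAMPLE = (
    '<h1>Chapter One</h1>\n'               # title in a heading tag: no </p>, so ignored
    '<p>The sentence stops in the</p>\n'   # ends on a letter: flagged
    '<p>middle and resumes here.</p>\n'    # ends on a full stop: not flagged
    '<p><span>He paused,</span></p>\n'     # comma, then </span></p>: flagged
)


def find_suspect_endings(html):
    """Return the matched text of every paragraph ending the regex flags."""
    # finditer + group(0) gives the whole match; findall would return only
    # the contents of the capture group.
    return [m.group(0) for m in BROKEN_PARA.finditer(html)]


print(find_suspect_endings(SAMPLE))
# ['e</p>', ',</span></p>']
```

Note that the title is only skipped here because it sits in a heading tag. A title marked up as a plain paragraph would still be flagged.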
Quote:
I wish I knew more about the Quality Check plugin architecture. Once we had a tuned up regex fingerprint, is the Quality Check plugin capable of searching across a library? Would it be straight forward to implement the search with an adjustable threshold and sample size?
|
The "run across the library" part looks quite easy at a glance. Taking one of the existing Quality Check checks that looks inside HTML files as an example, there's not that much code involved:
Spoiler:
Code:
def check_epub_address(self):
    RE_ADDRESS = re.compile(r'</address>', re.UNICODE)

    def evaluate_book(book_id, db):
        path_to_book = db.format_abspath(book_id, 'EPUB', index_is_id=True)
        if not path_to_book:
            self.log.error('ERROR: EPUB format is missing: ', get_title_authors_text(db, book_id))
            return False
        try:
            with ZipFile(path_to_book, 'r') as zf:
                for resource_name in self._manifest_worthy_names(zf):
                    extension = resource_name[resource_name.rfind('.'):].lower()
                    if extension in NON_HTML_FILES:
                        continue
                    else:
                        data = zf.read(resource_name).lower()
                        if RE_ADDRESS.search(data):
                            return True
            return False
        except InvalidEpub as e:
            self.log.error('Invalid epub:', e)
            return False
        except:
            self.log.error('ERROR parsing book: ', path_to_book)
            self.log(traceback.format_exc())
            return False

    self.check_all_files(evaluate_book,
                         no_match_msg='No searched ePub books have <address> smart tags',
                         marked_text='epub_address_tags',
                         status_msg_type='ePub books for <address> smart tags')
Instead of just looking for at least one match for the regex, you could count the number of times the broken-sentence regex matches and return True if the count exceeds a certain (configurable?) threshold.
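A rough sketch of that counting idea, written as a standalone function so it can be tried outside the plugin. The function name, the threshold default, and the sample strings are all assumptions of mine, not anything from Quality Check itself.

```python
import re

# Same detection regex as suggested earlier in this post.
BROKEN_PARA = re.compile(r'[\w",](</span>)?</p>')


def has_broken_sentences(html_documents, threshold=5, pattern=BROKEN_PARA):
    """Return True once more than `threshold` suspect endings are seen.

    `html_documents` is any iterable of HTML strings, e.g. the decoded
    contents of each HTML file in one book. (Hypothetical helper.)
    """
    count = 0
    for html in html_documents:
        for _ in pattern.finditer(html):
            count += 1
            if count > threshold:
                # Stop early: no need to scan the rest of the book.
                return True
    return False


# Made-up sample: three paragraphs end without punctuation, one ends properly.
book = ['<p>one</p><p>two</p>', '<p>three</p><p>Done.</p>']
print(has_broken_sentences(book, threshold=2))  # True  (3 matches > 2)
print(has_broken_sentences(book, threshold=3))  # False (3 matches, not > 3)
```

Inside a check like the one above, this would presumably replace the single RE_ADDRESS.search(data) test. One thing to watch: under Python 3, zf.read() returns bytes, so the data would need decoding (or the pattern would need to be a bytes pattern) before searching.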
Your original goal of detecting all epubs in a library that have possible broken sentences doesn't seem that hard (he says!). Fixing those automatically? No thanks.
I'm still very new to Calibre plugins, so I may be leading you down the wrong path. Take everything I said above with a grain of salt, especially if someone more knowledgeable says something that contradicts me.