Firstly, thanks to drMerry for the suggestions and testing in this thread. It has become obvious from several of your reports that the original regex used in this plugin was extremely conservative. For this release I have used a variant of what drMerry proposed (it no longer looks for textual prefixes like ISBN), which significantly increases the match rate.
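To give a rough idea of what prefix-free matching looks like, here is a minimal Python sketch. It is not the regex the plugin actually uses: the pattern, the names `ISBN_CANDIDATE`, `is_valid_isbn` and `find_isbns`, and the check-digit filter are all assumptions made for illustration. The idea is simply to pick up any 10 or 13 character candidate (with optional hyphens or spaces) and keep it only if its check digit is valid.

```python
import re

# Hypothetical pattern (NOT the plugin's actual regex): 13 digits, or 9 digits
# plus a final digit/X, with an optional hyphen or space after each character.
ISBN_CANDIDATE = re.compile(
    r"(?<!\d)(?:(?:\d[- ]?){12}\d|(?:\d[- ]?){9}[\dXx])(?!\d)"
)


def is_valid_isbn(digits):
    """Check the ISBN-10 or ISBN-13 check digit of a separator-free string."""
    if len(digits) == 10:
        # ISBN-10: weights 10 down to 1, X stands for 10, sum divisible by 11.
        values = [10 if c in "Xx" else int(c) for c in digits]
        return sum((10 - i) * v for i, v in enumerate(values)) % 11 == 0
    if len(digits) == 13:
        # ISBN-13: alternating weights 1 and 3, sum divisible by 10.
        total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(digits))
        return total % 10 == 0
    return False


def find_isbns(text):
    """Return valid ISBNs in text, separators removed, in order of appearance."""
    found = []
    for match in ISBN_CANDIDATE.finditer(text):
        digits = re.sub(r"[- ]", "", match.group()).upper()
        if is_valid_isbn(digits) and digits not in found:
            found.append(digits)
    return found


print(find_isbns("Copyright 2010. 978-0-306-40615-7 (hardback)"))
```

Dropping the prefix requirement means any long number becomes a candidate, which is why this sketch filters on the check digit; whether the plugin does the same is not something this example claims.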
I have also replaced the PDF processing with something that is many orders of magnitude faster, by scanning only the first 10 and last 5 pages of a PDF.
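The page selection itself can be sketched in a few lines of Python. This does not use Calibre's PDF engine at all; `pages_to_scan` is a hypothetical helper, and the 0-based indexing and the handling of short documents (where the two ranges overlap) are assumptions for illustration.

```python
def pages_to_scan(page_count, first=10, last=5):
    """Return 0-based indices of the first `first` and last `last` pages.

    Hypothetical helper: short documents where the two ranges overlap
    simply yield every page once.
    """
    head = range(min(first, page_count))
    tail = range(max(page_count - last, 0), page_count)
    return sorted(set(head) | set(tail))


print(pages_to_scan(300))  # pages 0-9 and 295-299
print(pages_to_scan(12))   # all 12 pages, no duplicates
```

For a 300 page PDF this means 15 pages are read instead of 300, which is where most of the time saving would come from.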
Changes in v1.2:
- Rewritten for new plugin infrastructure in Calibre 0.7.53
- ISBN matching regex replaced
- PDFs now processed with new Calibre PDF engine to scan just first 10 and last 5 pages
See the attached text document for my test cases. Note that this release still makes no attempt to catch bad OCR scans (e.g. O instead of 0, I instead of 1, etc.). It also will not match numbers split across multiple lines, or text underneath graphics. I have also not yet optimised scanning of non-PDF formats.
It should, however, run significantly faster for PDFs and give you more matches than previously.