MobileRead Forums - View Single Post

feuille · 04-02-2023, 10:04 AM

Quote:

Originally Posted by retval

Greetings to all, I wanted to suggest an add-on that adds OCR text to PDFs.
As you may have experienced, there are old PDFs that were scanned as images. That's why you can't use Calibre's full-text search on them.

Hello retval,
I like this idea. I've always converted scanned documents outside of Calibre, but a plugin would have some advantages for the workflow (saving the desired output format via Calibre's conversion tool, writing metadata, etc.).
The OCR SDK shouldn't be a problem. I have developed a Python program (I call it Doculyzer) that can be used in companies and government agencies to extract data from scanned PDF files or faxes and store it in a backend database. I use Tesseract as the OCR library, which delivers very good results (see attached example).
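To give a rough idea of what the OCR step could look like, here is a minimal sketch. It is not Doculyzer's code, just an illustration under some assumptions: the `tesseract` and `pdftoppm` (Poppler) command-line tools are installed and on the PATH, and the function names `rasterize_cmd`, `ocr_cmd` and `ocr_scanned_pdf` are made up for this example. Tesseract does not read PDFs directly, so each page is rendered to a PNG first and then turned into a one-page PDF with a searchable text layer.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that renders every PDF page to a PNG.

    pdftoppm names the output files <out_prefix>-<page number>.png.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, out_base, lang="eng"):
    """Build the tesseract command that writes <out_base>.pdf.

    The trailing "pdf" selects Tesseract's PDF output, which keeps the
    page image and adds an invisible text layer on top of it.
    """
    return ["tesseract", str(image_path), str(out_base), "-l", lang, "pdf"]


def ocr_scanned_pdf(pdf_path, work_dir, lang="eng"):
    """OCR a scanned PDF page by page (hypothetical helper).

    Returns the paths of the per-page searchable PDFs. Merging them back
    into a single file is left out of this sketch.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / "page"

    # Step 1: render the scanned pages to images.
    subprocess.run(rasterize_cmd(pdf_path, prefix), check=True)

    # Step 2: run Tesseract on each page image.
    results = []
    for image in sorted(work_dir.glob("page-*.png")):
        out_base = image.with_suffix("")  # page-1.png -> page-1
        subprocess.run(ocr_cmd(image, out_base, lang), check=True)
        results.append(out_base.with_suffix(".pdf"))
    return results
```

Calling `ocr_scanned_pdf("scan.pdf", "ocr_out")` would leave files such as `ocr_out/page-1.pdf` in the working directory. A plugin could then hand the result to Calibre's own conversion and metadata tools, but that part depends on how the plugin is built.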
If I have some time in the next few weeks, I'll integrate the Python code into a Calibre GUI plugin. Suggestions are welcome!