Quote:
Originally Posted by retval
Greetings to all, I wanted to suggest an add-on that adds an OCR text layer to PDFs.
As you may have experienced, there are old PDFs that were scanned as images. That's why you can't use Calibre's full-text search on them.
Hello retval,
I like this idea. I've always converted scanned documents outside of Calibre, but a plugin would have some advantages for the workflow (saving the desired output format via Calibre's conversion tool, writing metadata, etc.).
The OCR SDK shouldn't be a problem. I have developed a Python program (I call it Doculyzer) that can be used in companies and government agencies to extract data from scanned PDF files or faxes and store it in a backend database. I use Tesseract as the OCR library, which delivers very good results (see attached example).
If I have some time in the next few weeks, I'll integrate the Python code into a Calibre GUI plugin. Suggestions are welcome!
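To give a rough idea of the Tesseract step, here is a minimal sketch. It is not the Doculyzer code, just an illustration. It assumes the tesseract command-line program is installed and on the PATH, and that the pages of the scanned PDF have already been exported as image files by some other tool. The function names build_tesseract_command and ocr_page are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the CLI call that turns one page image into a searchable PDF.

    The trailing "pdf" is the Tesseract config that selects PDF output;
    Tesseract appends the ".pdf" extension to output_base itself.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_page(image_path, output_dir, lang="eng"):
    """Run Tesseract on one page image and return the path of the new PDF."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    image_path = Path(image_path)
    # Hypothetical naming scheme: reuse the image's file name without extension.
    output_base = Path(output_dir) / image_path.stem
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(str(output_base) + ".pdf")
```

Calling ocr_page("page1.png", "out") would write out/page1.pdf with the recognized text layered over the image, provided the out folder exists. A plugin would still have to split the source PDF into page images first and merge the per-page results afterwards.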