Quote:
Originally Posted by feuille
Hello retval,
I like this idea. I've always converted scanned documents outside of Calibre, but a plugin would have some advantages for the workflow (saving the desired output format via Calibre's conversion tool, writing metadata, etc.).
The OCR SDK shouldn't be a problem. I have developed a Python program (I call it Doculyzer) that can be used in companies and government agencies to extract data from scanned PDF files or faxes and store it in a backend database. I use Tesseract as the OCR library, which delivers very good results (see attached example).
If I have some time in the next few weeks, I'll integrate the Python code into a Calibre GUI plugin. Suggestions are welcome!
|
Great. I know nothing about programming; I'm just a user. Of the OCR programs I know, the one that I think gives the best results is Foxit. Many programs bog down the PC, since PDFs without OCR are often very large files. I don't know if that will be a limitation.
Hopefully you can create a plug-in that integrates OCR.
Best of luck with it.
PS: What program do you use for OCR? The test you uploaded is very good.