04-02-2023, 07:34 PM   #1717
retval
Enthusiast
Posts: 38
Karma: 10
Join Date: Mar 2011
Device: none
An add-in that integrates OCR for PDFs scanned as images.

Quote:
Originally Posted by feuille
Hello retval,
I like this idea. I've always converted scanned documents outside of Calibre, but a plugin would have some advantages for the process flow (save the desired output format via Calibre's conversion tool, writing metadata, etc.).
The OCR SDK shouldn't be a problem. I have developed a Python program (I call it Doculyzer) that can be used in companies and government agencies to extract data from scanned PDF files or faxes and store it in a backend database. I use Tesseract as the OCR library, which delivers very good results (see attached example).
If I have some time in the next few weeks, I'll integrate the Python code into a Calibre GUI plugin. Suggestions are welcome!
Great. I'm completely ignorant of programming; I'm just a user. Of the OCR programs I know, the one that I think achieves the best results is Foxyt. Many programs bog down the PC, since PDFs without OCR often run to many megabytes. I don't know if that will be a limitation.
Hopefully you can create a plug-in that integrates OCR.
Best of luck with it.
PS: What program do you use for OCR? The test you uploaded is very good.

Last edited by retval; 04-02-2023 at 07:40 PM.