04-02-2023, 10:04 AM   #1716
feuille
Connoisseur
Posts: 62
Karma: 666
Join Date: May 2020
Location: Germany
Device: android smartphone + tablet
I like this idea

Quote:
Originally Posted by retval View Post
Greetings to all, I wanted to suggest an add-on that allows inserting an OCR text layer into PDFs.
As you may have experienced, there are old PDFs that were scanned as images. That's why you can't use Calibre's full-text search on them.
Hello retval,
I like this idea. I've always converted scanned documents outside of Calibre, but a plugin would have some advantages for the workflow (saving the desired output format via Calibre's conversion tool, writing metadata, etc.).
The OCR SDK shouldn't be a problem. I have developed a Python program (I call it Doculyzer) that can be used in companies and government agencies to extract data from scanned PDF files or faxes and store it in a backend database. I use Tesseract as the OCR library, which delivers very good results (see attached example).
If I have some time in the next few weeks, I'll integrate the Python code into a Calibre GUI plugin. Suggestions are welcome!
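To give a rough idea of the Tesseract step, here is a minimal sketch. It is not the Doculyzer code, just an illustration. It assumes the `tesseract` command-line tool is installed and on the PATH, and that each PDF page has already been rendered to an image (for example a PNG). The function names `tesseract_pdf_command` and `ocr_page_image` are made up for this example.

```python
import shutil
import subprocess
from typing import List


def tesseract_pdf_command(image_path: str, output_base: str, lang: str = "eng") -> List[str]:
    """Build the Tesseract CLI call for one page image.

    The trailing "pdf" config asks Tesseract to write <output_base>.pdf,
    i.e. the page image with an invisible, searchable text layer.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def ocr_page_image(image_path: str, output_base: str, lang: str = "eng") -> str:
    """Run Tesseract on one page image and return the path of the PDF it writes.

    Assumption: the page was rendered to an image beforehand, since
    Tesseract does not read PDF files directly.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(tesseract_pdf_command(image_path, output_base, lang), check=True)
    return output_base + ".pdf"
```

For a German scan the call would be something like `ocr_page_image("page_1.png", "page_1", lang="deu")`, provided the `deu` language data is installed. A plugin would then have to merge the per-page PDFs and hand the result back to Calibre; that part is left out here.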
Attached Thumbnails
Name:	scanned_pdf_2.jpg
Views:	562
Size:	249.2 KB
ID:	200750  
Attached Files
File Type: txt scanned_pdf_2.txt (1.7 KB, 286 views)