04-03-2023, 02:50 AM   #1718
feuille
Connoisseur
Posts: 62
Karma: 666
Join Date: May 2020
Location: Germany
Device: android smartphone + tablet
Quote:
Originally Posted by retval
PS: what program do you use for OCR? The test you upload is very good.
As I wrote, I use my own frontend. It first splits the PDF into individual page images using the Poppler library (which Calibre also uses). The text in each image is then recognized with the Tesseract library. Because this happens image by image, the overall size of the PDF file is irrelevant. Tesseract (https://en.m.wikipedia.org/wiki/Tesseract_(software)) is free software under the Apache license, and various frontends exist for it. Tesseract also has training data for different fonts. For example, I have many books in German Fraktur, which I convert using the training data from the University of Mannheim.
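For anyone who wants to try the same two steps without a frontend, here is a rough sketch in Python. It is not my actual program, only an illustration that assumes the Poppler command-line tool pdftoppm and the tesseract command-line program are installed and on the PATH. The function names, the "page" file prefix, and the default language "eng" are made up for this example; for Fraktur you would pass the name of whatever trained model you have installed.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf: Path, out_prefix: Path, dpi: int = 300) -> list[str]:
    """Build the Poppler call that renders every PDF page to a PNG image."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def tesseract_command(image: Path, lang: str = "eng") -> list[str]:
    """Build the Tesseract call for one page image.

    Using "stdout" as the output base makes tesseract print the recognized
    text instead of writing a .txt file.
    """
    return ["tesseract", str(image), "stdout", "-l", lang]


def ocr_pdf(pdf: Path, workdir: Path, lang: str = "eng", dpi: int = 300) -> str:
    """Split the PDF into page images, then OCR them one at a time."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf, workdir / "page", dpi), check=True)

    texts = []
    # pdftoppm zero-pads the page number, so a plain sort keeps page order.
    for image in sorted(workdir.glob("page-*.png")):
        result = subprocess.run(
            tesseract_command(image, lang),
            check=True,
            capture_output=True,
            text=True,
        )
        texts.append(result.stdout)
    return "\n".join(texts)
```

Calling ocr_pdf(Path("book.pdf"), Path("pages"), lang="deu") would then return the text of the whole book as one string. Only one page image is handed to Tesseract at a time, which is why the size of the PDF does not matter much.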