Quote:
Originally Posted by retval
PS: what program do you use for OCR? The test you uploaded is very good.
As I wrote, I use my own frontend. It first splits the PDF into individual page images using the Poppler library (which Calibre also uses), then recognizes the text with the Tesseract library. Since this is done image by image, the overall size of the PDF file is irrelevant. Tesseract (https://en.m.wikipedia.org/wiki/Tesseract_(software)) is free software under an Apache license, and various frontends exist for it. Tesseract also has training data for different fonts. For example, I have many books in German Fraktur, which I convert with the training data from the University of Mannheim.
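For anyone who wants to try the same two-step approach, here is a rough sketch in Python. It is not my frontend: it assumes the command-line tools pdftoppm (from Poppler) and tesseract are installed and on the PATH, and it calls those instead of the libraries. The function names, the 300 dpi default and the "eng" default language are only illustrative; the language name has to match a traineddata file you actually have installed (for Fraktur, whatever the model file you downloaded is called).

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build the Poppler command that renders every PDF page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def tesseract_command(image_path, out_base, lang="eng"):
    """Build the Tesseract command; the result is written to <out_base>.txt."""
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def page_number(image_path):
    """Extract the page number from a pdftoppm file name like page-012.png."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, lang="eng", dpi=300):
    """OCR a PDF page by page and return the recognized text as one string."""
    texts = []
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        # Step 1: split the PDF into one image per page.
        subprocess.run(pdftoppm_command(pdf_path, tmp / "page", dpi), check=True)
        # Step 2: recognize each image on its own, in page order.
        for image in sorted(tmp.glob("page-*.png"), key=page_number):
            out_base = image.with_suffix("")
            subprocess.run(tesseract_command(image, out_base, lang), check=True)
            texts.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling ocr_pdf("book.pdf", lang="deu") would then return the text of the whole book. Because only one page image is handed to Tesseract at a time, the size of the PDF mainly affects how long it takes, not whether it works.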