MobileRead Forums - View Single Post

roger64 · 07-06-2021, 10:45 AM

PDFs with a text layer ("hybrid PDFs") can of course be processed directly, but image-only PDFs first need to be converted to an image format, which is not optimal.
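For what it's worth, here is a rough sketch of that conversion step in Python. It assumes the `pdftoppm` tool from poppler-utils is installed. That is my choice for illustration, not necessarily the only or best converter. The function names, the 300 dpi default and the `page` prefix are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Return the pdftoppm argument list that renders each page to a PNG.

    300 dpi is an assumed default, a common choice for OCR input.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def pdf_to_images(pdf_path, out_dir, dpi=300):
    """Render every page of pdf_path into out_dir and return the PNG paths."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm (poppler-utils) not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # pdftoppm appends "-<page number>.png" to the prefix it is given.
    prefix = out_dir / "page"
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)
    return sorted(out_dir.glob("page-*.png"))
```

Calling `pdf_to_images("book.pdf", "pages")` would then leave one PNG per page in `pages/`, ready to be fed to the OCR engine.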

Tesseract will soon reach v5.0 (an alpha version was published some months ago). From v4.0 onward, it has used a neural-network engine, with improved results.

If you output to plain text, you can quickly process a full book. The hOCR format is heavier to handle.
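To make the difference concrete, here is a minimal sketch of calling the `tesseract` command-line tool from Python for either output. It assumes Tesseract and the French language data (`fra`) are installed. The function names and the `fra` default are only my illustration.

```python
import subprocess


def build_tesseract_command(image_path, out_base, lang="fra", hocr=False):
    """Return the tesseract argument list for plain-text or hOCR output."""
    cmd = ["tesseract", str(image_path), str(out_base), "-l", lang]
    if hocr:
        # The "hocr" config asks for hOCR output instead of the default text.
        cmd.append("hocr")
    return cmd


def ocr_page(image_path, out_base, lang="fra", hocr=False):
    """Run tesseract on one page image and return the output file name."""
    subprocess.run(
        build_tesseract_command(image_path, out_base, lang, hocr), check=True
    )
    # tesseract adds the extension to the output base name itself.
    return str(out_base) + (".hocr" if hocr else ".txt")
```

For a whole book you would loop `ocr_page` over the page images and concatenate the `.txt` files. The hOCR files carry position information for the recognized text as well, which is why they are heavier to handle.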

I'll attach some examples of OCR results. I need a bit of time because there are no French paper books where I am at the moment...
