07-06-2021, 09:45 AM   #12
roger64
Wizard
PDFs with a text layer ("hybrid PDFs") can of course be processed directly, but image-only PDFs first need to be converted to an image format, which is not optimal.
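
For the image-only case, here is a minimal sketch of the conversion step. It assumes `pdftoppm` from poppler-utils is installed; that tool is my choice for illustration, not necessarily the one used for the results discussed here. The function names, the 300 dpi default and the `page` output prefix are likewise assumptions.

```python
import shutil
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm call that renders every PDF page to a PNG file."""
    # -r sets the resolution in dpi, -png selects PNG output.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def render_pages(pdf_path, out_dir, dpi=300):
    """Render the pages of pdf_path into out_dir and return the PNG paths."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm (poppler-utils) was not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, out_dir / "page", dpi), check=True)
    # pdftoppm zero-pads the page numbers, so sorting by name keeps page order.
    return sorted(out_dir.glob("page-*.png"))
```

Something like `render_pages("book.pdf", "pages")` would then give you one image per page to feed to the OCR engine.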

Tesseract will soon reach v5.0 (an alpha version was published some months ago). Since v4.0 it has used a neural-network (LSTM) engine, with improved results.

If you output to plain text, you can quickly process a full book. The hOCR format is heavier to handle.
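
To make the difference between the two outputs concrete, here is a rough sketch that drives the `tesseract` command-line tool from Python. It assumes Tesseract 4 or later is on the PATH and that the French language data (`fra`) is installed; the function names and the file layout are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, out_base, lang="fra", fmt="txt"):
    """Build a tesseract call that writes out_base.txt or out_base.hocr."""
    if fmt not in ("txt", "hocr"):
        raise ValueError("fmt must be 'txt' or 'hocr'")
    cmd = ["tesseract", str(image_path), str(out_base), "-l", lang]
    if fmt == "hocr":
        # The 'hocr' config file switches the output from plain text to hOCR.
        cmd.append("hocr")
    return cmd


def ocr_pages(image_paths, out_dir, lang="fra", fmt="txt"):
    """Run tesseract on each page image, one output file per page."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for image in image_paths:
        out_base = out_dir / Path(image).stem
        subprocess.run(tesseract_command(image, out_base, lang, fmt), check=True)
```

With `fmt="txt"` each page gives a small text file that is easy to concatenate into a book. With `fmt="hocr"` each page gives an HTML-like file that also carries the word positions, which is why it is heavier to handle.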

I'll attach some examples of OCR results. I need a bit of time because there are no French paper books where I am at the moment...