PDFs with a text layer ("hybrid PDFs") can of course be processed, but image-only PDFs first need to be converted to an image format, which is not optimal.
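For what it's worth, here is a minimal sketch of that conversion step. It assumes the `pdftoppm` tool from Poppler is installed (my choice of converter, others exist); the file name `book.pdf`, the `pages` folder and the 300 dpi setting are placeholders, not recommendations from any documentation.

```python
import shutil
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm call that renders each PDF page to a PNG file."""
    # -r sets the resolution in dpi, -png selects PNG output.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def pdf_to_images(pdf_path, out_dir, dpi=300):
    """Render the pages of pdf_path into out_dir and return the PNG paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, out_dir / "page", dpi), check=True)
    # pdftoppm names the files page-<number>.png with zero-padded numbers.
    return sorted(out_dir.glob("page-*.png"))


if __name__ == "__main__":
    # "book.pdf" is a hypothetical input file; nothing runs if it is missing.
    if shutil.which("pdftoppm") and Path("book.pdf").exists():
        for png in pdf_to_images("book.pdf", "pages"):
            print(png)
```

The extra rendering pass is the part that makes image-only PDFs slower to handle than files that are already images.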
Tesseract will soon reach v5.0 (an alpha version was published some months ago). From v4.0 onward, it has used a neural-network engine that gives improved results.
If you output plain text, you can process a full book quickly. The hOCR format is heavier to handle.
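To make the two output options concrete, here is a rough sketch of a batch run over a folder of page images. It assumes the `tesseract` command-line program and the French language data (`fra`) are installed; the folder names `pages` and `ocr` and the helper names are made up for the example.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, out_base, lang="fra", hocr=False):
    """Build a Tesseract call that writes out_base.txt, or out_base.hocr if hocr is True."""
    cmd = ["tesseract", str(image_path), str(out_base), "-l", lang]
    if hocr:
        # "hocr" is the config name that switches the output to hOCR.
        cmd.append("hocr")
    return cmd


def ocr_pages(image_dir, out_dir, lang="fra", hocr=False):
    """Run Tesseract on every PNG in image_dir, one output file per page."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(image_dir).glob("*.png")):
        cmd = tesseract_command(image, out_dir / image.stem, lang, hocr)
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "pages" and "ocr" are hypothetical folders; nothing runs if they are missing.
    if shutil.which("tesseract") and Path("pages").is_dir():
        ocr_pages("pages", "ocr")             # plain text, one .txt per page
        # ocr_pages("pages", "ocr", hocr=True)  # hOCR instead, one .hocr per page
```

The plain-text files can simply be concatenated to get the whole book, whereas the hOCR files also carry layout markup that has to be parsed afterwards.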
I'll attach some examples of OCR results. I need a bit of time, because there are no French paper books where I am at the moment...