Quote:
Originally Posted by roger64
A PDF with a text layer (a "hybrid PDF") can of course be processed, but image-only PDFs need to be converted to an image format first, which is not optimal.
|
I guess the book I experimented with must have been "hybrid." Now I understand what you're saying.
Quote:
Originally Posted by roger64
Tesseract will soon reach v5.0 (an alpha version was published some months ago). From v4.0 onward, it has used a neural engine, with improved results.
|
That's good to know. I think it works pretty well now, so I'm looking forward to the improvements.
Quote:
Originally Posted by roger64
If you output to text, you can quickly process a full book. The HOCR format is heavier to handle.
I'll attach some examples of OCR results. I need a bit of time because there are no French paper books where I am at the moment...
|
I haven't really tried the HOCR feature yet. The text output's only drawback is (as one poster mentioned) that it doesn't retain bold and italic. But making the text "flowable" is easy in Jstar. I was able to convert a 7-page Foreword from an older book in about ten minutes (including adding the italics). So a whole book (200 pages or so) would probably take a few hours. That's with clean text.
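For anyone who would rather script the "flowable" step than do it by hand in an editor, here is a minimal Python sketch. It is not what Jstar does. The `reflow` name and the rules it applies are assumptions for illustration: blank lines separate paragraphs, and a trailing hyphen means a word was split across two lines.

```python
import re


def reflow(text: str) -> str:
    """Join hard-wrapped OCR lines into flowing paragraphs.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    flowed = []
    for para in paragraphs:
        joined = ""
        for line in para.splitlines():
            line = line.strip()
            if not line:
                continue
            if joined.endswith("-"):
                # Assumption: a trailing hyphen marks a word split across lines,
                # so drop the hyphen and glue the two halves together.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        flowed.append(joined)
    # Keep one blank line between paragraphs.
    return "\n\n".join(flowed)
```

The hyphen rule is crude: a genuine compound that happens to break at its hyphen (such as "well-" / "known") would be merged into one word, so the result still needs a read-through. Italics and bold would also have to be added back by hand, as with the plain text output.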
Looking forward to seeing some of the OCR results. Thanks for all the information.