MobileRead Forums - View Single Post

roger64 · 07-16-2020, 07:11 AM

Quote:

Originally Posted by Sarmat89

Please, never-never-never use Tesseract or other headless OCR systems for books. All text must be proofed interactively. Also, that frontend is very primitive and it will mess up the text formatting.

After over one year of exclusive use of Tesseract (about 50 ebooks), I strongly disagree.

Tesseract 4.11, coupled with the latest tessdata 2.4 (ENG and FRA tested), is quite capable of OCRing any book efficiently.

With a good-quality scan, you can even OCR a full book directly (about 30 pages per minute) and save it in text format. The graphical interface (gImageReader-qt5) is quite clean:
- first, you can proofread your text line by line
- then, with a click, the text is changed into paragraphs separated by blank lines.
Roughly, I would say you get about one mistake per page on average (including accents and punctuation).
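
If you prefer the terminal for the "full book in one go" step, here is a rough sketch of how it could look with the tesseract command itself. This is not what gImageReader does internally, just my guess at an equivalent loop. The function name ocr_book, the .png extension and the example paths are assumptions for illustration; it also assumes your page files sort in reading order (page-001.png, page-002.png, ...) and that the language data you ask for is installed.

```shell
#!/bin/sh
# Sketch: OCR every page image in a directory into one text file.
# Assumptions: pages are PNG files whose names sort in reading order,
# and the requested Tesseract language data (e.g. fra) is installed.

ocr_book() {
    scan_dir=$1          # directory holding the page images
    out_file=$2          # combined text output
    lang=${3:-eng}       # Tesseract language code, defaults to eng

    : > "$out_file"      # start with an empty output file
    for page in "$scan_dir"/*.png; do
        [ -e "$page" ] || continue      # skip the literal pattern if nothing matched
        # Using "stdout" as the output base makes tesseract print the text
        tesseract "$page" stdout -l "$lang" >> "$out_file"
        printf '\n' >> "$out_file"      # blank line between pages
    done
}

# Example call (hypothetical paths):
# ocr_book ./scans book.txt fra
```

You would still want to proofread the resulting text file afterwards, as described above.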

Cons

No italics and no anchors; these need to be set up manually.
Garbage output for completely white pages (?)

Free tip

If you have white text on a black background, Tesseract will give you a blank page. So open a terminal and run ImageMagick first with this command (adapt as needed), then proceed as usual.

Code:

convert name-image.jpg -channel RGB -negate output.jpg
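
If only some pages of a book are inverted, something like the following might save you from checking each one by hand. It is only a sketch: the function names is_dark and prepare_page are made up, and the 0.5 brightness threshold is my own guess that you will probably need to adapt. It relies on ImageMagick's %[fx:mean] format escape, which prints the average intensity of the image on a 0 to 1 scale.

```shell
#!/bin/sh
# Sketch: negate only pages that look like light text on a dark background.
# Assumption: a mean gray level below 0.5 (0..1 scale) means "dark page".
# The threshold is a guess to adapt, not a tested value.

is_dark() {
    # %[fx:mean] prints the average intensity as a number between 0 and 1
    mean=$(convert "$1" -colorspace Gray -format '%[fx:mean]' info:)
    [ -n "$mean" ] || return 1          # treat unreadable images as "not dark"
    # awk does the floating-point comparison; exit status 0 means dark
    awk -v m="$mean" 'BEGIN { exit !(m < 0.5) }'
}

prepare_page() {
    src=$1
    dst=$2
    if is_dark "$src"; then
        convert "$src" -channel RGB -negate "$dst"   # same command as above
    else
        cp "$src" "$dst"                              # already dark text on light
    fi
}

# Example call (hypothetical file names):
# prepare_page name-image.jpg output.jpg
```

Then feed the prepared files to Tesseract as usual.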