Quote:
Originally Posted by Sarmat89
Please, never-never-never use Tesseract or other headless OCR systems for books. All text must be proofed interactively. Also, that frontend is very primitive and it will mess up the text formatting.
After over one year of exclusive use of Tesseract (about 50 ebooks), I strongly disagree.
Tesseract 4.11, coupled with the latest tessdata 2.4 (ENG and FRA tested), is quite capable of OCRing any book efficiently.
With a good-quality scan, you can even OCR a full book directly (about 30 pages per minute) and save it in text format. The graphical interface (gImageReader-qt5) is quite clean:
- First, you can proofread your text line by line.
- With one click, the text is reflowed into paragraphs separated by empty lines.
Roughly, I would say you can expect about one mistake per page on average (including accents and punctuation).
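For the "OCR a full book directly" route mentioned above, here is a minimal sketch of what that could look like from a terminal instead of the GUI. The function name ocr_book, the .png extension, and the page naming are my own assumptions for illustration; it only relies on the standard "tesseract IMAGE stdout -l LANG" invocation and assumes your page files sort in reading order (page-001.png, page-002.png, ...).

```shell
#!/bin/sh
# Sketch only: OCR every page image in a folder into a single text file.
# ocr_book is a hypothetical helper name, not part of Tesseract.
# Usage: ocr_book SRC_DIR OUT_FILE [LANG]
ocr_book() {
    src_dir=$1
    out_file=$2
    lang=${3:-eng}              # language defaults to English if not given

    : > "$out_file"             # start with an empty output file
    for img in "$src_dir"/*.png; do
        [ -e "$img" ] || continue   # skip if the folder has no .png files
        # "stdout" tells tesseract to print the text instead of writing a file
        tesseract "$img" stdout -l "$lang" >> "$out_file"
    done
}
```

Something like "ocr_book scans book.txt fra" would then give you one text file to proofread. You still need to go through it afterwards, this only replaces the clicking.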
Cons
No italics and no anchors; both need to be set up manually.
Garbage output for completely white pages (?)
Free tip
If you have white text on a black background, Tesseract will give you a blank page. So, open a terminal and invert the image with ImageMagick first using this command (adapt as needed), then proceed as usual.
Code:
convert name-image.jpg -channel RGB -negate output.jpg
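If the whole book is white on black, the same command can be looped over a folder. This is just a sketch under my own assumptions: negate_pages is a made-up helper name, the pages are assumed to be .jpg files, and the inverted copies go to a separate folder so the originals stay untouched.

```shell
#!/bin/sh
# Sketch only: invert every .jpg page in SRC_DIR and write the result to DST_DIR.
# negate_pages is a hypothetical helper name, not an ImageMagick command.
# Usage: negate_pages SRC_DIR DST_DIR
negate_pages() {
    src_dir=$1
    dst_dir=$2

    mkdir -p "$dst_dir"         # create the output folder if it is missing
    for img in "$src_dir"/*.jpg; do
        [ -e "$img" ] || continue   # skip if the folder has no .jpg files
        # same options as the single-image command above
        convert "$img" -channel RGB -negate "$dst_dir/$(basename "$img")"
    done
}
```

For example "negate_pages scans inverted", then point Tesseract or gImageReader at the inverted folder.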