07-16-2020, 07:11 AM   #23
roger64
Wizard
Quote:
Originally Posted by Sarmat89
Please, never-never-never use Tesseract or other headless OCR systems for books. All text must be proofed interactively. Also, that frontend is very primitive and it will mess up the text formatting.
After more than a year of using Tesseract exclusively (about 50 ebooks), I strongly disagree.

Tesseract 4.11, coupled with the latest tessdata 2.4 (ENG and FRA tested), is quite capable of OCRing any book efficiently.

With a good-quality scan, you can even OCR a full book directly (about 30 pages per minute) and save it in text format. The graphical interface (gImageReader-qt5) is quite clean.
- first, you can proofread your text line by line
- with a click, the text is turned into paragraphs separated by empty ones.
Roughly, I would say you can expect about one mistake per page on average (including accents and punctuation).
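
For anyone who prefers the terminal to the GUI for the "full book straight to text" step, here is a minimal sketch of how it could look. It is not what gImageReader does internally, just one way to call the tesseract command line directly. The function name ocr_book, the PNG file naming, and the default eng+fra language pair are assumptions for illustration; adapt as needed.

```shell
#!/usr/bin/env bash
# Sketch: OCR every page image in a directory into one text file.
# Assumptions (not from the post): pages are PNG files whose names sort
# in reading order, and the requested language data is installed.

ocr_book() {
    local scan_dir="$1"          # directory holding the page images
    local out_file="$2"          # combined text output
    local langs="${3:-eng+fra}"  # tesseract language(s), hypothetical default

    : > "$out_file"              # start with an empty output file
    local page
    for page in "$scan_dir"/*.png; do
        [ -e "$page" ] || continue   # nothing matched the glob: skip
        # Using "stdout" as the output base makes tesseract print the text
        tesseract "$page" stdout -l "$langs" >> "$out_file"
        printf '\n' >> "$out_file"   # blank line between pages
    done
}
```

Something like `ocr_book ./scans book.txt fra` would then give you one text file to proofread. You still need to proofread it afterwards, of course.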

Cons

No italics and no anchors; both need to be set up manually.
Garbage output for fully white pages (?)

Free tip

If you have white text on a black background, Tesseract will give you a blank page. So open a terminal and run ImageMagick first with this command (adapt as needed), then proceed as usual.

Code:
convert name-image.jpg -channel RGB -negate output.jpg
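
If only some pages of a book are white-on-black, the same command can be wrapped so it is applied only where needed. This is a rough sketch under my own assumptions, not something tested on a whole library: the function name negate_if_dark is made up, and the 0.5 threshold on mean brightness is a guess that may need tuning for your scans.

```shell
# Sketch: negate a page only when it is mostly dark.
# Assumptions (not from the post): ImageMagick's "convert" is on PATH and
# a mean grey level below 0.5 means light text on a dark background.

negate_if_dark() {
    local src="$1"   # input image
    local dst="$2"   # output image
    local mean
    # Mean brightness of the greyscale image: 0 is black, 1 is white
    mean=$(convert "$src" -colorspace Gray -format '%[fx:mean]' info:)
    if awk -v m="$mean" 'BEGIN { exit !(m < 0.5) }'; then
        convert "$src" -channel RGB -negate "$dst"   # dark page: invert it
    else
        cp "$src" "$dst"                             # light page: keep as is
    fi
}
```

For example, `negate_if_dark name-image.jpg output.jpg` leaves a normal page untouched and inverts a dark one.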

Last edited by roger64; 07-23-2020 at 06:54 AM. Reason: had forgotten convert...