MobileRead Forums

roger64 · 07-06-2021, 03:53 AM

Quote:

Originally Posted by Sarmat89

It does diacritics?
It does italics?
It strips headers/footers?
It recognizes custom words?

I have used Tesseract to OCR about a hundred books and still do. With a good-quality scan, the OCR quality is now so high that most pages come out with zero mistakes.

In French, it recognizes "accents circonflexes" and other diacritics like ü, in fact all the French accents. Apart from French, I have only tried it with English. I know that you can use it for many other languages if you install the corresponding language data.
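For readers who have not run it this way, here is a minimal sketch of selecting languages from a Python script. It assumes the `tesseract` command-line tool is on the PATH and that the French data (`fra.traineddata`) is installed; the file names `page_001.png` and `page_001` and both function names are made up for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Build the argument list for the tesseract command-line tool."""
    # Several languages can be combined with '+', for example "fra+eng".
    lang_arg = "+".join(languages)
    return ["tesseract", image_path, output_base, "-l", lang_arg]


def ocr_page(image_path, output_base, languages=("fra",)):
    """Run tesseract on one page image; it writes output_base + '.txt'."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, languages)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names; only prints the command, does not run OCR.
    print(build_tesseract_command("page_001.png", "page_001", ["fra", "eng"]))
```

Calling `ocr_page("page_001.png", "page_001", ["fra", "eng"])` would then run the actual OCR, provided both language files are installed.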

It does not do italics (it once did, and maybe will again).
It does not strip headers and footers.

In my experience, these two points are only minor drawbacks for most fiction books. During the checking phase, after the OCR process, you need a kind of "breathing time" anyway before going on to the next page.
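For anyone who would rather not delete running headers and page numbers by hand, a rough post-processing sketch follows. It is not something Tesseract does itself. It assumes each page's OCR text is already available as a string, and the function name `strip_running_lines` and the `min_share` threshold are invented for this example. The heuristic is simple: if the first or last non-empty line looks the same on many pages (ignoring digits), treat it as a header or footer.

```python
import re
from collections import Counter


def _normalize(line):
    # Replace digit runs so "Page 12" and "Page 13" compare as equal.
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_running_lines(pages, min_share=0.5):
    """Drop first/last lines that repeat across pages.

    pages: list of page texts. Blank lines are discarded as a side
    effect, so paragraph breaks inside a page are not preserved.
    """
    page_lines = [[ln for ln in page.splitlines() if ln.strip()] for page in pages]
    firsts = Counter(_normalize(lines[0]) for lines in page_lines if lines)
    lasts = Counter(_normalize(lines[-1]) for lines in page_lines if lines)
    # A line must recur on at least this many pages to count as running text.
    threshold = max(2, int(len(pages) * min_share))

    cleaned = []
    for lines in page_lines:
        if lines and firsts[_normalize(lines[0])] >= threshold:
            lines = lines[1:]
        if lines and lasts[_normalize(lines[-1])] >= threshold:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned
```

A heuristic like this can misfire, for instance on a page whose real last line is only a number, so the result still needs the same proofreading pass.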

Tesseract is not perfect, but it is perfectly usable now for most fiction books.