MobileRead Forums - View Single Post - Having trouble OCR-ing a 70MB pdf file

retiredbiker · 07-19-2024, 12:02 PM

Quote:

Originally Posted by Quoth

... and also the IA images need "cleaned" first.

Interesting that over the past few years my setup, using Tesseract with OCRFeeder as a front end, has become considerably better on old book images. Google has been developing it recently, and I understand a new AI/neural network bit has been added, but only for some detail, IIRC.

While old typesetting and generally poor images still cause many errors - especially punctuation - Tesseract can sometimes read words that I struggle to figure out. I rarely have to clean up IA images any more, unless they are really badly tilted, keystoned, or have something geometrically wrong. On one recent book, Tesseract was doing fine, but I had to clean up the images so I could read them for proofing!