08-14-2023, 02:49 AM   #24
michaelbr
Quote:
Originally Posted by retiredbiker
I use Ubuntu.

For OCR, try OCRFeeder as a front end to tesseract. Tesseract is very accurate given a good image. I do a page at a time, defining the text area manually. It can then handle multiple columns, advertisements, "continued on page 99" and so on. OCRFeeder is very good at connecting the lines into correct paragraphs, dealing with end-of-line hyphens, and so on.
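For anyone running tesseract directly instead of through OCRFeeder, the paragraph-joining and end-of-line hyphen handling can be roughly approximated in Python. This is only a sketch and not how OCRFeeder does it: `join_lines` is a hypothetical helper, it assumes blank lines separate paragraphs, and it will wrongly merge a genuinely hyphenated word that happens to break at the end of a line.

```python
import re


def join_lines(text: str) -> str:
    """Join hard-wrapped OCR lines into paragraphs (hypothetical helper)."""
    paragraphs = []
    # Assumption: one or more blank lines mark a paragraph break.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        joined = ""
        for ln in lines:
            if joined.endswith("-"):
                # End-of-line hyphen: glue the two halves of the word together.
                joined = joined[:-1] + ln
            elif joined:
                joined += " " + ln
            else:
                joined = ln
        paragraphs.append(joined)
    return "\n\n".join(paragraphs)
```

The result still needs proofreading; the sketch only removes the most mechanical part of the cleanup.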

Might seem slow, but this is actually the quick part of the process...you will have to proofread and correct no matter what.

pdftopng will render the PDF pages to PNG images; that works better than OCRing the PDF itself.
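If the PDF-to-image step needs to be scripted, something like the following could work. It is a minimal sketch that assumes the xpdf `pdftopng` binary is on the PATH; the function names, the file names, and the 300 DPI default are illustrative assumptions, not settings from the post.

```python
import subprocess


def pdftopng_cmd(pdf_path: str, png_root: str, dpi: int = 300) -> list[str]:
    """Build the pdftopng command line (hypothetical helper).

    -r sets the rendering resolution in DPI; png_root is the prefix
    pdftopng uses for the page images it writes.
    """
    return ["pdftopng", "-r", str(dpi), pdf_path, png_root]


def render_pages(pdf_path: str, png_root: str, dpi: int = 300) -> None:
    """Run pdftopng and raise CalledProcessError if it fails."""
    subprocess.run(pdftopng_cmd(pdf_path, png_root, dpi), check=True)
```

Calling `render_pages("magazine.pdf", "page")` would then write one PNG per page with names starting with `page`.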

ImageMagick can tame image files that are too large and slow down tesseract. Scan Tailor Advanced and Unpaper may be useful; I find them black magic, but I use them if needed.
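One way to shrink oversized page images before OCR is ImageMagick's `-resize` with a geometry ending in `>`, which only scales down images larger than the given box. The sketch below assumes the ImageMagick 6 `convert` binary (on ImageMagick 7 the command is `magick`); the helper name and the 2000-pixel limit are arbitrary assumptions for illustration.

```python
import subprocess


def shrink_cmd(src: str, dst: str, max_px: int = 2000) -> list[str]:
    """Build a convert command that shrinks only oversized images.

    The trailing '>' in the geometry tells ImageMagick to resize only
    when the image is larger than max_px in width or height.
    """
    return ["convert", src, "-resize", f"{max_px}x{max_px}>", dst]


def shrink(src: str, dst: str, max_px: int = 2000) -> None:
    """Run the command; no shell is involved, so '>' needs no escaping."""
    subprocess.run(shrink_cmd(src, dst, max_px), check=True)
```

Shrinking too far will hurt recognition, so it is worth checking the OCR output on a sample page before batch-processing.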

If you want to try and use existing text, pdftohtml will sometimes fail while pdftotext will work. No idea why. If you use pdftotext, try the -layout option and get ready for a lot of regex to tame the spacing.
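As a starting point for the regex cleanup of layout-preserving pdftotext output, a first pass might look like this. It is a sketch under assumptions: `tame_spacing` is a hypothetical name, and collapsing runs of spaces throws away column alignment, so tables and multi-column pages need separate handling.

```python
import re


def tame_spacing(text: str) -> str:
    """First-pass whitespace cleanup for pdftotext output (hypothetical helper)."""
    # pdftotext separates pages with form feeds; treat them as paragraph breaks.
    text = text.replace("\f", "\n\n")
    # Collapse runs of spaces/tabs inside each line and trim the line ends.
    lines = [re.sub(r"[ \t]{2,}", " ", ln).strip() for ln in text.splitlines()]
    text = "\n".join(lines)
    # Reduce three or more newlines to a single blank line.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

Further rules (running headers, page numbers, broken hyphenation) depend on the particular document and are usually added one at a time while proofreading.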
Thanks for these tips. I will try to install these tools and learn how to use them.