Quote:
Originally Posted by Starson17
I finally had a chance to try the OCR in Acrobat. As chaley says, it leaves multiple tiny images of the text, so the result as a pdf is highly readable - all you see are the original images of the text.
Highlighting and pasting into a txt document shows the OCR'd text only. In my tests, the results were pretty bad. It was only marginally readable as pure OCR'd text. Headings in a different, italicized font were completely unreadable. Some words were split up, etc.
I suspect there is a site somewhere that will tell you how to remove all the text images and replace them with the associated OCR'd true text. There must be some way to do it. I hoped I'd find such a feature in Acrobat, but so far, no luck. Even if I found it, it would take a lot of work to get cleaned up.
Thanks Starson,
Yep, I've had a similar experience.
I tried 2 different things - one was really really bad, the other one worked marginally.
In my first test I used Nuance Omnipage Professional. I have the older v16, so maybe the new v17 is better but I doubt it.
This tool did an OK enough job OCRing. The problem was that it was really, really dumb: it treated place names on a map as text to be OCRed and cleaned up. So you ended up with a copy where the text was fine and readable but the images were all mangled. Also, in my tests Omnipage made the files between 2 and 6 times bigger while mangling them. Ouch.
In my second test I did something I should have done from the start. I used Acrobat Standard edition to export the file as RTF, then used Scansoft PDF Create to convert it back to a PDF. When I then used the Reduce Size option in Acrobat Standard, it shrank from an original 59 MB to a final 12.7 MB, which is a nice improvement even though it's still big.
Final PDF looks real nice too, with one problem. The export to RTF step crops the right side of the page for some reason, probably related to page border settings, so I lost the right edge and a couple of letters on the right.
Looks promising, but not quite there yet.
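For anyone who wants to compare their own results against the size numbers from my second test, here is a quick Python sketch of the arithmetic. The function name size_reduction is just something I made up for illustration; the only real inputs are the 59 MB and 12.7 MB figures above.

```python
def size_reduction(original_mb: float, final_mb: float) -> tuple[float, float]:
    """Return (percent saved, shrink factor) for a file that went
    from original_mb down to final_mb. Hypothetical helper, not part
    of any PDF tool."""
    if original_mb <= 0 or final_mb <= 0:
        raise ValueError("sizes must be positive")
    percent_saved = (1 - final_mb / original_mb) * 100
    shrink_factor = original_mb / final_mb
    return percent_saved, shrink_factor


# Figures from the second test: 59 MB original, 12.7 MB after Reduce Size.
saved, factor = size_reduction(59.0, 12.7)
print(f"{saved:.1f}% smaller, about {factor:.1f}x reduction")
```

So the RTF round trip plus Reduce Size took off roughly 78% of the file. Compare that with the Omnipage run, which went the other way and made things 2 to 6 times bigger.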