MobileRead Forums - View Single Post - no text extraction for pdf with images and OCR

wubuer · 12-15-2015, 08:22 AM

Quote:

Originally Posted by BetterRed

@François - I should have written - I think some pdf readers... include lightweight online OCR...

I'm not aware that the 'text' can be inserted 'behind' the image within the PDF itself. But I am not as up to speed with these issues as I once was - so perhaps that is what's happening.

I ran your PDF through the MobiCreator PDF converter - I've put the output into an attached zip - its interesting, even if not very useful - but it does at least contain the text

I'm told that the Google on-line OCR PDF scanner is as good or better than the some of the free PDF OCR scanners, but I don't know if it does bulk scanning. Given that you're looking at 'old documents', it is possible that Google have already OCR'd them - maybe the University would know that.

I also rescanned the PDF with Omnipage, after doing some tweaks to settings I was able to remove page numbers and improve some other things, but the output was still in need of 'tidying up'. I doubt there is a solution to that, Project Gutenberg uses volunteer proof readers to do the tidying up, I'm not sure what Google does.

BR

thanks for your information, it's useful.