MobileRead Forums - View Single Post - no text extraction for pdf with images and OCR

BetterRed · 05-02-2013, 09:44 PM

Quote:

Originally Posted by fxp33

Thank you for testing it. I really thought the pdf was already OCRed with omnipage or another program, to allow copying of the text "behind" the image.

... So, for you, there is, at the moment, no specific parameter for Calibre to convert a bunch of pdf files of this kind (with the heuristics arguments to avoid doing by hand all the tidying up)?

Thanks again for any hint.

François

@François - I should have written - I think some pdf readers... include lightweight OCR...

I'm not aware that the 'text' can be inserted 'behind' the image within the PDF itself. But I am not as up to speed with these issues as I once was - so perhaps that is what's happening.

I ran your PDF through the MobiCreator PDF converter - I've put the output into an attached zip - it's interesting, even if not very useful - but it does at least contain the text.

I'm told that the Google on-line OCR PDF scanner is as good as or better than some of the free PDF OCR scanners, but I don't know if it does bulk scanning. Given that you're looking at 'old documents', it is possible that Google have already OCR'd them - maybe the University would know that.

I also rescanned the PDF with Omnipage. After some tweaks to the settings I was able to remove page numbers and improve some other things, but the output was still in need of 'tidying up'. I doubt there is a solution to that. Project Gutenberg uses volunteer proof readers to do the tidying up; I'm not sure what Google does.
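
For what it's worth, some of the mechanical tidying can be scripted before a human proof reads the result. The following is only a rough sketch, assuming the OCR output has been saved as plain text. The function name tidy_ocr_text and the rule that a line holding nothing but a short number is a page number are my own assumptions, not something Omnipage or Calibre provides, and real scans will need more rules than this.

```python
import re


def tidy_ocr_text(text: str) -> str:
    """Hypothetical helper: strip page-number lines and rejoin hyphenated words."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: a line that is only 1-4 digits, optionally wrapped
        # in dashes (e.g. "12" or "- 13 -"), is a page number.
        if re.fullmatch(r"-?\s*\d{1,4}\s*-?", stripped):
            continue
        kept.append(stripped)
    joined = "\n".join(kept)
    # Rejoin words split across a line break: "docu-\nments" -> "documents".
    # This also merges genuine hyphenated compounds that happen to break
    # at the end of a line, so the result still needs proofreading.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", joined)


if __name__ == "__main__":
    sample = "Old docu-\nments are hard\n12\nto read.\n- 13 -\nEnd"
    print(tidy_ocr_text(sample))
```

It won't catch misread characters or broken paragraphs, so it reduces the hand work rather than replacing it.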

BR