MobileRead Forums - View Single Post - Only Convert PDFs with embedded OCRed text to EPUB?

Geremia · 12-24-2012, 03:33 PM

Quote:

Originally Posted by dgatwood

Better to use something like pdftotext and see if it returns nothing. PDF files might contain both images *and* text, and I'm assuming you probably want to convert those as well.

Oh yes, that's a good point. What I wrote above only finds pure-text PDFs, not mixed text/image ones like the PDFs from Archive.org. I don't think I have many, if any, mixed text/image PDFs, but all my DJVU books are that way. PDFs from Google Books or HathiTrust are mostly images, but they do have a small amount of text for copyright, etc., so making a script ignore that would be more complex.