MobileRead Forums - View Single Post

Difflugia · 09-17-2018, 02:31 PM

Quote:

Originally Posted by sealbeater

Yes, they are. I have never seen a PDF that contains both txt and full page images. Please, attach one so we can view it.

OCR is the last thing you want, not the first.

I've attached two-page excerpts from three commercial PDF books that I've bought. You can decide whether or not they invalidate what you've said. In case anyone cares, I used The PDF Toolkit to extract pages from the larger documents.

I'll note that PDF fonts are not fixed. For example, the first page of the "Text only.pdf" file that I linked contains the Greek phrase, ὁ υἱὸς τοῦ ἀνθρώπου. If I copy/paste that phrase, I get something far different: o" yi"oÁq toyÄ a! nurwpoy. That also happens in some English documents if the chosen font includes different glyphs for certain kerned pairs ("ff" is common). It's also possible to completely remap a font, either intentionally to hinder copy-paste or simply as a programming expedient. In those cases, OCR will give a much better result than simple text extraction. It's further possible to restore accurate copy/paste ability to such a document by adding the embedded text layer, even though there's already a "text" layer used to render the page.