09-17-2018, 02:31 PM   #90
Difflugia
Testate Amoeba
Posts: 3,049
Karma: 27300000
Join Date: Sep 2012
Device: Many Android devices, Kindle 2, Toshiba e755 PocketPC
Quote:
Originally Posted by sealbeater
Yes, they are. I have never seen a PDF that contains both txt and full page images. Please, attach one so we can view it.

OCR is the last thing you want, not the first.
I've attached two-page excerpts from three commercial PDF books that I've bought. You can decide whether or not they invalidate what you've said. In case anyone cares, I used The PDF Toolkit (pdftk) to extract the pages from the larger documents.

I'll note that the text you copy out of a PDF isn't guaranteed to match the text you see, because the mapping from a PDF font's character codes to its glyphs is not fixed. For example, the first page of the "Text only.pdf" file that I attached contains the Greek phrase ὁ υἱὸς τοῦ ἀνθρώπου. If I copy/paste that phrase, I get something far different: o" yi"oÁq toyÄ a! nurwpoy. That also happens in some English documents if the chosen font uses separate glyphs for certain ligatures ("ff" is common). It's also possible to completely remap a font, either intentionally to hinder copy/paste or simply as a programming expedient. In those cases, OCR will give a much better result than simple text extraction. It's further possible to restore accurate copy/paste to such a document by adding an embedded OCR text layer, even though there's already a "text" layer used to render the page.
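Here's a toy Python sketch of why the copy/paste comes out wrong. It is not how any particular PDF reader works, and the glyph table is made up for illustration (I haven't inspected the real font's encoding). The idea is only that the page stores character codes, the font decides what those codes look like, and text extraction falls back to the raw codes when nothing maps them back to Unicode. The last function shows the related ligature case using standard Unicode normalization.

```python
import unicodedata

# Hypothetical glyph table for a legacy-encoded Greek font: each code stored
# on the page selects a glyph that is drawn as the Greek character shown.
# These four entries are invented for the example.
GLYPHS = {
    "t": "τ",
    "o": "ο",
    "y": "υ",
    "Ä": "\u0342",  # combining circumflex (perispomeni) over the previous letter
}

def render(codes):
    """What the reader sees: every code is drawn with the font's glyph."""
    return unicodedata.normalize("NFC", "".join(GLYPHS[c] for c in codes))

def extract(codes, to_unicode=None):
    """What copy/paste returns.

    With a code-to-Unicode mapping the result matches what is drawn;
    without one, the raw codes come back unchanged.
    """
    if to_unicode is None:
        return codes
    return unicodedata.normalize("NFC", "".join(to_unicode[c] for c in codes))

def expand_ligatures(text):
    """Turn ligature characters such as U+FB00 back into plain letters."""
    return unicodedata.normalize("NFKC", text)

if __name__ == "__main__":
    page_codes = "toyÄ"                             # codes stored on the page
    print(render(page_codes))                       # what is displayed
    print(extract(page_codes))                      # copy/paste with no mapping
    print(extract(page_codes, to_unicode=GLYPHS))   # copy/paste with a mapping
    print(expand_ligatures("di\ufb00erent"))        # ligature "ff" expanded
```

When the mapping is missing or deliberately scrambled, there's nothing in the file for an extractor to recover the right characters from, which is why running OCR on the rendered page can do better.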
Attached Files
File Type: pdf Text only.pdf (36.8 KB, 181 views)
File Type: pdf Images only.pdf (199.6 KB, 192 views)
File Type: pdf Images and text.pdf (224.4 KB, 166 views)