MobileRead Forums - View Single Post

dgatwood · 04-01-2014, 10:32 PM

Quote:

Originally Posted by Toxaris

That tool does not exist at the moment. Apparently it is a holy grail to many. PDF is your worst format to start from.

You're inclined to understatement. The question is roughly like asking, "How can I zoom in and enhance like they do on CSI?" or "How can I copy the text of my research paper from a photograph of the screen?", and for precisely the same reason: you can't readily extract information that isn't there.

As mrmikel pointed out, a PDF file basically consists of... at best, a series of strings, or at worst, a series of individual glyphs, along with font information and the location where each glyph or string should be drawn on the page. You don't have paragraphs, and you may or may not even have entire lines. This is why copy and paste from a PDF is notoriously error-prone.
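
To make that concrete, here is a rough sketch of what an extractor has to do just to get lines back out of positioned glyphs. Everything in it is hypothetical: the Glyph record, the rebuild_lines function, and the two thresholds are made up for illustration, and it assumes each glyph has already been decoded to a character. Real extractors are far more involved, and this still does nothing about paragraphs, columns, or hyphenation.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # decoded character (assumes a usable Unicode mapping exists)
    x: float      # left edge, in points
    y: float      # baseline, in points
    width: float  # advance width, in points


def rebuild_lines(glyphs, line_tol=2.0, space_gap=1.5):
    """Guess at text lines from glyphs given in arbitrary draw order.

    line_tol and space_gap are illustrative thresholds, not standard values.
    """
    lines = []  # each entry is [baseline_y, [glyphs on that line]]
    # The PDF y axis points up, so sort descending to read top of page first.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])

    result = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)  # left to right within a line
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # No space glyph is guaranteed to exist, so infer one from the gap.
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "
            text += cur.char
        result.append(text)
    return result


# Glyphs listed out of reading order, as a PDF is free to draw them.
sample = [
    Glyph("o", 10, 686, 6), Glyph("H", 10, 700, 6), Glyph("y", 26, 700, 6),
    Glyph("i", 16, 700, 3), Glyph("o", 32, 700, 6), Glyph("k", 16, 686, 6),
]
print(rebuild_lines(sample))  # ['Hi yo', 'ok']
```

Note that the space in the first line is a guess based on a gap wider than space_gap. Pick the threshold badly and you get the words-without-spaces effect, or spaces in the middle of words.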

One of the most hilarious examples of PDF's inadequacy that I've seen involved Apple's developer documentation PDFs from a few years back. In some PDF readers (notably, Apple's Preview prior to about OS X v10.8), depending on how you selected text, you would sometimes select the words, but not the spaces between them. You can probably imagine how much fun that was.

Worse, depending on how the PDF was created, there's no guarantee that it contains the mapping information needed to convert glyph IDs back into Unicode code points. If it doesn't, then copying text from the PDF could return nothing, random garbage, or anything in between. So in that case, the question is more like asking how to retrieve your research paper from a photo of a Microsoft Word BSOD....
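
As a toy illustration of the mapping problem (a sketch, not how any particular reader does it): the page content below is just a list of glyph IDs from a subsetted font, and the dictionary stands in for the font's optional ToUnicode table. The decode function, the glyph IDs, and the table are all invented for the example.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character


def decode(glyph_ids, to_unicode):
    """Map glyph IDs to text, marking any ID that has no known mapping."""
    return "".join(to_unicode.get(gid, REPLACEMENT) for gid in glyph_ids)


# Hypothetical subset font: IDs are handed out in order of first use,
# so they bear no relation to character codes.
glyph_ids = [1, 2, 3, 3, 4]
to_unicode = {1: "h", 2: "e", 3: "l", 4: "o"}

print(decode(glyph_ids, to_unicode))  # hello

# Without the table, every glyph is unknown.
unknown = decode(glyph_ids, {})

# Pretending the IDs are character codes yields control characters, not text.
garbage = "".join(chr(gid) for gid in glyph_ids)
```

With the table, you get the text back. Without it, the best a reader can do is admit it doesn't know (unknown) or guess (garbage), which is about what you see when you paste from such a file.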