Quote:
Originally Posted by WarnerYoung
I'm not sure that's strictly true. It depends on how the PDF file was generated. Otherwise, a standard PDF reader wouldn't be able to let you select and copy its text, or search through the text in its files. Or am I missing something here?
Even with text-based PDFs, the file does not (necessarily) contain any information about words, paragraphs, etc. The characters are easy to extract (unless there are funny fonts involved), but joining hyphenated words at the end of a line, putting spaces where they belong, removing page numbers and headers, dealing with footnotes, putting columns in the right order, detecting paragraphs, etc. is a different matter.
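To make that concrete, here is a rough Python sketch of the kind of guesswork involved once you already have the raw lines out of the PDF. The function name reflow and all three heuristics are my own assumptions for illustration, not how any particular converter works: a digits-only line is treated as a page number, a blank line as a paragraph break, and a trailing hyphen as a line-break hyphen.

```python
import re


def reflow(lines):
    """Join extracted text lines into paragraphs using naive heuristics."""
    paragraphs = []
    current = ""
    for raw in lines:
        line = raw.strip()
        # Assumption: a line made only of digits is a page number.
        if re.fullmatch(r"\d+", line):
            continue
        # Assumption: a blank line marks a paragraph break.
        if not line:
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-"):
            # Assumption: a trailing hyphen always splits one word.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return paragraphs


sample = [
    "The characters are easy to ex-",
    "tract from the file.",
    "12",
    "",
    "A second para-",
    "graph.",
]
print(reflow(sample))
```

Every one of those guesses can be wrong. For example, reflow(["a well-", "known fact"]) gives "a wellknown fact", because nothing in the extracted text says whether the hyphen belongs to the word or to the line break. Headers, footnotes and multi-column layouts are not handled at all here.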