Quote:
Originally Posted by tentimes
I am confused about why there is no program out there that can take the textual information in a pdf book, plus the index (bookmarks) and turn it into an indexed book.
|
A PDF document is, in effect, a program: its pages are described by instructions in a language derived from a restricted subset of PostScript, the page description language, which is itself a full-blown stack-based programming language. Extracting text from a PDF document is difficult because the text is not stored in dedicated sections of the file but scattered, in hard-to-predict ways, among the instructions that generate the page layout. Consider, for example, the following pseudocode fragments, which all print the same string, "book":
Quote:
PRINT "book"
PRINT "b" + "o" + "o" + "k"
PRINT "bo" + "o" + "k"
letters = { "b", "o", "o", "k" }
FOR letter IN letters
  PRINT letter
|
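As a rough, runnable illustration of the fragments above, here is a Python sketch. The names FRAGMENTS, run and outputs_and_literal_hits are made up for this post. Each fragment prints "book", yet a plain search for the literal string "book" in the source finds it in only the first one:

```python
import contextlib
import io

# Four tiny programs that all print "book", mirroring the pseudocode above.
FRAGMENTS = [
    'print("book")',
    'print("b" + "o" + "o" + "k")',
    'print("bo" + "o" + "k")',
    'for ch in ["b", "o", "o", "k"]:\n    print(ch, end="")\nprint()',
]


def run(source: str) -> str:
    """Execute one fragment and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue().strip()


def outputs_and_literal_hits(fragments):
    """Pair each fragment's output with whether "book" occurs literally in its source."""
    return [(run(src), "book" in src) for src in fragments]


results = outputs_and_literal_hits(FRAGMENTS)
for output, found_literally in results:
    print(output, found_literally)
```

The only dependable way to recover the text here is to run, or at least analyze, the program. Searching the source is not enough.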
There are countless other equivalent fragments, all different, that generate the same text. Each one contains the text, or pieces of it, in a different form, so extracting the text, or even locating it, is difficult. Something similar happens with the instructions in a PDF file, which is why a PDF conversion utility is really a program-analysis tool.
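To tie this back to PDF itself, here is a minimal Python sketch, not a real extractor. It assumes uncompressed content streams, plain ASCII strings with no escapes or nested parentheses, and only the two text-showing operators Tj (show one string) and TJ (show an array of strings mixed with numeric position adjustments). The sample streams and the names STREAMS and extract_text are hypothetical. Real files add compression, font-specific encodings and arbitrary placement order, all of which this ignores.

```python
import re

# Hypothetical, heavily simplified content streams that all draw "book".
STREAMS = [
    "BT /F1 12 Tf 72 720 Td (book) Tj ET",
    "BT /F1 12 Tf 72 720 Td [(b) 20 (oo) -15 (k)] TJ ET",
    "BT /F1 12 Tf 72 720 Td (bo) Tj (ok) Tj ET",
]

# Either "(text) Tj" or "[ ... ] TJ"; the operators are case-sensitive.
SHOW_OP = re.compile(r"\(([^()]*)\)\s*Tj|\[([^\]]*)\]\s*TJ")
# A parenthesised string literal inside a TJ array.
LITERAL = re.compile(r"\(([^()]*)\)")


def extract_text(stream: str) -> str:
    """Concatenate the strings shown by Tj/TJ, in stream order."""
    parts = []
    for match in SHOW_OP.finditer(stream):
        if match.group(1) is not None:
            # Tj: a single string operand.
            parts.append(match.group(1))
        else:
            # TJ: keep the strings, drop the numeric adjustments.
            parts.extend(LITERAL.findall(match.group(2)))
    return "".join(parts)


for stream in STREAMS:
    print(extract_text(stream))
```

Even in this toy case the literal "(book)" appears in only one of the three streams. The other two have to be interpreted before the word can be reassembled.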