10-15-2011, 03:13 PM   #23
amoroso
Groupie
Posts: 185
Karma: 1004070
Join Date: Jul 2010
Location: Italy
Device: Kindle for Android, Google Play Books
Quote:
Originally Posted by tentimes
I am confused about why there is no program out there that can take the textual information in a pdf book, plus the index (bookmarks) and turn it into an indexed book.
A PDF document is essentially a software program: its page content consists of instructions closely modeled on a restricted subset of PostScript, a page description language that is itself a full-blown stack-based programming language. Extracting text from a PDF document is difficult because the text is not stored in dedicated sections of the file, but scattered in hard-to-predict ways among the instructions that generate the document layout. Consider, for example, the following pseudocode fragments, which all print the same "book" string:
Quote:
PRINT "book"

PRINT "b" + "o" + "o" + "k"

PRINT "bo" + "o" + "k"

string = { "b", "o", "o", "k" }
FOR c IN string
    PRINT c
There are countless other equivalent code fragments, all different, that generate the same text. Each fragment embeds the text, or pieces of it, in a different way, so even locating the text is hard, let alone extracting it. Something similar happens with the instructions in a PDF file. A PDF conversion utility is, in effect, a program-analysis tool.
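To make this a bit more concrete, here is a minimal Python sketch of the same idea using PDF-style text operators. `Tj` (show one string) and `TJ` (show an array of strings and spacing adjustments) are real PDF operators, but the three fragments are hand-written toy examples, and `extract_text` is a hypothetical helper of mine, not how any actual converter works. It assumes plain literal strings with no escapes, no nested parentheses, and no font encoding tricks.

```python
import re

# Toy content-stream fragments, hand-written for illustration only.
# Each one draws the same word in a different way.
FRAGMENTS = [
    "BT (book) Tj ET",                    # one string, one operator
    "BT (b) Tj (o) Tj (o) Tj (k) Tj ET",  # one operator per letter
    "BT [(bo) -20 (o) 15 (k)] TJ ET",     # array of pieces with spacing numbers
]

# A literal string without escapes or nested parentheses: (text)
STRING = re.compile(r"\(([^()\\]*)\)")
# Either a TJ array or a single Tj string, matched in stream order.
SHOW = re.compile(r"\[(?P<array>[^\]]*)\]\s*TJ|\((?P<single>[^()\\]*)\)\s*Tj")


def extract_text(stream: str) -> str:
    """Concatenate the strings shown by Tj and TJ operators, in order."""
    parts = []
    for match in SHOW.finditer(stream):
        if match.group("array") is not None:
            # Keep the string pieces of the array, drop the spacing numbers.
            parts.extend(STRING.findall(match.group("array")))
        else:
            parts.append(match.group("single"))
    return "".join(parts)


for fragment in FRAGMENTS:
    print(extract_text(fragment))
```

Running it prints "book" three times, once per fragment. Even this toy version has to recognize two operators and reassemble the pieces. A real extractor would also have to deal with escaped and hex strings, character codes that only make sense through the font's encoding, and pieces drawn out of reading order, which is where most of the difficulty lies.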