MobileRead Forums - View Single Post

chaley · 01-16-2011, 07:33 AM

Although I can't know without seeing the PDF (actually, the PostScript in the PDF), I am virtually certain that the text is being placed on the image using absolute positioning. This is the same method that multi-column layouts use to place text on the page so that it is visually correct.
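To make the absolute-positioning point concrete, here is a minimal sketch in Python. It assumes you have already pulled text fragments with their page coordinates out of the PDF with some tool; the `Fragment` type, the `visual_order` function, and the 2-point tolerance are all made up for illustration and are not part of calibre or any PDF library.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """Hypothetical text fragment: position in points plus its text."""
    x: float
    y: float  # PDF-style coordinates: y grows upward from the bottom of the page
    text: str


def visual_order(fragments: List[Fragment], line_tolerance: float = 2.0) -> List[str]:
    """Rebuild lines in top-to-bottom, left-to-right order from positions alone."""
    # Highest y first (top of the page), then leftmost first.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))
    lines: List[List[Fragment]] = []
    for frag in ordered:
        # Fragments whose baselines are within the tolerance share a line.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


# Stream order: the second word comes first, and the heading comes last.
stream = [
    Fragment(150, 700.0, "world"),
    Fragment(72, 700.5, "Hello"),
    Fragment(72, 730.0, "Heading"),
]
print(visual_order(stream))
```

Note that this naive sort only works for a single column. On a multi-column page it would interleave the columns line by line, which is exactly why that kind of layout is hard to convert.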

Back in the days when I was doing raw PostScript, I saw documents where the order of the text in the document had zero relationship with the visual order. For example, some PostScript generators do bold and shadow by laying the same character down twice, a point or two apart. If you look at the extracted text, you see each character twice. Others emit headings after the body text, then reposition them so that they print in the right place. I have even seen some where every other line was rendered backwards to aid with justification and to avoid "rivers of white". Capturing text from such a document would be a challenge.
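As a rough illustration of the double-strike case, the sketch below drops a character when the same character has already been placed within a couple of points of it. The `Glyph` type, the `drop_double_strikes` name, and the 2-point threshold are assumptions for the example, not anything a real converter is known to use.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    """Hypothetical single character with its position in points."""
    x: float
    y: float
    char: str


def drop_double_strikes(glyphs: List[Glyph], max_offset: float = 2.0) -> List[Glyph]:
    """Keep the first copy of each character; drop near-identical re-strikes."""
    kept: List[Glyph] = []
    for g in glyphs:
        duplicate = any(
            k.char == g.char
            and abs(k.x - g.x) <= max_offset
            and abs(k.y - g.y) <= max_offset
            for k in kept
        )
        if not duplicate:
            kept.append(g)
    return kept


# "Bold" drawn as fake bold: every character struck twice, one point apart.
fake_bold = [
    Glyph(100, 500, "B"), Glyph(101, 499, "B"),
    Glyph(107, 500, "o"), Glyph(108, 499, "o"),
    Glyph(113, 500, "l"), Glyph(114, 499, "l"),
    Glyph(116, 500, "d"), Glyph(117, 499, "d"),
]
print("".join(g.char for g in drop_double_strikes(fake_bold)))
```

The threshold is a guess and can misfire: a genuine double letter in a very small or narrow font (the "ll" in "fill", say) may sit closer together than two points and would be collapsed as well.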

The thing to remember with PDF: what you see on the page may have nothing to do with the flow of text in the source. The more complex the formatting, the more likely this is to be true.
