The first seven lines in the XML file are:
<text top="45" left="130" width="184" height="18" font="0">C H A P T E R S I X </text>
<text top="205" left="310" width="5" height="18" font="0"> </text>
<text top="591" left="209" width="34" height="14" font="2"> 88 </text>
<text top="221" left="79" width="289" height="25" font="3">THE JOURNEY FROM </text>
<text top="248" left="106" width="236" height="25" font="3">PLATFORM NINE </text>
<text top="275" left="66" width="315" height="25" font="3">AND THREE-QUARTERS </text>
<text top="298" left="112" width="3" height="17" font="4"> </text>
As presented, the page number 88 appears on the 3rd line above, even though it is actually the last line of page 1. I have not looked at the source code for pdfreflow, but the line order it should have decoded from the XML follows the top positions: 45, 205, 221, 248, 275, etc. However, the line heights of the sequence:
THE JOURNEY FROM
PLATFORM NINE
AND THREE-QUARTERS
cause the rendered sequence to be
THE JOURNEY FROM
AND THREE-QUARTERS
PLATFORM NINE
Just manually edit the XML file and change the font value from 3 to 2 on these lines (font="3" becomes font="2"), and they will be in order again. Then manually reorder the lines at top="299" and top="591". At top="463", the top value jumps by 20 instead of the usual +16 because of an oversized font.
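For anyone who would rather script the manual edits than do them by hand, here is a rough Python sketch. It assumes the text elements sit inside a page element (the wrapper in the sample is added here only so that it parses), and the helper names sort_by_top and set_font are made up for illustration. It only rewrites the XML; whether pdfreflow then renders the lines in the right order is not something this checks.

```python
import xml.etree.ElementTree as ET

# The seven lines quoted above, wrapped in an assumed <page> element.
SAMPLE = """<page>
<text top="45" left="130" width="184" height="18" font="0">C H A P T E R S I X </text>
<text top="205" left="310" width="5" height="18" font="0"> </text>
<text top="591" left="209" width="34" height="14" font="2"> 88 </text>
<text top="221" left="79" width="289" height="25" font="3">THE JOURNEY FROM </text>
<text top="248" left="106" width="236" height="25" font="3">PLATFORM NINE </text>
<text top="275" left="66" width="315" height="25" font="3">AND THREE-QUARTERS </text>
<text top="298" left="112" width="3" height="17" font="4"> </text>
</page>"""


def sort_by_top(page):
    """Reorder the <text> children of a page in place by (top, left)."""
    texts = page.findall("text")
    texts.sort(key=lambda t: (int(t.get("top")), int(t.get("left"))))
    for t in page.findall("text"):
        page.remove(t)
    page.extend(texts)  # re-append in sorted order
    return page


def set_font(page, old, new):
    """Change font="old" to font="new" on every <text>; return the count."""
    changed = 0
    for t in page.iter("text"):
        if t.get("font") == old:
            t.set("font", new)
            changed += 1
    return changed


page = ET.fromstring(SAMPLE)
sort_by_top(page)
print([int(t.get("top")) for t in page.findall("text")])
# [45, 205, 221, 248, 275, 298, 591]
print(set_font(page, "3", "2"))  # 3 heading lines changed
```

Sorting on (top, left) puts the page number 88 last, where it belongs, and the font change touches only the three heading lines in this sample.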
After analyzing the XML file (which is produced by pdftohtml), I can sympathize with the developers who work on PDF re-flowers. They apparently have to do some form of layout decoding and correction, as well as sorting, to produce a perfect result.
Last edited by Fat Abe; 06-03-2010 at 11:08 PM.