06-03-2010, 11:03 PM   #25
Fat Abe
Man Who Stares at Books
The first 7 lines specified in the XML file are:


<text top="45" left="130" width="184" height="18" font="0">C H A P T E R S I X </text>
<text top="205" left="310" width="5" height="18" font="0"> </text>
<text top="591" left="209" width="34" height="14" font="2"> 88 </text>
<text top="221" left="79" width="289" height="25" font="3">THE JOURNEY FROM </text>
<text top="248" left="106" width="236" height="25" font="3">PLATFORM NINE </text>
<text top="275" left="66" width="315" height="25" font="3">AND THREE-QUARTERS </text>
<text top="298" left="112" width="3" height="17" font="4"> </text>

As presented, the page number 88 appears on the 3rd line above, but it is actually the last line of page 1. I have not looked at the source code for pdfreflow, but the line order it should have decoded from the XML follows the top positions: 45, 205, 221, 248, 275, etc. However, the line heights of the sequence:

THE JOURNEY FROM
PLATFORM NINE
AND THREE-QUARTERS

cause the rendered sequence to be


THE JOURNEY FROM
AND THREE-QUARTERS
PLATFORM NINE
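
The ordering by top position described above can be sketched in Python. This is only an illustration of the idea, not pdfreflow's actual code; the <page> wrapper element and the function name sort_by_position are assumptions made for the example, and the seven <text> elements are the ones quoted earlier.

```python
import xml.etree.ElementTree as ET

# The seven <text> elements quoted in the post, wrapped in a
# hypothetical <page> root so the snippet parses on its own.
SAMPLE = """<page>
<text top="45" left="130" width="184" height="18" font="0">C H A P T E R S I X </text>
<text top="205" left="310" width="5" height="18" font="0"> </text>
<text top="591" left="209" width="34" height="14" font="2"> 88 </text>
<text top="221" left="79" width="289" height="25" font="3">THE JOURNEY FROM </text>
<text top="248" left="106" width="236" height="25" font="3">PLATFORM NINE </text>
<text top="275" left="66" width="315" height="25" font="3">AND THREE-QUARTERS </text>
<text top="298" left="112" width="3" height="17" font="4"> </text>
</page>"""


def sort_by_position(xml_string):
    """Return the <text> elements ordered by top, then left (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    elements = root.findall("text")
    return sorted(elements, key=lambda e: (int(e.get("top")), int(e.get("left"))))


ordered = sort_by_position(SAMPLE)
tops = [int(e.get("top")) for e in ordered]

for element in ordered:
    print(element.get("top"), repr(element.text))
```

Sorting purely on top moves the page number (top="591") to the end and leaves the three title lines in reading order, whatever their heights are.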

Just manually edit the XML file and change the font value from 3 to 2 in these three lines (font="3" becomes font="2"), and they will be in order again. Also manually reorder the lines at top="299" and top="591". At top="463", there is a line height jump of 20 instead of the usual +16, due to an oversized font.
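
Spotting an irregular jump like the one at top="463" could be automated along these lines. This is a rough sketch under assumptions: the function name find_spacing_jumps is hypothetical, and the sample list of top values is made up for illustration (only the 463 position and the 16 versus 20 spacing come from the file described above).

```python
from collections import Counter


def find_spacing_jumps(tops):
    """Return (top, gap) pairs where the gap to the previous line
    differs from the most common gap (hypothetical helper)."""
    ordered = sorted(tops)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return []
    # Treat the most frequent gap as the "usual" line spacing.
    usual = Counter(gaps).most_common(1)[0][0]
    return [(top, gap) for top, gap in zip(ordered[1:], gaps) if gap != usual]


# Made-up top values: regular +16 spacing with one +20 jump landing on 463.
sample_tops = [411, 427, 443, 463, 479, 495]
jumps = find_spacing_jumps(sample_tops)
print(jumps)
```

A re-flow tool would still have to decide what each flagged jump means (a paragraph break, a heading, or just an oversized font), which is the kind of correction that has to be done by hand here.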

After analyzing the XML file (which is a product of pdftohtml), I can sympathize with the developers who are working on PDF re-flowers. They seem to have to do some form of layout decoding, sorting, and correction to produce a perfect result.
Attached Files
File Type: xml HP2page.xml (6.8 KB, 610 views)

Last edited by Fat Abe; 06-03-2010 at 11:08 PM.