Thread: PDF line unwrap
Old 05-26-2010, 05:18 PM   #15
miquel
Location: Heidelberg, Germany
Device: Amazon Kindle 2
OK Kovid, I'd like to confirm a couple of things with you, please.
The new pdf engine:

1. Takes the PDF file and passes it to the C plugin implementation of PDF reflow. That returns an XML file with the PDF's draw commands (a PDF in XML, if you will)

2. PDFDocument takes the XML and generates the HTML that's used as the base for conversion

3. The rest of the ebook conversion turns the HTML into whatever other format is needed

My plan would then be to hack into PDFDocument, take the XML, do the unwrapping and header/footer detection, and generate the HTML there.
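
To make the idea concrete, here is a rough Python sketch of that step. It is only an illustration under assumptions: the XML layout (`<page>` elements holding `<text top="...">` lines), the sample data, and the function names `find_repeated`, `unwrap` and `xml_to_html` are all made up here, and the real pdfreflow output and PDFDocument code may look quite different. Header/footer detection is done by looking for lines that repeat at the same vertical position on most pages (digits are normalised so page numbers match), and unwrapping uses a simple "line ends with sentence punctuation" heuristic.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from html import escape

# Hypothetical XML layout; the real pdfreflow output may differ.
SAMPLE_XML = """
<document>
  <page number="1">
    <text top="20">My Book Title</text>
    <text top="100">This is a line that was wrap-</text>
    <text top="115">ped by the PDF layout and</text>
    <text top="130">continues here.</text>
    <text top="145">A new paragraph starts.</text>
    <text top="800">1</text>
  </page>
  <page number="2">
    <text top="20">My Book Title</text>
    <text top="100">Second page text runs</text>
    <text top="115">onto another line.</text>
    <text top="800">2</text>
  </page>
</document>
"""


def _key(elem):
    """Identify a line by position and text, with digits normalised
    so that page numbers such as "1" and "2" compare equal."""
    text = (elem.text or "").strip()
    return (elem.get("top"), re.sub(r"\d+", "#", text))


def find_repeated(root, min_fraction=0.8):
    """Return keys of lines that recur on at least min_fraction of pages
    (and on at least two pages): likely headers or footers."""
    pages = root.findall("page")
    counts = Counter()
    for page in pages:
        # Use a set so a line is counted once per page.
        counts.update({_key(t) for t in page.findall("text")})
    threshold = max(2, min_fraction * len(pages))
    return {key for key, n in counts.items() if n >= threshold}


def unwrap(lines):
    """Join wrapped lines into paragraphs. A paragraph is closed when a
    line ends with sentence punctuation; a trailing hyphen is treated
    as a word split across lines."""
    paragraphs, current = [], ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if current.endswith("-"):
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
        if current.endswith((".", "!", "?")):
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs


def xml_to_html(xml_text):
    """Drop repeated header/footer lines, unwrap the rest, emit HTML."""
    root = ET.fromstring(xml_text.strip())
    repeated = find_repeated(root)
    lines = [
        t.text or ""
        for page in root.findall("page")
        for t in page.findall("text")
        if _key(t) not in repeated
    ]
    body = "\n".join(f"<p>{escape(p)}</p>" for p in unwrap(lines))
    return f"<html><body>\n{body}\n</body></html>"


if __name__ == "__main__":
    print(xml_to_html(SAMPLE_XML))
```

Both heuristics are deliberately naive: a wrapped line that happens to end in a full stop will be split into two paragraphs, and a genuine hyphenated compound at a line end will lose its hyphen. A real implementation would probably also want to use line lengths, indentation and font information from the XML.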

Is that what you had in mind? Or did you intend the reflow plugin to, you know, reflow (i.e. unwrap) the PDF? Personally, I'd prefer pdfreflow to be a PDF-to-XML converter whose output we can then work on in Python.

Did I get it right? What did you have in mind?
Thanks!