10-13-2010, 02:51 PM   #4
curstpriest
Device: Kindle DXG
I've found that virtually every PDF converter program on the net uses the same open-source program, pdftohtml: http://pdftohtml.sourceforge.net/

It uses Ghostscript to extract images, and it operates in one of two modes:

1) Extract all images and dump them inline into the file.
- Text comes out in paragraphs with random line breaks and looks very ugly; tables are not preserved.

2) Extract each page background as a whole image, and create each page as a table.
- All formatting is preserved.
- The HTML document looks almost identical to the PDF.

Method 2 looks good, but it won't work for ebooks because each page is tied to a fixed-size background image, so the text can't reflow.

So converters use method 1 instead (and no tables are preserved).
This is the same method that Acrobat 9 uses to export HTML 3.0.

Now, if your document has only a few tables, and you have a simple PDF with a few columns you want to reflow or change, you can use method 2.

Use pdftohtml (http://pdftohtml.sourceforge.net/) without images enabled.
Then convert the HTML to EPUB with tables enabled. You should get a very respectable document, with intact paragraphs that reflow and span multiple pages.
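
For anyone who wants to script this, here is a rough sketch of the two steps in Python. A few things here are my assumptions and not part of the recipe above: I'm reading "method 2" as pdftohtml's -c (complex output) switch and "without images" as -i, I add -noframes to get one HTML file, and I use Calibre's ebook-convert for the EPUB step since no converter is named. The helper names are made up, and the output file name (the base name plus .html) is what I'd expect with -noframes, so check it against your pdftohtml build.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftohtml_cmd(pdf_path, out_base, complex_mode=True, ignore_images=True):
    """Build the pdftohtml argument list (hypothetical helper; runs nothing)."""
    cmd = ["pdftohtml"]
    if complex_mode:
        cmd.append("-c")       # complex output, assumed to be "method 2"
    if ignore_images:
        cmd.append("-i")       # ignore images, assumed to be "without images"
    cmd.append("-noframes")    # one HTML file instead of a frameset
    cmd += [str(pdf_path), str(out_base)]
    return cmd


def pdf_to_epub(pdf_path, work_dir="."):
    """Run pdftohtml, then Calibre's ebook-convert (assumed converter)."""
    pdf_path = Path(pdf_path)
    out_base = Path(work_dir) / pdf_path.stem

    # Fail early with a readable message if either tool is missing.
    for tool in ("pdftohtml", "ebook-convert"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")

    subprocess.run(build_pdftohtml_cmd(pdf_path, out_base), check=True)

    # Assumption: with -noframes the HTML lands at <out_base>.html.
    html_path = Path(str(out_base) + ".html")
    epub_path = Path(str(out_base) + ".epub")

    # Default ebook-convert settings; no table-related option is set here.
    subprocess.run(["ebook-convert", str(html_path), str(epub_path)], check=True)
    return epub_path
```

Calling pdf_to_epub("manual.pdf") should leave manual.html and manual.epub in the current directory if both tools are installed. The command builder is kept separate so you can print the command and try it by hand first.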