MobileRead Forums - View Single Post - Best method of converting PDFs (maintaining paragraphs)

curstpriest · 10-13-2010, 03:51 PM

I've found that virtually every PDF converter on the net uses the same open-source program, pdftohtml: http://pdftohtml.sourceforge.net/

It uses Ghostscript to extract images, and it operates in one of two modes:

1) Extract all images and dump them inline to the file, without preserving tables.
- Text comes out in paragraphs with random line breaks and looks very ugly; tables are not preserved.

2) Extract each page background as a whole image, and create each page as a table.
- All formatting is preserved.
- HTML document looks almost identical to PDF
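The random line breaks from mode 1 can sometimes be cleaned up after extraction. Below is a minimal sketch of one heuristic, not anything pdftohtml itself does: lines are joined until one ends in sentence-final punctuation, which is then treated as the end of a paragraph. The function name `unwrap_lines` is made up for this example, and the heuristic will guess wrong on headings or on lines that happen to end mid-paragraph with a period.

```python
import re

# Assumption: a line ending in . ! or ? (optionally followed by a closing
# quote or bracket) marks the end of a paragraph. This is only a heuristic.
_SENTENCE_END = re.compile(r"""[.!?]["')\]]?$""")


def unwrap_lines(text: str) -> str:
    """Join hard-wrapped lines into paragraphs separated by blank lines."""
    paragraphs = []
    current = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line always closes the paragraph being collected.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(line)
        if _SENTENCE_END.search(line):
            # The line looks like the end of a sentence: close the paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that never reached sentence-final punctuation.
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Run it on the plain text pulled out of the converted HTML, then check the result by eye before building the ebook.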

Method 2 looks good, but won't work for ebooks because the page background has a fixed size (no reflow).

Method 1 is used instead (but no tables are preserved).
This is the same method that Acrobat 9 uses to export HTML 3.0.

Now, if your document has limited tables, and you have a simple PDF with a few columns you want to reflow or change, you can use method 2.

Use http://pdftohtml.sourceforge.net/ without images enabled.
Then convert the HTML to EPUB with tables enabled. You should get yourself a very respectable document, with intact, flowing/reflowing paragraphs that span multiple pages.
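The two steps above could be scripted roughly like this. It is only a sketch under some assumptions: the post does not say which tool does the HTML to EPUB step, so Calibre's `ebook-convert` is used here as a stand-in; `-i` (ignore images) and `-noframes` (single HTML file) are pdftohtml options, and the output is assumed to land at `<base>.html`. No table-related option is set, since the post does not name one. The function names are made up for the example.

```python
import subprocess
from pathlib import Path


def pdftohtml_command(pdf: Path, out_base: Path) -> list[str]:
    # -i: ignore images; -noframes: write one HTML file instead of a frameset.
    return ["pdftohtml", "-i", "-noframes", str(pdf), str(out_base)]


def epub_command(html: Path, epub: Path) -> list[str]:
    # Assumption: Calibre's ebook-convert handles the HTML -> EPUB step.
    return ["ebook-convert", str(html), str(epub)]


def convert(pdf: Path) -> Path:
    """Run both steps for one PDF and return the path of the EPUB."""
    out_base = pdf.with_suffix("")   # book.pdf -> book
    html = pdf.with_suffix(".html")  # assumed pdftohtml output name
    epub = pdf.with_suffix(".epub")
    # check=True raises CalledProcessError if either tool fails.
    subprocess.run(pdftohtml_command(pdf, out_base), check=True)
    subprocess.run(epub_command(html, epub), check=True)
    return epub
```

Calling `convert(Path("book.pdf"))` needs both tools installed and on the PATH; the two command-building functions can be inspected on their own without running anything.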
