MobileRead Forums - View Single Post - Converting Word-> HTML -> Epub

kovidgoyal · 07-18-2012, 07:31 AM

It doesn't matter how good your "understanding" of PDF is. The difficulty in converting PDF does not come from the obscurity of the format. It comes from the nature of the format. I have discoursed at length on that elsewhere, so I am not going to repeat myself in detail here. But suffice it to say that PDF is not a semantic format. A PDF file (typically) contains instructions that look like "draw character #1234 from font xyz at position (x, y) on the page". The PDF file (unless it is tagged) has no semantic info at all. It has no concept of semantic units like words, sentences, paragraphs, tables, lists, etc. That means that an attempt to convert it to HTML can follow one of two paths:

1) Use non-semantic HTML (i.e. just replicate the PDF drawing instructions with some form of absolutely positioned HTML)

2) Use statistical analysis to re-organize the text from the PDF into semantic units.

As far as (1) is concerned, there are already dozens of perfectly good tools that do this. However, the resulting HTML is not reflowable and is useless as far as small-screened devices are concerned.
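To make (1) concrete, here is a minimal sketch in Python of what such a tool does at its core. The `Glyph` record and the `glyphs_to_positioned_html` function are hypothetical names made up for illustration, and the sketch assumes the glyph positions have already been extracted from the PDF by some other means. It is not how any particular tool is implemented.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Glyph:
    """One 'draw this character here' instruction (hypothetical record)."""
    char: str
    x: float      # distance from the left edge of the page, in points
    y: float      # distance from the top edge of the page, in points
    font: str
    size: float   # font size in points


def glyphs_to_positioned_html(glyphs, page_width, page_height):
    """Emit one absolutely positioned <span> per glyph.

    Nothing here knows about words or paragraphs: the output simply
    replays the drawing instructions, so it cannot reflow.
    """
    spans = []
    for g in glyphs:
        style = (
            f"position:absolute;left:{g.x:.1f}pt;top:{g.y:.1f}pt;"
            f"font-family:{escape(g.font)};font-size:{g.size:.1f}pt"
        )
        spans.append(f'<span style="{style}">{escape(g.char)}</span>')
    page_style = (
        f"position:relative;width:{page_width:.1f}pt;"
        f"height:{page_height:.1f}pt"
    )
    return f'<div style="{page_style}">' + "".join(spans) + "</div>"
```

Because every coordinate is fixed to the original page size, shrinking the viewport does nothing except force the reader to pan and zoom.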

(2) suffers from the problems of statistical analysis. It can never be absolutely accurate. So it will misidentify text sections, sentences, words, headers, footers and so on, in some percentage of cases. Some tools that follow this approach and try to work with arbitrary PDFs are: pdftoxml (used by Amazon), pdftohtml from poppler (used by calibre), PDFMiner, and a couple of others. None of them work well on any significant subset of PDFs.
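A toy illustration of why (2) can never be fully accurate, again in Python. This is a deliberately simplified sketch, not the algorithm used by any of the tools named above; the `Glyph` record, the function names and the threshold values are all assumptions chosen for the example. It guesses line membership from vertical position and word boundaries from horizontal gaps.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """A positioned character (hypothetical record for this sketch)."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline position
    width: float  # advance width of the glyph


def group_into_lines(glyphs, y_tolerance=2.0):
    """Guess lines: glyphs whose y values are close are on the same line."""
    lines = []
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if lines and abs(lines[-1][-1].y - g.y) <= y_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    # Restore left-to-right reading order within each guessed line.
    return [sorted(line, key=lambda g: g.x) for line in lines]


def line_to_text(line, gap_threshold=1.5):
    """Guess word breaks: a horizontal gap above the threshold is a space."""
    out = [line[0].char]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_threshold:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


# Normally spaced text: the guess happens to be right.
normal = [Glyph("H", 0, 100, 6), Glyph("i", 6, 100, 3),
          Glyph("y", 12, 100, 5), Glyph("o", 17, 100, 5)]
print(line_to_text(group_into_lines(normal)[0]))      # Hi yo

# A letter-spaced heading: the same threshold now splits one word in two.
heading = [Glyph("A", 0, 50, 6), Glyph("B", 8, 50, 6)]
print(line_to_text(group_into_lines(heading)[0]))     # A B
```

Any fixed threshold that works for one document's spacing will be wrong for another's, and the same holds for every other guess (paragraph breaks, headers, footers, table cells) that has to be inferred from geometry alone.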

If you claim that your tool can convert arbitrary PDF into HTML losslessly, then it is, I suspect, an implementation of (1) and as such not very interesting.
