09-20-2011, 03:24 PM   #7
jackie_w
@MacEvansCB,

My experience with PDF to HTML conversion may be of limited use to you, but I'll offer it anyway.

You could try using one of these utilities to convert your "nice PDFs" to XML format:
  • pdftohtml.exe (freeware, also used by Calibre, I believe)
  • pdf2xml.exe (freeware, used by Mobipocket Creator)
Personally, I prefer the latter, as pdftohtml sometimes loses italics.

The output XML contains positional (x, y) information for each line, namely its distance from the left edge and from the top of the page, so detecting paragraph indents is possible.
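To give a rough idea of the indent detection, here is a minimal Python sketch. It assumes each line of text comes out as a <text> element with integer "left" and "top" attributes inside a <page> element; those names, the sample data and the function find_paragraph_starts are illustrative assumptions, so check them against what your own tool actually produces. It treats the most common left offset on a page as the margin and flags any line starting noticeably further right as a paragraph start.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; real element and attribute names may differ per tool.
SAMPLE = """<pdf2xml>
<page number="1" height="1188" width="918">
<text top="100" left="72" width="400" height="14" font="0">It was a dark and stormy</text>
<text top="116" left="54" width="418" height="14" font="0">night, and the rain fell.</text>
<text top="132" left="54" width="200" height="14" font="0">Nobody was about.</text>
<text top="148" left="72" width="400" height="14" font="0">The next morning was</text>
<text top="164" left="54" width="150" height="14" font="0">brighter.</text>
</page>
</pdf2xml>"""


def find_paragraph_starts(xml_text, tolerance=3):
    """Return the text of every line that looks like an indented paragraph start."""
    root = ET.fromstring(xml_text)
    starts = []
    for page in root.iter("page"):
        # Collect (left offset, plain text) for each line on this page.
        lines = [
            (int(t.get("left")), "".join(t.itertext()))
            for t in page.iter("text")
        ]
        if not lines:
            continue
        # Assume the most frequent left offset is the normal text margin.
        margin = Counter(left for left, _ in lines).most_common(1)[0][0]
        # Anything clearly to the right of the margin counts as an indent.
        starts.extend(text for left, text in lines if left > margin + tolerance)
    return starts


if __name__ == "__main__":
    for line in find_paragraph_starts(SAMPLE):
        print(line)
```

The "most common offset" guess will be fooled by pages that are mostly verse, centred text or block quotes, which is one reason a per-book override is useful.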

If you have some programming ability, with work (quite a lot of work) you can write something to parse the XML and reconstruct chapter headings, paragraphs, scene-breaks, italics, bold, smallcaps, images and hyperlinks as you convert the XML to HTML.
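As a very reduced illustration of that kind of parser, the Python sketch below joins lines into <p> paragraphs and carries italics and bold through. It is only a sketch under assumptions: it supposes lines are <text> elements with a "left" attribute and inline <i>/<b> children, and the names inline_html, lines_to_html and the margin parameter are made up for this example. Chapter headings, scene-breaks, smallcaps, images, hyperlinks and words hyphenated across lines are not handled at all.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample; real element and attribute names may differ per tool.
DEMO = """<page number="1">
<text top="100" left="72">She said it was <i>impossible</i>,</text>
<text top="116" left="54">but he tried anyway.</text>
<text top="132" left="72">It worked.</text>
</page>"""


def inline_html(elem):
    """Serialise an element's content, keeping <i>/<b> and escaping plain text."""
    parts = [escape(elem.text or "", quote=False)]
    for child in elem:
        inner = inline_html(child)
        if child.tag in ("i", "b"):
            parts.append(f"<{child.tag}>{inner}</{child.tag}>")
        else:
            parts.append(inner)  # unknown inline tag: keep its text only
        parts.append(escape(child.tail or "", quote=False))
    return "".join(parts)


def lines_to_html(xml_text, margin, tolerance=3):
    """Group lines into paragraphs; an indented line starts a new paragraph.

    `margin` is the normal left offset for this particular PDF, supplied
    by hand rather than guessed.
    """
    root = ET.fromstring(xml_text)
    paragraphs = []
    for t in root.iter("text"):
        line = inline_html(t).strip()
        if not line:
            continue
        indented = int(t.get("left")) > margin + tolerance
        if indented or not paragraphs:
            paragraphs.append([line])
        else:
            paragraphs[-1].append(line)
    return "\n".join("<p>" + " ".join(p) + "</p>" for p in paragraphs)


if __name__ == "__main__":
    print(lines_to_html(DEMO, margin=54))
```

Passing the margin in by hand is deliberate here: it is the sort of PDF-specific knowledge that is easier to supply than to detect reliably.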

Even so, I have not found it to be a "single magic button" conversion process. Every PDF is different, and supplying a little specific knowledge about a particular PDF can make a big difference to the quality of the resultant HTML. Also, I haven't attempted to convert PDFs of technical manuals in this way, only novels.