pdftohtml
I have been trying out pdftohtml to see if it would work to help generate more Reader-friendly documents, and I'm afraid it's not working well.
I haven't really studied the PDF file structure, but I'm beginning to think that, as a page description format, it simply doesn't record niceties such as paragraphs.
All of the tools I've tried so far have treated every line of 'data' in the PDF file either as its own paragraph or, in the case of pdftohtml, as text with line breaks but no paragraph markers. Munging the output text afterwards to re-group paragraphs is going to be a fairly manual operation. I'm going to try to work up an OOo macro that will let me select a block of text, remove all line-break and paragraph markers from the middle of it, and add a paragraph tag at the end. I'll still have to go through the entire document and rebuild it, but at least I'll be able to re-flow paragraphs to fit the Reader screen.
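Until the macro exists, the same idea can be sketched in a few lines of Python. This is only a rough illustration, not the OOo macro itself: the function name `reflow_block` is made up, and it assumes the selected block is HTML-ish text where line breaks show up as newlines, `<br>` tags, or bare `<p>` tags.

```python
import re

# Markers to drop from inside the selected block: <p>, </p>, <br>, <br/>.
# Assumption: the tags carry no attributes; real converter output may differ.
_BREAK = re.compile(r"</?p\s*>|<br\s*/?>", re.IGNORECASE)


def reflow_block(block):
    """Collapse one selected block of text into a single paragraph."""
    # Turn every break marker into whitespace so words never run together.
    text = _BREAK.sub("\n", block)
    # split() with no arguments collapses newlines and repeated spaces.
    words = text.split()
    # Re-join on single spaces and close with one paragraph tag pair.
    return "<p>" + " ".join(words) + "</p>"


if __name__ == "__main__":
    sample = "All of the tools<br/>\nI've tried so far<br>\nsplit lines."
    print(reflow_block(sample))
```

It still needs a human to decide where each paragraph starts and ends, so it doesn't remove the manual pass through the document. It only makes each selection a one-step fix.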
I think this will be acceptable for moderately sized PDFs, but not for any book-length docs.
p.s.: I know there are web-based pdf conversion services, but a lot of my reading is proprietary information, and I can't toss it out on the internet to convert it.