Old 10-29-2006, 02:54 PM   #8
kickaha
Member
Posts: 11
Karma: 10
Join Date: Aug 2006
Location: Mid-Atlantic USA
Device: Sony PRS-500 / Palm Tungsten T5
pdftohtml

I have been trying out pdftohtml to see whether it could help generate more Reader-friendly documents, and I'm afraid it's not working well.

I haven't really studied the PDF file structure, but I'm beginning to think that, because it is a page description format, it simply doesn't record some niceties such as paragraphs.

Every tool I've tried so far treats each line of text in the PDF either as its own paragraph or as a line ending in a line-break with no paragraph markers (the latter is what pdftohtml does). Munging the output afterwards to re-group paragraphs is going to be a fairly manual operation. I'm going to try to work up an OOo macro that will let me select a block of text, remove all line-break and paragraph markers from the middle of it, and add a paragraph tag at the end. I'll still have to go through the entire document and rebuild it, but at least I'll be able to re-flow paragraphs to fit the Reader screen.
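For anyone who would rather script the re-grouping than do it by hand, here is a rough Python sketch of the same idea. It is not the OOo macro, and it rests on assumptions of mine: the converter output has already been reduced to one plain-text line per PDF line, and a paragraph ends either at a blank line or at a line that is noticeably shorter than the widest line and finishes with sentence punctuation. The names `reflow`, `to_html` and `short_ratio` are made up for this example, and the heuristic will certainly misjudge headings, lists and tables.

```python
import html


def reflow(lines, short_ratio=0.75):
    """Group hard-wrapped lines into paragraphs using a simple heuristic.

    Assumption: a paragraph ends at a blank line, or at a line that is
    shorter than short_ratio * (widest line) and ends with sentence
    punctuation. This is a guess about typical layout, not a PDF rule.
    """
    stripped = [ln.strip() for ln in lines]
    widths = [len(ln) for ln in stripped if ln]
    if not widths:
        return []
    full_width = max(widths)

    paragraphs, current = [], []
    for ln in stripped:
        if not ln:
            # A blank line always closes the paragraph in progress.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(ln)
        ends_sentence = ln.endswith((".", "!", "?", ":"))
        is_short = len(ln) < short_ratio * full_width
        if ends_sentence and is_short:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Whatever is left over becomes the final paragraph.
        paragraphs.append(" ".join(current))
    return paragraphs


def to_html(paragraphs):
    """Wrap each paragraph in a <p> tag, escaping HTML special characters."""
    return "\n".join("<p>{}</p>".format(html.escape(p)) for p in paragraphs)
```

Calling `to_html(reflow(text.splitlines()))` on the extracted text gives one `<p>` element per guessed paragraph. The result would still need a read-through, but fixing the occasional bad split should be quicker than rebuilding every paragraph by hand, and nothing has to leave the local machine.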

I think this will be acceptable for moderately sized PDFs, but not for book-length documents.

P.S.: I know there are web-based PDF conversion services, but a lot of my reading is proprietary information, and I can't toss it out on the internet to convert it.