Old 12-19-2010, 01:22 PM   #5
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
There is no concept of a 'paragraph' in PDFs.

PDF unwrap works off punctuation; the HTML that pdftohtml generates doesn't provide any other clues as to what is and isn't a paragraph. Unfortunately, you just need to fix it up yourself after conversion.
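
To give a rough idea of what "based off of punctuation" means, here is a minimal sketch. It is not calibre's actual unwrap code; the function name `unwrap` and the rule "a line ending in . ! or ? closes a paragraph" are my own simplifying assumptions.

```python
import re

# Assumption: a line that ends in ., ! or ? (optionally followed by a
# closing double quote) is treated as the end of a paragraph.
SENTENCE_END = re.compile(r'[.!?]"?$')


def unwrap(lines):
    """Join hard-wrapped lines into paragraphs using only punctuation.

    Illustrative only: `unwrap` is a hypothetical helper, not calibre's API.
    """
    paragraphs = []
    current = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip empty lines; they carry no reliable meaning here
        current.append(text)
        if SENTENCE_END.search(text):
            # Line ends a sentence, so guess that the paragraph ends here too.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Trailing text with no closing punctuation still forms a paragraph.
        paragraphs.append(" ".join(current))
    return paragraphs
```

Any line that merely happens to end on a full stop gets treated as a paragraph break, even when the paragraph actually continues on the next line. That kind of false break is why the result still needs manual cleanup.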

There is a new PDF engine that will probably get released someday, which provides info such as indentation and spacing between lines. That could be used to determine paragraph boundaries. No telling when it will be ready, though.
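
Purely as an illustration of how indentation and line spacing could help, here is a sketch under invented assumptions. The `Line` record, its `indent` and `gap_above` fields, and the thresholds are all hypothetical; they are not the new engine's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    indent: float     # hypothetical: left offset of the line
    gap_above: float  # hypothetical: vertical space above the line


def group_paragraphs(lines, indent_step=10.0, gap_factor=1.5):
    """Start a new paragraph on an indented line or an unusually large gap.

    Illustrative only: thresholds are arbitrary guesses, not tuned values.
    """
    if not lines:
        return []
    base_indent = min(line.indent for line in lines)
    gaps = sorted(line.gap_above for line in lines)
    typical_gap = gaps[len(gaps) // 2]  # middle value as the normal spacing

    paragraphs = []
    current = []
    for line in lines:
        starts_new = (
            line.indent - base_indent >= indent_step
            or line.gap_above > typical_gap * gap_factor
        )
        if current and starts_new:
            paragraphs.append(" ".join(current))
            current = []
        current.append(line.text)
    paragraphs.append(" ".join(current))
    return paragraphs
```

With layout clues like these, a sentence that ends at the end of a line no longer forces a break, since the decision doesn't depend on punctuation at all.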