Quote:
Originally Posted by HarryT
Could you not do it the same way that the text file clean-up tools work - treat two consecutive <br>'s as a paragraph break, and then delete all the others? That's all that springs to mind at present, I'm afraid!
Well, not all paragraphs are marked with two consecutive <br> tags. I converted Crime and Punishment from PDF to HTML with pdftohtml, and I don't see two consecutive <br>'s anywhere in the output.
I am thinking that if there is a period right before the <br> tag, that is the end of the paragraph. Of course it won't always be right (a sentence can happen to end at a line break in the middle of a paragraph), but that seems to be the best "guess."
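
Here is a rough Python sketch of that guess, in case it helps to see it spelled out. It is only an illustration: the function name br_to_paragraphs is made up, it is not part of pdftohtml or any clean-up tool, and it assumes the text between <br> tags is plain text with no other markup to worry about.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def br_to_paragraphs(html):
    """Group <br>-separated lines into paragraphs.

    Heuristic (a guess, not a rule): a line that ends with a period
    right before the <br> closes the current paragraph.
    """
    paragraphs = []
    current = []
    for piece in BR_TAG.split(html):
        line = piece.strip()
        if not line:
            continue  # skip empty pieces, e.g. after a trailing <br>
        current.append(line)
        if line.endswith("."):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Leftover text that never reached a period still gets kept.
        paragraphs.append(" ".join(current))
    return paragraphs


# Made-up sample input, not actual pdftohtml output.
sample = "This line wraps in the<br>middle of a sentence.<br>A new paragraph<br>starts here.<br>"
for p in br_to_paragraphs(sample):
    print("<p>" + p + "</p>")
```

It will merge or split paragraphs wrongly whenever a sentence happens to end exactly at a line break, and it ignores paragraphs that end with a question mark or a closing quote, so the endswith check would probably need extending for real text.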