02-01-2008, 01:20 PM   #14
dcalder
Zealot
Posts: 127
Karma: 9856
Join Date: Dec 2007
Location: Ontario, Canada
Device: Sony PRS-300/Kindle Keyboard/iPad Mini
Anybody got a sample to test? I've got a sneaking suspicion that WordPerfect would handle it better and more easily - Reveal Codes is the WP user's friend. I've cleaned up plenty of ASCII text from mailing list posts by running it through WP. What I suspect would work for the PDF is to either open it in WP or cut-and-paste the text in, turn on Reveal Codes, see which codes appear at the ends of lines versus at the ends and beginnings of paragraphs, and go from there. Whether there's a blank line between paragraphs or the paragraphs are marked by indentation alone, there will be something unique about the coding that separates them. Search and replace that with some sort of unique indicator word or phrase. Then search and replace the remaining hard line feeds with the soft line feed code. One more search and replace turns the indicator back into the proper paragraph separation code, then give it a quick once-over to confirm that things look good.
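For anyone without WP handy, the same three-pass trick (mark the paragraph breaks, unwrap the line breaks, restore the paragraph breaks) can be roughed out in a few lines of Python. This is only a sketch of the idea, not what WP does internally: the `SENTINEL` string and the `unwrap_hard_breaks` name are made up for illustration, and it assumes paragraphs are separated either by a blank line or by leading indentation.

```python
import re

# Hypothetical indicator phrase; any string that never occurs in the text will do.
SENTINEL = "@@PARA@@"

def unwrap_hard_breaks(text: str) -> str:
    """Join hard-wrapped lines into paragraphs using a placeholder marker."""
    # Normalize Windows/old-Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Pass 1: mark paragraph boundaries (a blank line, or a newline
    # followed by indentation) with the indicator phrase.
    text = re.sub(r"\n[ \t]*\n+[ \t]*|\n[ \t]+", SENTINEL, text)
    # Pass 2: every newline still left is just a line wrap; make it a space.
    text = re.sub(r"[ \t]*\n", " ", text)
    # Pass 3: turn the indicator back into a real paragraph separator.
    return text.replace(SENTINEL, "\n\n").strip()
```

Calling `unwrap_hard_breaks(open("book.txt").read())` on pasted PDF text (the filename is just a placeholder) should give one line per paragraph with a blank line between paragraphs, which still wants the same quick once-over afterwards.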

At that point, I'd probably run a macro to just go ahead and do the HTML conversion (mainly just a series of search-and-replaces to swap WP codes for the HTML tags for bold, italic, underline, etc.), add in any desired extra HTML coding, then save out as plain text. Change the file extension from .txt to .html and you're good to go. Why not just let WP save as HTML, you may ask? Simple - the same reason that I highly recommend not letting Word save as HTML - they both do a lousy job and include way too much unnecessary junk.
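The macro idea boils down to a handful of search-and-replaces, so here is a rough Python stand-in. WP's formatting codes aren't plain text, so this sketch assumes the bold and italic runs have already been marked with `*asterisks*` and `_underscores_`; that convention and the `to_minimal_html` name are assumptions made purely for illustration, and paragraphs are assumed to be separated by blank lines.

```python
import html
import re

def to_minimal_html(text: str) -> str:
    """Convert blank-line-separated paragraphs to bare-bones HTML."""
    out = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue  # skip empty chunks from extra blank lines
        # Escape &, < and > so stray characters don't break the markup.
        para = html.escape(para, quote=False)
        # Assumed markers: *bold* and _italic_ (stand-ins for WP codes).
        para = re.sub(r"\*(.+?)\*", r"<b>\1</b>", para)
        para = re.sub(r"_(.+?)_", r"<i>\1</i>", para)
        out.append(f"<p>{para}</p>")
    return "\n".join(out)
```

Writing the result to a file with an .html extension gives clean markup with none of the extra junk; any additional tags can be added by hand afterwards.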