10-09-2010, 04:38 PM   #32
alecE
This may be of assistance - it's not very elegant (in fact it's totally lacking in elegance), but it works, eventually, for me:
In your favoured text editor (mine is NoteTab Lite) perform the following steps (note that the $ character is being used here to denote a space):
- replace all instances of ^P^P^P with ^P^P (triple para to double para), repeating until no triples remain
- replace all instances of ^P^P with ||
- replace all instances of ^P with $
- replace all instances of || with </p>^P^P<p>
- add a leading <p> at start of document and a </p> at the end
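
For anyone who would rather script it, the steps above can be sketched in Python. This is only a rough sketch, not what I actually run: the function name reflow_paragraphs is made up, "\n" stands in for ^P, and it assumes the text contains no "||" of its own. Trimming stray blank lines at the very start and end is an extra step not in the list above.

```python
import re


def reflow_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into <p> paragraphs (hypothetical helper)."""
    # Normalise Windows/old-Mac line endings so "\n" plays the role of ^P.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse any run of three or more breaks to exactly two; same result
    # as repeating the ^P^P^P -> ^P^P replacement until nothing changes.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Assumption: drop leading/trailing breaks so no empty <p></p> appears.
    text = text.strip("\n")
    # Park the real paragraph breaks behind a placeholder.
    text = text.replace("\n\n", "||")
    # Every remaining break is a hard line ending: turn it into a space.
    text = text.replace("\n", " ")
    # Restore the paragraph breaks as closing/opening tags.
    text = text.replace("||", "</p>\n\n<p>")
    # Add the leading <p> and the trailing </p>.
    return "<p>" + text + "</p>"


if __name__ == "__main__":
    sample = "First line\nwrapped here.\n\n\nSecond para\nalso wrapped.\n"
    print(reflow_paragraphs(sample))
```

If the source text already uses "||" somewhere, swap the placeholder for any string that does not occur in it.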

You should now have a text that is neatly broken into paragraphs with zero hard line endings. Then:

- replace <p>' with <p>&ldquo;
- replace '</p> with &rdquo;</p>
- replace .'$ with .&rdquo;$
- ditto for comma, colon, semi-colon, query and bang followed by '$
- replace $' (or $" depending on text) with $&lsquo;
- use the above process to identify the '$ (or "$) and replace with &rsquo;$
The previous two steps should clean up most quoted phrases (e.g. 'Marie Celeste') and plural possessives (e.g. survivors').
- replace .$' with .$&ldquo;
- ditto for comma, colon, semi-colon, query and bang followed by $'
- should be fairly safe now to replace all remaining instances of ' with &rsquo;
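
The quote clean-up can be sketched the same way. Again this is a rough sketch under assumptions: the name smarten_quotes is invented, an ordinary space stands in for $, and the text is assumed to use the straight single quote throughout (pass quote='"' for the other case). One deliberate change of order: the "punctuation, space, quote" step is run before the general "space, quote" step, because otherwise the general step would already have consumed every match.

```python
PUNCTUATION = ".,:;?!"  # full stop, comma, colon, semi-colon, query, bang


def smarten_quotes(html: str, quote: str = "'") -> str:
    """Replace straight quotes with HTML entities (hypothetical helper)."""
    # Quote at the very start or end of a paragraph is dialogue.
    html = html.replace("<p>" + quote, "<p>&ldquo;")
    html = html.replace(quote + "</p>", "&rdquo;</p>")
    for mark in PUNCTUATION:
        # Punctuation, quote, space: end of a stretch of dialogue.
        html = html.replace(mark + quote + " ", mark + "&rdquo; ")
        # Punctuation, space, quote: start of a new stretch of dialogue.
        # Assumption: done here, before the general space+quote step.
        html = html.replace(mark + " " + quote, mark + " &ldquo;")
    # Remaining space+quote opens a quoted phrase ('Marie Celeste').
    html = html.replace(" " + quote, " &lsquo;")
    # Remaining quote+space closes a phrase or is a plural possessive.
    html = html.replace(quote + " ", "&rsquo; ")
    # Whatever is left is treated as an apostrophe.
    return html.replace(quote, "&rsquo;")


if __name__ == "__main__":
    sample = ("<p>'I saw the 'Marie Celeste' today,' he said. "
              "'The survivors' boats weren't there.'</p>")
    print(smarten_quotes(sample))
```

Like the manual steps, this is not bullet-proof: a leading apostrophe after a space ('tis, for example) will come out as &lsquo; and needs fixing by hand.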

This is not bullet-proof, but does clean up the text reasonably well. I generally reckon on 1 to 2 hours doing the above plus other odds and sods before moving the text into Sigil, where the HTML entities will be translated into 'proper' text.
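
To check how the entities will read before going anywhere near Sigil, Python's standard html module can do a similar translation. This is just a preview outside Sigil, assuming Python 3.4 or later; the helper name preview_entities is made up.

```python
import html


def preview_entities(markup: str) -> str:
    """Turn entities such as &ldquo; into the characters they stand for."""
    # html.unescape handles the named HTML entities, including the four
    # quote entities used in the steps above.
    return html.unescape(markup)


if __name__ == "__main__":
    print(preview_entities("<p>&ldquo;Hi,&rdquo; she said.</p>"))
```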