02-22-2012, 05:47 PM   #4
alecE
Evangelist
Posts: 412
Karma: 546196
Join Date: Mar 2009
Location: UK canal boat
Device: sony prs505, prs650, kobo Glo HD liseuses
I'm incompetent in Regex, so I have a fairly laborious procedure, which gets done in Notepad++ after any necessary scanning/OCR processes and cleaning up of line breaks.
(I prefer double quotes for speech, and single quotes for abbreviations, apostrophes etc.)
- insert <p> at the start of the first line;
- change all carriage-return/new-lines to </p>\r\n\r\n<p>;
- insert </p> at the end of the last line;
- change all <p>" to <p>&ldquo;;
- change all "</p> to &rdquo;</p>;
- change all ^"space to ^&rdquo;space (where ^ may be stop, comma, query or bang);
- change all ^space" to ^space&ldquo; (where ^ may be stop, comma, query, bang, colon or semi-colon);
- by now the instances of space-quote and quote-space should be few enough to permit individual search/replace with double or single quotes as required - several passes may be needed;
- run through, tracking down the last few instances of quotes, then do a mega replace of single quotes with &rsquo; for the abbreviations;
- sort out the en dashes and ellipses.
Tedious, but it gets me there in the end - a typical SF book of 8 signatures will take me 3 to 6 hours to read, correct and edit, i.e. from OCR-produced text file through to Sigil-ready html.
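
For anyone who would rather script it, the passes above could be sketched in Python roughly as follows. This is only a sketch, not what I actually run: the function names are made up for illustration, it assumes the line breaks are already cleaned up so that each line is one paragraph, and the en dash and ellipsis replacements are guesses (spaced hyphen and three full stops), since those depend on the text in hand.

```python
import re

def first_pass(text):
    """Wrap each line in <p>...</p> and convert the easy double quotes."""
    # Assumption: one paragraph per line; blank lines are simply dropped.
    paras = [ln.strip() for ln in text.splitlines() if ln.strip()]
    html = "\r\n\r\n".join("<p>" + p + "</p>" for p in paras)
    # Quote at the very start or very end of a paragraph.
    html = html.replace('<p>"', "<p>&ldquo;")
    html = html.replace('"</p>', "&rdquo;</p>")
    # Stop, comma, query or bang, then quote, then space: a closing quote.
    html = re.sub(r'([.,?!])" ', r"\1&rdquo; ", html)
    # Stop, comma, query, bang, colon or semi-colon, then space, then quote:
    # an opening quote.
    html = re.sub(r'([.,?!:;]) "', r"\1 &ldquo;", html)
    return html

def leftovers(html):
    """Positions of straight double quotes that still need checking by hand."""
    return [m.start() for m in re.finditer('"', html)]

def final_pass(html):
    """Run only once every remaining quote has been checked by hand."""
    html = html.replace("'", "&rsquo;")        # mega replace of single quotes
    html = html.replace("...", "&hellip;")     # assumed ellipsis form
    html = html.replace(" - ", " &ndash; ")    # assumed en dash form
    return html
```

Run first_pass on the cleaned-up text, work through whatever leftovers reports by hand, and only then run final_pass. It does not remove the manual checking, it just gets the bulk replacements out of the way.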

I find it pays to use named entities - it's particularly helpful when converting a text that has single quotes for direct speech into double-quoted speech marks. I suspect there are various magic formulas in Regex which could do the job as well. If I can find a few spare brain cells one day, I may try going down that route.
Bottom line - no easy solution