Old 02-21-2010, 08:33 AM   #11
charleski
Device: PRS-505
Actually, I realised that Word's HTML export has a few other flaws, the main one being that it fails to put quotes around attribute values. You'll probably want to run the HTML through HTML Tidy or something similar to clean these up. Notepad++'s TextFX plugin can do the HTML Tidy job for you.

Word also adds a lot of needless fluff, like spans to define the language. Those are a pain to remove in Notepad++, as its regex engine doesn't handle newlines or non-greedy matches. Sigil, OTOH, has a regex engine that will remove them easily: set the regular expression and minimal matching options, with Find string
<span xml:lang="EN-US" lang="EN-US">(.*)</span>
Replace string
\1
Sigil also runs the HTML Tidy XHTML conversion for you automatically.
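
If you'd rather script the span removal outside Sigil, the same find/replace can be sketched in Python. This is only a rough equivalent, not what Sigil does internally: I'm assuming `(.*?)` together with `re.DOTALL` behaves like Sigil's minimal matching option, and the names `LANG_SPAN` and `strip_lang_spans` are made up for the example.

```python
import re

# Same pattern as the Find string above. (.*?) is the non-greedy form,
# standing in for the "minimal matching" option (an assumption), and
# re.DOTALL lets the match run across line breaks.
LANG_SPAN = re.compile(
    r'<span xml:lang="EN-US" lang="EN-US">(.*?)</span>',
    re.DOTALL,
)


def strip_lang_spans(html: str) -> str:
    """Replace each language span with just its contents."""
    return LANG_SPAN.sub(r"\1", html)


sample = (
    '<p><span xml:lang="EN-US" lang="EN-US">Hello,\nworld</span> and '
    '<span xml:lang="EN-US" lang="EN-US">more</span></p>'
)
print(strip_lang_spans(sample))
```

One caveat: because the match stops at the first closing tag, a language span that wraps another span will leave a stray `</span>` behind, so it's worth checking the result afterwards.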