MobileRead Forums - View Single Post

iodine9176 · 02-23-2010, 02:24 PM

Quote:

Originally Posted by charleski

Actually, I realised that Word's html export has a few other flaws. You'll probably want to run the html through HTML Tidy or something similar to fix them (mostly, Word fails to put quotes around attribute values). Notepad++'s TextFX plugin can do the HTML Tidy job for you.

Word does add a lot of needless fluff, like spans to define the language. Those are a pain to remove in Notepad++, as its regex engine doesn't handle newlines or non-greedy matches. Sigil, OTOH, has a regex engine that will remove them easily: set regular expression and minimal matching, then use Find string
<span xml:lang="EN-US" lang="EN-US">(.*)</span>
Replace string
\1
Sigil also automatically does the HTML Tidy xhtml conversion for you.
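
For anyone who wants to see what that find/replace does outside Sigil, here is a minimal Python sketch of the same idea. It is not Sigil's own code: the function name strip_lang_spans is made up for illustration, the "?" after ".*" stands in for the "minimal matching" checkbox, and re.DOTALL is added on the assumption that a span may run across a line break.

```python
import re

# Same pattern as the Find string above. The "?" makes the match minimal
# (non-greedy), and re.DOTALL lets "." cross line breaks inside a span.
SPAN_RE = re.compile(
    r'<span xml:lang="EN-US" lang="EN-US">(.*?)</span>',
    re.DOTALL,
)


def strip_lang_spans(html: str) -> str:
    """Drop the language span wrapper but keep the text inside it."""
    return SPAN_RE.sub(r"\1", html)
```

One caveat: a minimal match stops at the first closing </span> it finds, so a language span that contains another span inside it would be unwrapped incorrectly. It is worth checking the output on one chapter before trusting it on the rest.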

I agree that Sigil is good for tidying up and converting the html files. However, the problem is that I have many chapters in my doc, and it is hard to convert and tidy up the html files one by one. Is there any programmatic method that can do all the conversion and tidying work together?
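
To make the question concrete, this is roughly the kind of thing I have in mind, as an untested-in-practice sketch in Python. The name clean_folder, the "*.html" pattern and the utf-8 default are all my own assumptions (Word may well save the files in a different encoding, in which case the encoding argument would need changing), and the actual tidying step is left as a function that gets passed in.

```python
from pathlib import Path
from typing import Callable, List


def clean_folder(
    folder: str,
    clean: Callable[[str], str],
    pattern: str = "*.html",
    encoding: str = "utf-8",  # assumption: adjust if Word used another encoding
) -> List[str]:
    """Apply `clean` to every matching file in `folder`, rewriting in place.

    Returns the names of the files whose content actually changed.
    """
    changed = []
    for path in sorted(Path(folder).glob(pattern)):
        original = path.read_text(encoding=encoding)
        cleaned = clean(original)
        if cleaned != original:
            path.write_text(cleaned, encoding=encoding)
            changed.append(path.name)
    return changed
```

The idea would be to call it once on the folder of exported chapters with whatever tidy-up function is wanted, working on a copy of the files since it overwrites them.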