02-23-2010, 02:24 PM   #13
iodine9176
Quote:
Originally Posted by charleski
Actually, I realised that Word's html export has a few other flaws. You'll probably want to run the html through HTML Tidy or something similar to fix them (mostly, Word fails to put quotes around attribute values). Notepad++'s TextFX plugin can do the HTML Tidy job for you.

Word does add a lot of needless fluff, like spans to define the language. Those are a pain to remove in Notepad++, as its regex engine doesn't handle newlines or non-greedy matches. Sigil, OTOH, has a regex engine that will remove them easily: set regular expression and minimal matching, Find string
<span xml:lang="EN-US" lang="EN-US">(.*)</span>
Replace string
\1
Sigil also automatically does the HTML Tidy xhtml conversion for you.
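As a rough illustration of the Find/Replace pair above outside Sigil, here is a minimal Python sketch. It assumes Python's `re` module with a non-greedy group `(.*?)` and the `DOTALL` flag behaves like Sigil's "minimal matching" option for this pattern; the names `LANG_SPAN` and `strip_lang_spans` are made up for the example.

```python
import re

# Pattern from the Find string above, with the group made non-greedy
# ((.*?) instead of (.*)) and DOTALL set so the match can cross newlines.
LANG_SPAN = re.compile(
    r'<span xml:lang="EN-US" lang="EN-US">(.*?)</span>',
    re.DOTALL,
)


def strip_lang_spans(html: str) -> str:
    """Replace each language span with its inner content (the \\1 replacement).

    Limitation: a regex cannot track nesting, so a language span that
    contains another <span> will be closed at the inner </span>.
    """
    return LANG_SPAN.sub(r"\1", html)


if __name__ == "__main__":
    sample = '<p><span xml:lang="EN-US" lang="EN-US">Hello</span> world</p>'
    print(strip_lang_spans(sample))
```

This only handles the exact attribute order and values shown in the Find string; other languages or attribute orders would need their own pattern.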
I agree that Sigil is good for tidying up and converting the html files. However, the problem is that I have many chapters in my doc, and it is hard to convert and tidy the html files one by one. Is there a programmatic method that can do all the conversion and tidying work together?
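One possible way to batch this, sketched in Python under some assumptions: the chapters are already exported as .htm/.html files in a single folder, and they are UTF-8 encoded (Word often exports windows-1252, so the `encoding` argument may need changing). `clean_folder` and its parameters are hypothetical names, and `clean` stands for whatever text-to-text tidy step you want applied to each file.

```python
from pathlib import Path
from typing import Callable, List


def clean_folder(
    src_dir: str,
    dst_dir: str,
    clean: Callable[[str], str],
    encoding: str = "utf-8",  # assumption: change to "cp1252" for typical Word exports
) -> List[str]:
    """Apply `clean` to every .htm/.html file in src_dir, writing results to dst_dir.

    Originals are left untouched. Returns the names of the files written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.iterdir()):
        # Skip subfolders and anything that is not an HTML export.
        if not path.is_file() or path.suffix.lower() not in {".htm", ".html"}:
            continue
        text = path.read_text(encoding=encoding, errors="replace")
        (dst / path.name).write_text(clean(text), encoding="utf-8")
        written.append(path.name)
    return written
```

For example, `clean_folder("chapters", "cleaned", my_tidy_function)` would process every chapter in one go, where `my_tidy_function` is any function that takes the html text and returns the cleaned text.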