Hitch,
First, this is obviously output from Sigil, and even so, cleaning it up is fairly doable with regex.
You've basically got useless, indistinguishable <span> tags marking spaces and hyphens. I ran the following in my text editor:
Code:
- REMOVE <p class="MsoNormal sgc-\d+"><span class="sgc-\d+">\d+(&nbsp;)*<span class="sgc-\d+">SALLY WRIGHT</span></span></p>
- REMOVE SPACEclass="MsoNormal sgc-\d+"
- REMOVE </i><i>
- REPLACE </span><span class="sgc-\d+">ing WITH ing
- REPLACE </span><span class="sgc-\d+"> WITH SPACE
- REMOVE <span class="sgc-\d+">
- REMOVE </span>
in the order listed.
(SPACE is a single blank space. REMOVE means replace with nothing.)
I was left with one error (Cum berland) that was found by spell-check.
Now, that's particular to this text. If your other books are markedly similar, it could work on them too. The worst cases can be found via spell check, and extra spaces can be easily fixed via regex (search: \s\s+ replace: SPACE).
Write all this regex as a macro, and it will take one click to fix an entire book.
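If your editor doesn't do macros, the same list can be scripted. Here is a minimal Python sketch of the steps above, applied in the order listed. The function name clean_sigil_html and the RULES list are my own naming, the &nbsp; in the first pattern is an assumption about what sits between the page number and the name, and I collapse runs of plain spaces rather than \s\s+ so that line breaks are left alone. Treat it as a starting point for this particular book, not a general fix.

```python
import re

# (pattern, replacement) pairs, applied top to bottom in the order listed in the post.
RULES = [
    # Running header/footer paragraph: page number, optional &nbsp; entities
    # (assumed), then the author name.
    (r'<p class="MsoNormal sgc-\d+"><span class="sgc-\d+">\d+(?:&nbsp;)*'
     r'<span class="sgc-\d+">SALLY WRIGHT</span></span></p>', ''),
    # Strip the Word-derived class attribute, including its leading space.
    (r' class="MsoNormal sgc-\d+"', ''),
    # Merge adjacent italic runs.
    (r'</i><i>', ''),
    # Rejoin a split "ing" suffix before the generic boundary rule runs.
    (r'</span><span class="sgc-\d+">ing', 'ing'),
    # Every remaining span boundary stands in for a single space.
    (r'</span><span class="sgc-\d+">', ' '),
    # Drop whatever span tags are left.
    (r'<span class="sgc-\d+">', ''),
    (r'</span>', ''),
    # Final cleanup: collapse runs of spaces into one (newlines untouched).
    (r' {2,}', ' '),
]

def clean_sigil_html(html: str) -> str:
    """Apply every rule in order and return the cleaned markup."""
    for pattern, replacement in RULES:
        html = re.sub(pattern, replacement, html)
    return html

if __name__ == "__main__":
    sample = ('<p class="MsoNormal sgc-3"><span class="sgc-5">He</span>'
              '<span class="sgc-6">was</span><span class="sgc-7">walk</span>'
              '<span class="sgc-8">ing</span></p>')
    print(clean_sigil_html(sample))
```

Order matters here just as it does in the editor: the "ing" rule has to run before the generic boundary rule, or every split suffix becomes " ing".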
The one thing I would worry about is the SALLY WRIGHT header/footer -- is it consistent enough? Is there a pattern to the inconsistency?
Also, are there other recognizable prefixes/suffixes like ing?
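One way to hunt for such fragments is to tally what comes right after each span boundary in the raw Sigil output, before any cleanup. The sketch below is only an aid for eyeballing: the name fragment_counts is hypothetical, it assumes the same sgc-N class pattern as above, and the tally will include ordinary whole words too, so you still have to judge which entries are really split-off suffixes.

```python
import re
from collections import Counter

# Lowercase letter run that immediately follows a span boundary.
BOUNDARY = re.compile(r'</span><span class="sgc-\d+">([a-z]+)')

def fragment_counts(html: str) -> Counter:
    """Count the lowercase runs found right after each span boundary."""
    return Counter(BOUNDARY.findall(html))

if __name__ == "__main__":
    sample = ('<span class="sgc-1">walk</span><span class="sgc-2">ing</span>'
              '<span class="sgc-3">talk</span><span class="sgc-4">ing</span>'
              '<span class="sgc-5">was</span>')
    # List the most frequent candidates first.
    for fragment, count in fragment_counts(sample).most_common(20):
        print(count, fragment)
```

Short, very frequent entries that aren't words on their own are the ones worth adding as extra REPLACE rules ahead of the generic boundary rule.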
If you're getting different, dissimilar results in every book, then I'd suggest posting a section of the actual Word or Writer HTML, as there may be loss of better identifiable patterns in translation to Sigil.
Also, Jon may have suggested a good answer; a Book Designer HTML0 to Sigil pre-processor script was recently posted here in the forums that's supposed to improve importing.
cap