Old 06-13-2010, 10:07 PM   #24
capidamonte
Not who you think I am...
Posts: 374
Karma: 30283
Join Date: Jan 2010
Location: Honolulu
Device: PocketBook 360 -- Ivory
Hitch,

First, this is obviously output from Sigil -- and cleaning it up is fairly doable with regex.

You've basically got useless, undifferentiable <span> tags marking spaces and hyphens. I ran the following in my text editor:

Code:
  1. REMOVE <p class="MsoNormal sgc-\d+"><span class="sgc-\d+">\d+(\&nbsp\;)*<span class="sgc-\d+">SALLY WRIGHT</span></span></p>
  2. REMOVE SPACEclass="MsoNormal sgc-\d+"
  3. REMOVE </i><i>
  4. REPLACE </span><span class="sgc-\d+">ing WITH ing
  5. REPLACE </span><span class="sgc-\d+"> WITH SPACE
  6. REMOVE <span class="sgc-\d+">
  7. REMOVE </span>
in the order listed.

(SPACE is a single blank space. REMOVE means replace with nothing.)
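
For anyone without a macro-capable text editor, the seven steps above can be sketched as a short Python script. This is only an illustration under assumptions: the function name `clean_sigil_html` and the `STEPS` list are made up for this example, and the patterns are copied from the list above (with the unneeded backslashes before `&` and `;` dropped). The order matters, so the pairs are applied top to bottom.

```python
import re

# (pattern, replacement) pairs, applied in the order listed in the post.
# An empty replacement means REMOVE; ' ' is the single blank SPACE.
STEPS = [
    # 1. Drop the running header/footer paragraph (page number + SALLY WRIGHT).
    (r'<p class="MsoNormal sgc-\d+"><span class="sgc-\d+">\d+(&nbsp;)*'
     r'<span class="sgc-\d+">SALLY WRIGHT</span></span></p>', ''),
    # 2. Strip the class attribute (and its leading space) from <p> tags.
    (r' class="MsoNormal sgc-\d+"', ''),
    # 3. Merge adjacent italic runs.
    (r'</i><i>', ''),
    # 4. Re-attach a split-off "ing" suffix to the preceding word.
    (r'</span><span class="sgc-\d+">ing', 'ing'),
    # 5. Any remaining span boundary stands for a space.
    (r'</span><span class="sgc-\d+">', ' '),
    # 6. and 7. Remove the leftover opening and closing span tags.
    (r'<span class="sgc-\d+">', ''),
    (r'</span>', ''),
]

def clean_sigil_html(html: str) -> str:
    """Apply each substitution to the whole text, one step after another."""
    for pattern, replacement in STEPS:
        html = re.sub(pattern, replacement, html)
    return html
```

Note that step 4 will also glue on any real word that starts with "ing" (for example "ingot"), so a spell-check pass afterwards is still worthwhile.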

I was left with one error (Cum berland) that was found by spell-check.

Now, that's particular to this text. If your others are markedly similar, it could work. Worst cases can be found via spell check, and extra spaces can be easily fixed via regex (search: \s\s+ replace: a single space).
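
The extra-space cleanup can be sketched the same way in Python. The function name `collapse_spaces` is just an illustrative choice. One caveat with the `\s\s+` pattern: `\s` also matches tabs and line breaks, so runs of blank lines get collapsed into a single space as well.

```python
import re

def collapse_spaces(text: str) -> str:
    # Replace every run of two or more whitespace characters with one space.
    return re.sub(r'\s\s+', ' ', text)
```

If line breaks need to survive, a pattern that matches only literal spaces (two or more in a row) would be the safer variant.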

Write all this regex as a macro, and it will take one click to fix an entire book.

One thing I would worry about is the SALLY WRIGHT header/footer -- is it consistent enough? Is there a pattern to the inconsistency?

Also, are there other prefixes/suffixes like ing that are recognizable?

If you're getting different, dissimilar results in every book, then I'd suggest posting a section of the actual Word or Writer HTML, as there may be loss of better identifiable patterns in translation to Sigil.

Also, Jon may have suggested a good answer; a Book Designer HTML0 to Sigil pre-processor script was recently posted here in the forums, and it's supposed to improve importing.

cap
Attached Files
File Type: txt demo-regexed.txt (2.9 KB, 229 views)