06-14-2010, 02:58 AM   #25
Hitch
Bookmaker & Cat Slave
Posts: 11,503
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
Quote:
Originally Posted by capidamonte
Hitch,

First, this is obviously output from Sigil -- and at that, it's fairly doable with regex.

You've basically got useless, undifferentiable <span> tags marking spaces and hyphens. I ran the following in my text editor:

Code:
  1. REMOVE <p class="MsoNormal sgc-\d+"><span class="sgc-\d+">\d+(\&nbsp\;)*<span class="sgc-\d+">SALLY WRIGHT</span></span></p>
  2. REMOVE SPACEclass="MsoNormal sgc-\d+"
  3. REMOVE </i><i>
  4. REPLACE </span><span class="sgc-\d+">ing WITH ing
  5. REPLACE </span><span class="sgc-\d+"> WITH SPACE
  6. REMOVE <span class="sgc-\d+">
  7. REMOVE </span>
in the order listed.

(SPACE is a single blank space. REMOVE means replace with nothing.)

I was left with one error (Cum berland) that was found by spell-check.

Now, that's particular to this text. If your others are markedly similar, it could work. Worst cases can be found via spell check, and extra spaces can be easily fixed via regex (search: \s\s+ replace: a single space).

Write all this regex as a macro, and it will take one click to fix an entire book.
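
As a rough sketch of that macro idea, assuming Python's re module is used instead of a text-editor macro: the patterns below are copied from the numbered list above and applied in the same order, with the whitespace cleanup added as a final step. The names STEPS and clean_sigil_html are hypothetical, chosen only for illustration, and the patterns are specific to this one sample.

```python
import re

# Ordered (pattern, replacement) pairs mirroring steps 1-7 above.
# Order matters: the "ing" rule must run before the generic span rules.
STEPS = [
    # 1. Drop the page-number + author running header entirely.
    (r'<p class="MsoNormal sgc-\d+"><span class="sgc-\d+">\d+(&nbsp;)*'
     r'<span class="sgc-\d+">SALLY WRIGHT</span></span></p>', ''),
    # 2. Drop the Word-generated paragraph class (leading space included).
    (r' class="MsoNormal sgc-\d+"', ''),
    # 3. Merge adjacent italic runs.
    (r'</i><i>', ''),
    # 4. A span boundary directly before "ing" is a broken word: join it.
    (r'</span><span class="sgc-\d+">ing', 'ing'),
    # 5. Every other close/open span pair is treated as a word space.
    (r'</span><span class="sgc-\d+">', ' '),
    # 6-7. Remove whatever span tags are left.
    (r'<span class="sgc-\d+">', ''),
    (r'</span>', ''),
    # Extra cleanup: collapse runs of whitespace to one space.
    (r'\s\s+', ' '),
]


def clean_sigil_html(html: str) -> str:
    """Apply each substitution in order and return the cleaned markup."""
    for pattern, replacement in STEPS:
        html = re.sub(pattern, replacement, html)
    return html
```

For example, clean_sigil_html('<p class="MsoNormal sgc-3"><span class="sgc-5">walk</span><span class="sgc-7">ing</span><span class="sgc-5">home</span></p>') returns '<p>walking home</p>'. Anything that does not fit these patterns is left for spell check, as noted above.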

One that I would worry about would be the SALLY WRIGHT header/footer -- is it consistent enough? Is there a pattern to the inconsistency?

Also, are there other identifiable prefixes/suffixes like "ing" that are recognizable?

If you're getting different, dissimilar results in every book, then I'd suggest posting a section of the actual Word or Writer HTML, as more identifiable patterns may be getting lost in translation to Sigil.

Also, Jon may have suggested a good answer; a Book Designer HTML0 to Sigil pre-processor script was recently posted here in the forums that's supposed to improve importation.

cap
Hi, cap:

First, big-time thanks to you. Second, somehow, I managed to not provide the segment of code I thought I was providing. There are innumerable instances of the "Cumberland" problem in the full book (and in all the other books produced by this process): some closing-span/opening-span pairs should be spaces and some should be nothing, but not all of the "nothings" have recognizable suffixes. They're simply where the word was broken in the original justified text. Maybe a regex search and replace would work better than deleting all the "optional hyphens" in Word. Hmmm.
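
One possible way to approach the space-versus-nothing decision, sketched here under stated assumptions: check each span boundary against a word list and join the two fragments only when the joined form is a known word and the fragments are not both words on their own. This is a heuristic, not something tested on the actual books. The function name repair_boundaries and the known_words set are hypothetical, and the word list would have to come from whatever dictionary is available. Ambiguous pairs such as "in" + "to" are left as two words and would still need a manual look.

```python
import re

# A span boundary with letters directly on both sides of it.
BOUNDARY = re.compile(r'([A-Za-z]+)</span><span class="sgc-\d+">([A-Za-z]+)')


def repair_boundaries(html: str, known_words: set) -> str:
    """Replace letter-to-letter span boundaries with '' or ' '.

    known_words is assumed to hold lowercase dictionary words.
    """
    def decide(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        both_are_words = (left.lower() in known_words
                          and right.lower() in known_words)
        if joined.lower() in known_words and not both_are_words:
            return joined            # broken word: close the gap
        return left + ' ' + right    # otherwise assume a word space

    # Repeat until stable, because one pass skips a boundary whose
    # left-hand word was consumed by the previous match.
    while True:
        repaired = BOUNDARY.sub(decide, html)
        if repaired == html:
            return repaired
        html = repaired
```

Run before the generic span-stripping rules, this leaves all other tags alone, so italics and the like are not touched.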

The page headers--the Title and the Author headers--really concern me the least of my issues. It's the spans that are driving me daft. I can "fix" this, and have, by doing the whole bloody book manually--it's about a 3-hour job--but I'd rather not, mostly because once I've stripped all the formatting, I worry that I'll miss italicization or some other small thing. Of course, any time I can do something faster, I'm happy with that, also.

I'm going to try the Book Designer thing, but I admit that I don't see how a template will "fix" the spans issue, which is created entirely by Word in its attempts to interpret what it's been fed by the OCR. As I said, though, nothing ventured, nothing gained. If this doesn't work, I'm going to try your regex "fix" on the raw HTML file. (What HTML editor are you using? I seem to have a pretty endless series of mystery problems with word-wrapping in Crimson Editor: it apparently can't be "undone," and the editor apparently can't "see" wrapped words as contiguous for regex searches.)

THANKS, guys, your input keeps me from tearing my hair out,

Hitch