01-26-2021, 08:09 PM   #5
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Ghitulescu
Some OCR softwares interpret/convert a spaced word as a suite of characters separated by blank spaces: like S w i t z e r l a n d. In some cases, these can be solved by hand (for instance only important concepts are widened), however, when entire paragraphs across the whole book are widened an automatized method would be very helpful.
I would tackle this in multiple passes.

But as JSWolf has stated, you have to be extremely careful not to merge letters/words that should stay separate. Quite often, books will have things like "Person B", "Project X", or "time y".

Example Sentence

Let's take this as an example:

Code:
<p>A decent example of S w i t z e r l a n d that I found within a G e r m a n example.</p>
Step 1

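Replace the space between two single letters with a temporary character, like '+' or '¬'. (Pick one that doesn't appear anywhere else in the book.)

BUT, you want to skip the single letters that are real words ("A", "a", or "I"), so this only matches the other letters: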

Search: \b([B-HJ-Zb-z]) ([B-HJ-Zb-z])\b
Replace: \1+\2

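Matches can't overlap, so a single Replace All only joins every other pair ("S+w i+t z+e r+l a n+d"). Run it until there are 0 replacements left (2 passes here), and you'll get: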

Spoiler:
Code:
<p>A decent example of S+w+i+t+z+e+r+l a n+d that I found within a G+e+r+m a n example.</p>
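If you'd rather script Step 1 than click Replace All over and over, here is a minimal Python sketch of the same search/replace. The function name `join_spaced_letters` is just something I made up for illustration. It repeats the substitution until a pass changes nothing, because regex matches can't overlap and one pass only joins every other pair.

```python
import re

# Step 1 pattern from the post: two lone letters (anything except A, a, I)
# separated by a single space.
STEP1 = re.compile(r"\b([B-HJ-Zb-z]) ([B-HJ-Zb-z])\b")


def join_spaced_letters(text: str) -> str:
    """Apply the Step 1 replacement until a pass makes no replacements."""
    while True:
        text, count = STEP1.subn(r"\1+\2", text)
        if count == 0:
            return text


sample = (
    "<p>A decent example of S w i t z e r l a n d that I found "
    "within a G e r m a n example.</p>"
)
print(join_spaced_letters(sample))
```

Things like "Person B" or "time y" are left alone, because the letter before the space is part of a longer word. Two real single letters in a row (like "x y") would still get joined, so the manual check later is still needed.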


Step 2

Then you want to match an "A", "a", or "I" that sits between an already connected letter and another single letter:

Search: (\+\w) ([aAI]) (\w)\b
Replace: \1+\2+\3

Spoiler:
Code:
<p>A decent example of S+w+i+t+z+e+r+l+a+n+d that I found within a G+e+r+m+a+n example.</p>
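The same Step 2 regex as a small Python sketch, in case you are scripting this. `join_vowel_gaps` is a made-up name, and the input string is assumed to already have Step 1 applied. The loop is only a safety net for words with several a/A/I gaps in a row.

```python
import re

# Step 2 pattern from the post: an already-joined letter ("+x"), then a lone
# a/A/I, then another lone letter.
STEP2 = re.compile(r"(\+\w) ([aAI]) (\w)\b")


def join_vowel_gaps(text: str) -> str:
    """Apply the Step 2 replacement until a pass makes no replacements."""
    while True:
        text, count = STEP2.subn(r"\1+\2+\3", text)
        if count == 0:
            return text


after_step1 = "S+w+i+t+z+e+r+l a n+d and G+e+r+m a n"
print(join_vowel_gaps(after_step1))
```

Note that a spaced word that starts with "a" (like "a n+d") is not touched, since there is no "+letter" in front of it. That is one of the cases left for the manual pass.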


Those 2 Regexes should get you 95%+ of the way there.

From there, you have to manually check/correct. (Apostrophes, accents, emphasized words that start with 'a', or other odd cases.)
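To make the manual check less painful, something like this rough Python sketch could list every joined run next to the word it would turn into, so the odd ones stand out. This is only an assumption about how you might want to review things. `list_joined_runs` is a hypothetical helper, not part of Sigil or Calibre.

```python
import re

# A "joined run" is two or more word characters glued together with '+'.
JOINED_RUN = re.compile(r"(?:\w\+)+\w")


def list_joined_runs(text: str) -> list[tuple[str, str]]:
    """Return unique (run, merged word) pairs in order of first appearance."""
    runs = dict.fromkeys(m.group(0) for m in JOINED_RUN.finditer(text))
    return [(run, run.replace("+", "")) for run in runs]


checked = "a n+d then G+e+r+m+a+n and again G+e+r+m+a+n"
for run, merged in list_joined_runs(checked):
    print(f"{run!r} -> {merged!r}")
```

A result like "nd" is a hint that a leading "a" got left behind and needs fixing by hand.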

Step 3

Once you've checked everything, you replace the temporary '+' with nothing (an empty Replace field, NOT a space). That will merge the letters back into words:

Search: \+
Replace: ***LEAVE THIS COMPLETELY BLANK***

Code:
<p>A decent example of Switzerland that I found within a German example.</p>
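Putting Steps 1 to 3 together, a minimal Python sketch could look like the following. The names (`unspace`, `sub_until_stable`) are mine, and the check for an existing '+' in the text is my own addition, since Step 3 would otherwise delete real plus signs too. It also skips the manual review between Step 2 and Step 3, so treat the output as something to proofread.

```python
import re

STEP1 = re.compile(r"\b([B-HJ-Zb-z]) ([B-HJ-Zb-z])\b")  # lone letters, not A/a/I
STEP2 = re.compile(r"(\+\w) ([aAI]) (\w)\b")            # a/A/I after a joined letter


def sub_until_stable(pattern: re.Pattern, repl: str, text: str) -> str:
    """Repeat a replacement until a pass makes no replacements."""
    while True:
        text, count = pattern.subn(repl, text)
        if count == 0:
            return text


def unspace(text: str) -> str:
    """Steps 1-3 from the post, with '+' as the temporary marker."""
    if "+" in text:
        # Assumption: refuse to run rather than eat real plus signs in Step 3.
        raise ValueError("text already contains '+', pick another marker")
    text = sub_until_stable(STEP1, r"\1+\2", text)
    text = sub_until_stable(STEP2, r"\1+\2+\3", text)
    return text.replace("+", "")  # Step 3: remove the marker


sample = (
    "<p>A decent example of S w i t z e r l a n d that I found "
    "within a G e r m a n example.</p>"
)
print(unspace(sample))
```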
Step 3 (Alternate)

Or, if you wanted to keep the emphasis, you can do something like this:

First replace "1 letter + plus sign + 1 letter" with a span:

Search: (\w)\+(\w)
Replace: <span class="emph">\1\2</span>

Spoiler:
Code:
<p>A decent example of <span class="emph">Sw</span>+<span class="emph">it</span>+<span class="emph">ze</span>+<span class="emph">rl</span>+<span class="emph">an</span>+d that I found within a <span class="emph">Ge</span>+<span class="emph">rm</span>+<span class="emph">an</span> example.</p>


Then tackle the dangling single letters at the end (the "+d" in Switzerland):

Search: <span class="emph">(\w+)</span>\+(\w)
Replace: <span class="emph">\1\2</span>

Spoiler:
Code:
<p>A decent example of <span class="emph">Sw</span>+<span class="emph">it</span>+<span class="emph">ze</span>+<span class="emph">rl</span>+<span class="emph">and</span> that I found within a <span class="emph">Ge</span>+<span class="emph">rm</span>+<span class="emph">an</span> example.</p>


Then keep merging the "emph spans followed by a plus sign" by running this until there are 0 replacements left:

Search: <span class="emph">(\w+)</span>\+<span class="emph">
Replace: <span class="emph">\1

Code:
<p>A decent example of <span class="emph">Switzerland</span> that I found within a <span class="emph">German</span> example.</p>
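Here are the three alternate regexes chained in a short Python sketch, assuming the text has already been through Steps 1 and 2. `wrap_emphasis` is a made-up name. Only the last regex needs the "run until 0 replacements" loop.

```python
import re

PAIR = re.compile(r"(\w)\+(\w)")
DANGLING = re.compile(r'<span class="emph">(\w+)</span>\+(\w)')
MERGE = re.compile(r'<span class="emph">(\w+)</span>\+<span class="emph">')


def wrap_emphasis(text: str) -> str:
    """Turn '+'-joined letters into one emph span per word."""
    # 1 letter + plus sign + 1 letter -> span
    text = PAIR.sub(r'<span class="emph">\1\2</span>', text)
    # dangling single letter at the end of a word (the "+d" in Switzerland)
    text = DANGLING.sub(r'<span class="emph">\1\2</span>', text)
    # merge neighbouring spans until nothing is left to merge
    while True:
        text, count = MERGE.subn(r'<span class="emph">\1', text)
        if count == 0:
            return text


joined = (
    "<p>A decent example of S+w+i+t+z+e+r+l+a+n+d that I found "
    "within a G+e+r+m+a+n example.</p>"
)
print(wrap_emphasis(joined))
```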
Then, I highly recommend running DiapDealer's "TagMechanic" (Sigil) or "Diap's Editing Toolbag" (Calibre) to flip those <span>s into <em>.

I wrote step-by-step instructions last year in "How do I change italic <i> shortcut to use <em> instead?".

This will ultimately get you the final outcome you want:

Code:
<p>A decent example of <em>Switzerland</em> that I found within a <em>German</em> example.</p>
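If you are scripting the whole thing and don't want to open the plugins, a plain regex stand-in for that last flip could look like this Python sketch. To be clear, this is NOT how TagMechanic or the Editing Toolbag work. It is only safe here because the spans from the previous steps contain nothing but letters, with no nested tags. `spans_to_em` is a hypothetical name.

```python
import re

# Only matches emph spans whose content is plain word characters.
EMPH_SPAN = re.compile(r'<span class="emph">(\w+)</span>')


def spans_to_em(text: str) -> str:
    """Replace simple <span class="emph">word</span> with <em>word</em>."""
    return EMPH_SPAN.sub(r"<em>\1</em>", text)


merged = (
    '<p>A decent example of <span class="emph">Switzerland</span> that I found '
    'within a <span class="emph">German</span> example.</p>'
)
print(spans_to_em(merged))
```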

Last edited by Tex2002ans; 01-26-2021 at 08:42 PM.