MobileRead Forums - View Single Post - Regular expression for removing blanks between letters

Tex2002ans · 01-26-2021, 08:09 PM

Quote:

Originally Posted by Ghitulescu

Some OCR software converts a letter-spaced word into a series of characters separated by blank spaces, like S w i t z e r l a n d. In some cases these can be fixed by hand (for instance, when only important concepts are spaced out); however, when entire paragraphs across the whole book are spaced out, an automated method would be very helpful.

I would tackle this in multiple passes.

But as JSWolf has stated, you have to be extremely careful not to combine letters/words that shouldn't be combined. Quite often, books will have things like "Person B" + "Project X" + "time y".

Example Sentence

Let's take this as an example:

Code:

<p>A decent example of S w i t z e r l a n d that I found within a G e r m a n example.</p>

Step 1

Replace the space between two single letters with a temporary character, like '+' or '¬'.

BUT, only join single letters that are NOT "A", "a", or "I", since those are real one-letter words:

Search: \b([B-HJ-Zb-z]) ([B-HJ-Zb-z])\b
Replace: \1+\2

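Run this until there are 0 replacements left. Each pass only joins non-overlapping pairs, so a single pass gives "S+w i+t z+e r+l a n+d", and the next pass fills in the gaps between those pairs.

After that, the only spaces left inside the two words are the ones around the lone "a": "S+w+i+t+z+e+r+l a n+d" and "G+e+r+m a n".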
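If you'd rather test Step 1 outside Sigil/Calibre, here is a minimal Python sketch of it. The function name `join_spaced_letters` is just something I made up for illustration; the pattern and replacement are the ones from Step 1, and the loop re-runs the replacement until nothing changes.

```python
import re

# Step 1 pattern: two single letters (anything except "A", "a", "I")
# separated by exactly one space.
STEP1 = re.compile(r"\b([B-HJ-Zb-z]) ([B-HJ-Zb-z])\b")

def join_spaced_letters(text: str) -> str:
    """Apply the Step 1 replacement repeatedly until 0 replacements are made."""
    while True:
        text, count = STEP1.subn(r"\1+\2", text)
        if count == 0:
            return text

sample = ("<p>A decent example of S w i t z e r l a n d that I found "
          "within a G e r m a n example.</p>")

first_pass = STEP1.sub(r"\1+\2", sample)   # one pass only joins pairs
after_step1 = join_spaced_letters(sample)  # repeated passes fill the gaps

print(first_pass)
print(after_step1)
```

The first line printed still has gaps between the pairs ("S+w i+t ..."); the second only has spaces left around the lone "a" in each word.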

Step 2

Then you want to match the "A", "a", or "I" between two already connected letters:

Search: (\+\w) ([aAI]) (\w)\b
Replace: \1+\2+\3
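As a rough Python sketch of Step 2 (assuming Step 1 has already been run until no replacements were left, so the input string below is that intermediate result):

```python
import re

# Step 2 pattern: a lone "a", "A", or "I" sitting between an
# already-connected letter ("+x") and another single letter.
STEP2 = re.compile(r"(\+\w) ([aAI]) (\w)\b")

after_step1 = ("<p>A decent example of S+w+i+t+z+e+r+l a n+d that I found "
               "within a G+e+r+m a n example.</p>")

after_step2 = STEP2.sub(r"\1+\2+\3", after_step1)
print(after_step2)
```

Both words should now be fully connected with '+', while the real words "A", "I", and "a" in the sentence are left alone.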


Those 2 Regexes should get you 95%+ of the way there.

From there, you have to manually check/correct. (Apostrophes, accents, emphasized words that start with 'a', or other odd cases.)
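To see the kind of leftovers you'd have to fix by hand, here is a small Python sketch that runs both regexes on two made-up test strings (the strings and the helper name `mark` are my own, not from any book): a spaced word that starts with 'a', and a spaced word with an accented letter.

```python
import re

STEP1 = re.compile(r"\b([B-HJ-Zb-z]) ([B-HJ-Zb-z])\b")
STEP2 = re.compile(r"(\+\w) ([aAI]) (\w)\b")

def mark(text: str) -> str:
    """Run Step 1 until nothing changes, then Step 2 once."""
    while True:
        text, count = STEP1.subn(r"\1+\2", text)
        if count == 0:
            break
    return STEP2.sub(r"\1+\2+\3", text)

# Leading "a": Step 2 needs a "+letter" BEFORE the "a", so it stays detached.
starts_with_a = mark("the a l p s are")

# Accented letter: "é" is outside [B-HJ-Zb-z], and the "a" blocks the rest,
# so nothing gets joined at all.
accented = mark("c a f é")

print(starts_with_a)
print(accented)
```

The first string comes out as "the a l+p+s are" and the second is unchanged, so both would need a manual fix.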

Step 3

Once you've completed everything, you replace the temporary '+' with a blank. That will merge the words together:

Search: \+
Replace: ***LEAVE THIS COMPLETELY BLANK***

Code:

<p>A decent example of Switzerland that I found within a German example.</p>
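The same merge as a Python sketch. One assumption to be loud about: this removes EVERY '+' in the text, so it is only safe if the book has no genuine plus signs (like "2+2"). If it does, a rarer temporary character such as '¬' is the safer choice.

```python
import re

after_step2 = ("<p>A decent example of S+w+i+t+z+e+r+l+a+n+d that I found "
               "within a G+e+r+m+a+n example.</p>")

# Step 3: replace the temporary marker with nothing.
merged = re.sub(r"\+", "", after_step2)
print(merged)
```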

Step 3 (Alternate)

Or, if you wanted to keep the emphasis, you can do something like this:

First replace "1 letter + plus sign + 1 letter" with a span:

Search: (\w)\+(\w)
Replace: <span class="emph">\1\2</span>


Then tackle the dangling single letters at the end (the "+d" in Switzerland):

Search: (\w+)</span>\+(\w)
Replace: \1\2</span>


Then keep merging the "emph spans followed by a plus sign" by running this until there's 0 replacements left:

Search: (\w+)</span>\+<span class="emph">
Replace: \1

Code:

<p>A decent example of <span class="emph">Switzerland</span> that I found within a <span class="emph">German</span> example.</p>
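Here is a Python sketch of the three span passes. Fair warning: the span markup inside these patterns is my assumption about what the regexes need to contain to get from the '+' version to the result above (class name "emph" taken from that result), so treat it as one way to do it rather than the exact expressions.

```python
import re

after_step2 = ("<p>A decent example of S+w+i+t+z+e+r+l+a+n+d that I found "
               "within a G+e+r+m+a+n example.</p>")

# Pass 1: wrap every "letter + letter" pair in a span.
text = re.sub(r"(\w)\+(\w)", r'<span class="emph">\1\2</span>', after_step2)

# Pass 2: pull a dangling "+letter" (the "+d" in Switzerland)
# into the span right before it.
text = re.sub(r"(\w+)</span>\+(\w)", r"\1\2</span>", text)

# Pass 3: merge "span, plus sign, span" until 0 replacements are left.
while True:
    text, count = re.subn(r'(\w+)</span>\+<span class="emph">', r"\1", text)
    if count == 0:
        break

print(text)
```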

Then, I highly recommend running DiapDealer's "TagMechanic" (Sigil) or "Diap's Editing Toolbag" (Calibre) to flip those <span>s into <em>s.

I wrote step-by-step instructions last year in "How do I change italic shortcut to use instead?".

This will ultimately get you the final outcome you want:

Code:

<p>A decent example of <em>Switzerland</em> that I found within a <em>German</em> example.</p>
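If you just want to try the span-to-em flip on a test string, a plain regex sketch in Python looks like this. This is NOT how the plugins work, and it assumes flat spans with exactly class="emph" and no tags nested inside them; for a real book, the plugins are the safer route.

```python
import re

spanned = ('<p>A decent example of <span class="emph">Switzerland</span> that '
           'I found within a <span class="emph">German</span> example.</p>')

# Only matches spans whose content has no "<" in it (no nested tags).
converted = re.sub(r'<span class="emph">([^<]*)</span>', r"<em>\1</em>", spanned)
print(converted)
```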