MobileRead Forums - View Single Post - Regular expression for removing blanks between letters

Tex2002ans · 01-29-2021, 07:56 PM

Quote:

Originally Posted by Ghitulescu

as most words are 4+ letters long.

And if you give more real-life examples, then the regex can be made more robust.

I tested both steps on the examples I gave, and they work perfectly fine on any run of 2+ single letters (other than "a", "A", or "I") next to each other.

Quote:

Originally Posted by Ghitulescu

Lucky me: where blanks are (whitespaces), the OCR inserts a double-space.

Good to hear. If those double spaces only occur in the letterspaced words... then your life is easier, no fancy regex needed!
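A minimal Python sketch of that simpler case. It assumes the letters inside a letterspaced word are separated by one space and the letterspaced words themselves by two spaces. The names `RUN`, `collapse_run`, and `fix_letterspacing` are made up for illustration, and nothing here has been checked against the actual OCR output.

```python
import re

LETTER = r"[^\W\d_]"  # any Unicode letter (no digits, no underscore)

# A run of single letters, each separated by one or two spaces.
RUN = re.compile(rf"(?<!\w){LETTER}(?: {{1,2}}{LETTER})+(?!\w)")

def collapse_run(match):
    # Assumption: a double space marks a real word boundary inside the run.
    words = match.group(0).split("  ")
    return " ".join(word.replace(" ", "") for word in words)

def fix_letterspacing(text):
    return RUN.sub(collapse_run, text)

print(fix_letterspacing("This is v e r y  i m p o r t a n t text."))
# This is very important text.
```

A standalone single-letter word right next to a run (for example "a v e r y") would still get swallowed, so the results need a manual check.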

And can I ask:

Which OCR are you using?

Can you share an example page or something from this specific book?

I'd be interested in taking a look.

Quote:

Originally Posted by Ghitulescu

Also, being foreign language, the elimination of I (first person) would have been counterproductive (lots of foreign glyphs are OCRed as I, for instance ïìîı, because they are longer than i, also l is considered as I in sans-serif fonts).

Which language? I was assuming English and no accents.

Yes, of course, different languages are going to have their own little single-letter-word quirks...

Like in Spanish, you'd want to avoid 'y' (since that = "and").

But then you would just swap out the [aAI] regex with a [yY] (or equivalent).
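To illustrate the swap, here is a hedged Python sketch where the standalone letters are a parameter. This is not the two-step regex from earlier in the thread. It is a simplified one-pass variant, and `build_pattern` and `merge_runs` are hypothetical names. The assumption is that a run may not start or end with a letter that is a word on its own in the language at hand.

```python
import re

def build_pattern(standalone):
    letter = r"[^\W\d_]"  # any Unicode letter
    # An "edge" letter is any letter that is NOT a standalone word.
    edge = rf"(?![{re.escape(standalone)}]){letter}"
    # Run: edge letter, optional middle letters, edge letter, single spaces between.
    return re.compile(rf"(?<!\w){edge}(?: {letter})* {edge}(?!\w)")

def merge_runs(text, standalone):
    pattern = build_pattern(standalone)
    return pattern.sub(lambda m: m.group(0).replace(" ", ""), text)

print(merge_runs("a G e r m a n example", "aAI"))  # a German example
print(merge_runs("o n a bus", "aAI"))              # on a bus
print(merge_runs("c a s a y perro", "yY"))         # casa y perro
```

The trade-off is that a letterspaced word which really does begin or end with one of those letters (say "a l p h a" in English) loses its first and last letter from the merge, so those still have to be fixed by hand.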

Accents are a similar situation. You'll just have to write a much uglier and harder-to-understand regex.

Quote:

Originally Posted by Ghitulescu

I know it was called letterspacing, but the use of this term would have forced me to rewrite the sentence once again

I tried to use simple words

Yep, it's just helpful when someone searches for a solution in the future for "how do I fix a gap between letters?".

Quote:

Originally Posted by JSWolf

Again do it by hand. What if you have something like "o n a bus"? You would end up with "ona bus"

You cannot regex this away. You have to do it by hand because you will combine letters/words you do not want to.

Use the regex for searching. But do the fixing by hand.

You can use Regex to do the vast bulk of the corrections, then manually fix the edge cases.

Better/faster to do:

95% correct with a 2-step regex.
5% manually find/correct/fix.

than:

100% manually fix.

And as usual, I've been pondering how to get Spellcheck Lists to help you solve this issue more efficiently.

Instead of using a '+' or '¬', it might be better to use a period:

Code:

<p>A decent example of S.w.i.t.z.e.r.l.a.n.d that I found within a G.e.r.m.a.n example.</p>

This allows you to spot all of them easily in Sigil's or Calibre's Spellcheck Lists:

[Image: Spellcheck.List.-.Letterspacing.Fix.png]

[Image: Spellcheck.List.-.Letterspacing.Fix.2.png]

All merged words right there in a simple list.
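If you would rather script the marking step outside the editor, a rough Python sketch could look like this. The names `mark_runs` and `list_marked` are hypothetical and have nothing to do with Sigil's or Calibre's own features. It assumes single spaces between the letters of a letterspaced word.

```python
import re

LETTER = r"[^\W\d_]"  # any Unicode letter

# Two or more single letters separated by single spaces.
RUN = re.compile(rf"(?<!\w){LETTER}(?: {LETTER})+(?!\w)")
# Letters joined by periods, e.g. "G.e.r.m.a.n.y".
DOTTED = re.compile(rf"(?<![\w.]){LETTER}(?:\.{LETTER})+(?!\w)")

def mark_runs(text):
    # Join the letters with periods instead of deleting the spaces outright.
    return RUN.sub(lambda m: m.group(0).replace(" ", "."), text)

def list_marked(text):
    # Unique period-joined words, sorted for review.
    return sorted(set(DOTTED.findall(text)))

marked = mark_runs("S w i t z e r l a n d is not G e r m a n y.")
print(marked)               # S.w.i.t.z.e.r.l.a.n.d is not G.e.r.m.a.n.y.
print(list_marked(marked))  # ['G.e.r.m.a.n.y', 'S.w.i.t.z.e.r.l.a.n.d']
```

Note that this simple version makes no exception for "a", "A", or "I", so a standalone letter sitting next to a run gets pulled into the marked word and shows up in the list for manual review.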

The period will bring a few other minor issues (like "a.m." or "p.m."), but the amount of time you'll save is massive.
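One possible way around those, sketched in Python: strip the periods from every period-joined word except the ones on a small keep-list. The `KEEP` set and the `strip_marks` name are assumptions for illustration, and a real book would need its own list of abbreviations.

```python
import re

# Letters joined by periods, e.g. "S.w.i.t.z.e.r.l.a.n.d" or "a.m".
DOTTED = re.compile(r"(?<![\w.])[^\W\d_](?:\.[^\W\d_])+(?!\w)")

# Hypothetical keep-list of genuine abbreviations (without the final period).
KEEP = {"a.m", "p.m"}

def strip_marks(text, keep=KEEP):
    def repl(match):
        token = match.group(0)
        # Leave known abbreviations alone, merge everything else.
        return token if token.lower() in keep else token.replace(".", "")
    return DOTTED.sub(repl, text)

print(strip_marks("At 9 a.m. we left S.w.i.t.z.e.r.l.a.n.d."))
# At 9 a.m. we left Switzerland.
```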