Old 01-29-2021, 06:56 PM   #12
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Ghitulescu
as most words are 4+ letters long.
And if you give more real-life examples, then the regex can be made more robust.

I tested both steps on the examples I gave, and they work perfectly fine on any 2+ single letters (not "a", "A", or "I") next to each other.
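For anyone landing here without the earlier posts, here is a rough Python sketch of the two-step idea. It is not the exact regex from this thread: the names `KEEP`, `MARK`, `mark_runs`, and `strip_marks` are made up for illustration, and the rule "a run needs at least two letters outside a/A/I" is my own assumption about how to keep real one-letter words safe.

```python
import re

KEEP = set("aAI")  # assumption: legitimate one-letter English words
MARK = "+"         # temporary marker placed between merged letters

# A run: two or more lone ASCII letters separated by single spaces.
RUN = re.compile(r"(?<![A-Za-z])[A-Za-z](?: [A-Za-z])+(?![A-Za-z])")

def mark_runs(text: str) -> str:
    """Step 1: replace the spaces inside a letterspaced run with MARK."""
    def fix(m: re.Match) -> str:
        letters = m.group(0).split(" ")
        # Leave the run alone unless 2+ letters fall outside KEEP,
        # so a plain "I a" in normal text is not touched.
        if sum(ch not in KEEP for ch in letters) < 2:
            return m.group(0)
        return MARK.join(letters)
    return RUN.sub(fix, text)

def strip_marks(text: str) -> str:
    """Step 2: after reviewing the marked words, drop the markers."""
    return re.sub(rf"(?<=[A-Za-z]){re.escape(MARK)}(?=[A-Za-z])", "", text)

print(strip_marks(mark_runs("A decent example of S w i t z e r l a n d here.")))
```

Note that this sketch still turns "o n a bus" into "ona bus", so the marked words need a quick manual check between the two steps.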

Quote:
Originally Posted by Ghitulescu
Lucky me: where blanks are (whitespaces), the OCR inserts a double-space.
Good to hear. If those double spaces only occur in the letterspaced words, then your life is easier: no fancy regex needed!
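As a minimal sketch of what that could look like in Python, assuming the OCR output really does put one space between letters and two spaces between words inside a letterspaced passage (the sample sentence and the names `RUN` and `collapse` are mine, not from the book):

```python
import re

L = r"[^\W\d_]"  # any Unicode letter (word character minus digits and "_")

# A letterspaced passage: lone letters separated by one or two spaces.
RUN = re.compile(rf"(?<!{L}){L}(?: {{1,2}}{L})+(?!{L})")

def collapse(text: str) -> str:
    """Join letters across single spaces; turn double spaces into word gaps."""
    def fix(m: re.Match) -> str:
        words = m.group(0).split("  ")          # double space = real blank
        return " ".join(w.replace(" ", "") for w in words)
    return RUN.sub(fix, text)

print(collapse("Das ist e i n  B e i s p i e l."))
```

This has no list of protected one-letter words, so two genuine single-letter words sitting next to each other would still get merged and need a manual look.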

And can I ask:

Which OCR are you using?

Can you share an example page or something from this specific book?

I'd be interested in taking a look.

Quote:
Originally Posted by Ghitulescu
Also, being foreign language, the elimination of I (first person) would have been counterproductive (lots of foreign glyphs are OCRed as I, for instance ïìîı, because they are longer than i, also l is considered as I in sans-serif fonts).
Which language? I was assuming English and no accents.

Yes, of course, different languages are going to have their own little single-letter-word quirks...

Like in Spanish, you'd want to avoid 'y' (since that = "and").

But then you would just swap out the [aAI] regex with a [yY] (or equivalent).

Accents are a similar situation. You'll just have to write a much uglier and harder-to-understand regex.
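A small Python sketch of the swap, with the caveat that the `KEEP` table and the `lone_letter` helper are hypothetical and the per-language lists are only starting points (Spanish, for instance, also has other one-letter words such as "a" and "o"). In Python's `re`, `[^\W\d_]` already matches accented letters, so that part may turn out less ugly than in other regex flavors:

```python
import re

# Hypothetical per-language lists of one-letter words to leave alone.
KEEP = {"en": "aAI", "es": "yY"}

def lone_letter(lang: str) -> re.Pattern:
    """Match a single letter standing alone, unless it is in KEEP[lang]."""
    keep = re.escape(KEEP[lang])
    # [^\W\d_] = any Unicode letter, accented ones included.
    return re.compile(rf"(?<!\w)(?![{keep}])[^\W\d_](?!\w)")

print(lone_letter("en").findall("I saw a b c"))
print(lone_letter("es").findall("ñ y é"))
```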

Quote:
Originally Posted by Ghitulescu
I know it was called letterspacing, but the use of this term would have forced me to rewrite the sentence once again I tried to use simple words
Yep, it's just helpful when someone searches for a solution in the future for "how do I fix a gap between letters?".

Quote:
Originally Posted by JSWolf
Again do it by hand. What if you have something like "o n a bus"? You would end up with "ona bus"

You cannot regex this away. You have to do it by hand because you will combine letters/words you do not want to.

Use the regex for searching. But do the fixing by hand.
You can use Regex to do the vast bulk of the corrections, then manually fix the edge cases.

Better/faster to do:
  • 95% correct with a 2-step regex.
  • 5% manually find/correct/fix.

than:
  • 100% manually fix.

And as usual, I've been pondering how to get Spellcheck Lists to help you solve this issue more efficiently.

Instead of using a '+' or '¬', it might be better to use a period:

Code:
<p>A decent example of S.w.i.t.z.e.r.l.a.n.d that I found within a G.e.r.m.a.n example.</p>
This allows you to spot all of them easily in Sigil's or Calibre's Spellcheck Lists:

[Attached images: Spellcheck.List.-.Letterspacing.Fix.png and Spellcheck.List.-.Letterspacing.Fix.2.png]

All merged words right there in a simple list.

The period will bring a few other minor issues (like "a.m." or "p.m."), but the amount of time you'll save is massive.
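If you would rather script the cleanup than click through the list, here is a hedged Python sketch of the same idea: collect the dotted words for review, then merge them while skipping known abbreviations. The `SKIP` set and the function names are assumptions for illustration, and the "9 a.m. today" part of the sample is added by me to show the edge case.

```python
import re

# Letters joined by periods, e.g. "S.w.i.t.z.e.r.l.a.n.d" (also hits "a.m").
DOTTED = re.compile(r"(?<![\w.])(?:[^\W\d_]\.)+[^\W\d_](?!\w)")

# Hypothetical list of real abbreviations to leave untouched.
SKIP = {"a.m", "p.m", "e.g", "i.e"}

def dotted_tokens(text: str) -> list[str]:
    """List every dotted token once, like a spellcheck word list."""
    return sorted(set(DOTTED.findall(text)))

def merge_dotted(text: str) -> str:
    """Remove the periods from dotted tokens that are not in SKIP."""
    def fix(m: re.Match) -> str:
        tok = m.group(0)
        return tok if tok.lower() in SKIP else tok.replace(".", "")
    return DOTTED.sub(fix, text)

sample = ("<p>A decent example of S.w.i.t.z.e.r.l.a.n.d that I found "
          "within a G.e.r.m.a.n example at 9 a.m. today.</p>")
print(dotted_tokens(sample))
print(merge_dotted(sample))
```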

Last edited by Tex2002ans; 01-29-2021 at 07:13 PM.