Quote:
Originally Posted by Ghitulescu
as most words are 4+ letters long.
|
And if you give more real-life examples, then the regex can be made more robust.
I tested both steps on the examples I gave, and they work perfectly fine on any 2+ single letters (not "a", "A", or "I") next to each other.
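For anyone landing here later, this is a rough Python sketch of what a 2-step pass like that could look like. It is not the exact regex from earlier in the thread. The '¬' marker, the function names, and the rule that "a", "A", and "I" may sit inside a run but never start one are all assumptions of mine.

```python
import re

MARK = "¬"  # assumed marker; pick any character that never occurs in the book

# Step 1: mark the single space between two lone letters.
# Assumption: "a", "A" and "I" may continue a run but may not start one,
# so "in a G e r m a n" keeps its article.
STEP1 = re.compile(r"\b((?<=\b[A-Za-z] )[aAI]|[b-zB-HJ-Z]) (?=[A-Za-z]\b)")

# Step 2: delete the marker wherever it sits between two letters.
STEP2 = re.compile(r"(?<=[A-Za-z])" + re.escape(MARK) + r"(?=[A-Za-z])")

def mark_gaps(text):
    """Replace letterspacing gaps with the marker so they can be reviewed."""
    return STEP1.sub(r"\1" + MARK, text)

def join_marked(text):
    """Remove the markers, which merges the letters into words."""
    return STEP2.sub("", text)

sample = "An example of S w i t z e r l a n d in a G e r m a n book."
marked = mark_gaps(sample)
fixed = join_marked(marked)
print(marked)  # An example of S¬w¬i¬t¬z¬e¬r¬l¬a¬n¬d in a G¬e¬r¬m¬a¬n book.
print(fixed)   # An example of Switzerland in a German book.
```

Keeping the two steps separate means you can look over the marked text before anything gets merged. Words that really start with A or I (like "A u s t r i a") will lose their first letter under this rule, so those still need a manual look.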
Quote:
Originally Posted by Ghitulescu
Lucky me: where blanks are (whitespaces), the OCR inserts a double-space.
|
Good to hear. If those double spaces only occur in the letterspaced words, then your life is easier: no fancy regex needed!
And can I ask:
Which OCR are you using?
Can you share an example page or something from this specific book?
I'd be interested in taking a look.
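If the double-space observation holds, the cleanup could be as small as the sketch below. I'm assuming a single space separates the letters of a letterspaced word and two spaces mark the real word gaps around it. The sample string is made up, not taken from the book.

```python
import re

# Assumption: one space = gap between letters of a letterspaced word,
# two (or more) spaces = a real gap between words.
LETTER_GAP = re.compile(r"(?<=\b[A-Za-z]) (?=[A-Za-z]\b)")
WORD_GAP = re.compile(r" {2,}")

def join_letterspaced(text):
    joined = LETTER_GAP.sub("", text)  # close single-space gaps between lone letters
    return WORD_GAP.sub(" ", joined)   # then shrink the real word gaps to one space

print(join_letterspaced("o n  a  b u s"))  # on a bus
```

Because the real word gaps are two spaces wide, the first pattern never touches them, so single-letter words like "a" survive without any exclusion list.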
Quote:
Originally Posted by Ghitulescu
Also, being foreign language, the elimination of I (first person) would have been counterproductive (lots of foreign glyphs are OCRed as I, for instance ïìîı, because they are longer than i, also l is considered as I in sans-serif fonts).
|
Which language? I was assuming English and no accents.
Yes, of course, different languages are going to have their own little single-letter-word quirks...
Like in Spanish, you'd want to avoid 'y' (since that = "and").
But then you would just swap out the [aAI] in the regex for a [yY] (or equivalent).
Accents, similar situation. You'll just have to write a much uglier and harder-to-understand regex.
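To give an idea of the swap, here is a hedged sketch that builds the marking regex from whatever single-letter words the language has. The helper names `make_step1` and `join_letters` are hypothetical, the '¬' marker is assumed not to occur in the text, and `[^\W\d_]` is just a Python way of saying "any letter, accented ones included".

```python
import re

LETTER = r"[^\W\d_]"  # any Unicode letter, so accented letters count too
MARK = "¬"            # assumed marker that never appears in the text

def make_step1(single_letter_words):
    """Build the marking regex; the given letters may not start a run."""
    keep = "[" + re.escape(single_letter_words) + "]"
    return re.compile(
        rf"\b((?<=\b{LETTER} ){keep}|(?!{keep}){LETTER}) (?={LETTER}\b)"
    )

def join_letters(step1, text):
    """Mark the gaps, then drop the markers."""
    return step1.sub(r"\1" + MARK, text).replace(MARK, "")

english = make_step1("aAI")
spanish = make_step1("yY")

text = "pan y c a m i ó n"
print(join_letters(spanish, text))  # pan y camión
print(join_letters(english, text))  # pan ycamión
```

The second line of output shows why the character class has to match the language: with the English list, the Spanish "y" gets glued onto the next word.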
Quote:
Originally Posted by Ghitulescu
I know it was called letterspacing, but the use of this term would have forced me to rewrite the sentence once again. I tried to use simple words.
|
Yep, it's just helpful when someone searches for a solution in the future for "how do I fix a gap between letters?".
Quote:
Originally Posted by JSWolf
Again do it by hand. What if you have something like "o n a bus"? You would end up with "ona bus"
You cannot regex this away. You have to do it by hand because you will combine letters/words you do not want to.
Use the regex for searching. But do the fixing by hand.
|
You can use regex to do the vast bulk of the corrections, then manually fix the edge cases.
Better/faster to do:
- 95% correct with a 2-step regex.
- 5% manually find/correct/fix.
than to do 100% of it by hand.
And as usual, I've been pondering how to get Spellcheck Lists to help you solve this issue more efficiently.
Instead of using a '+' or '¬', it might be better to use a period:
Code:
<p>A decent example of S.w.i.t.z.e.r.l.a.n.d that I found within a G.e.r.m.a.n example.</p>
This allows you to spot all of them easily in Sigil's or Calibre's Spellcheck Lists, with all merged words right there in a simple list.
The period will bring a few other minor issues (like "a.m." or "p.m."), but the amount of time you'll save is massive.
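Outside of Sigil or Calibre, the same review idea can be approximated with a few lines of Python. This is only a sketch: the `SKIP` set of real abbreviations and both function names are my own assumptions, and the sample paragraph is adapted from the example above.

```python
import re

# A run of single letters joined by periods, e.g. "G.e.r.m.a.n".
DOTTED = re.compile(r"\b[A-Za-z](?:\.[A-Za-z])+\b")

# Assumed list of genuine abbreviations that must keep their periods.
SKIP = {"a.m", "p.m", "e.g", "i.e"}

def dotted_words(text):
    """List the period-marked candidates so they can be reviewed in one place."""
    return [w for w in DOTTED.findall(text) if w.lower() not in SKIP]

def undot(text):
    """Strip the periods from every candidate that is not a known abbreviation."""
    def repl(match):
        word = match.group(0)
        return word if word.lower() in SKIP else word.replace(".", "")
    return DOTTED.sub(repl, text)

page = "<p>At 9 a.m. I found S.w.i.t.z.e.r.l.a.n.d in a G.e.r.m.a.n example.</p>"
print(dotted_words(page))  # ['S.w.i.t.z.e.r.l.a.n.d', 'G.e.r.m.a.n']
print(undot(page))         # <p>At 9 a.m. I found Switzerland in a German example.</p>
```

Anything that shows up in the list but should not be merged can be added to `SKIP` before running the second step.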