01-25-2021, 03:48 AM | #1 |
Fanatic
Posts: 563
Karma: 403106
Join Date: Aug 2014
Device: PRS-T1
|
Regular expression for removing blanks between letters
Some OCR software converts a spaced-out word into a series of characters separated by blank spaces, like S w i t z e r l a n d. In some cases these can be fixed by hand (for instance, when only important concepts are spaced out); however, when entire paragraphs across the whole book are spaced out, an automated method would be very helpful.
I tried [a-z;A-Z].[a-z;A-Z]. and similar, but these only find the places; they do not replace the right letters. I could not find any relevant thread, which I hope does not mean there is no solution to this. Thank you for any hint or solution. |
01-25-2021, 06:25 AM | #2 |
Resident Curmudgeon
Posts: 74,512
Karma: 129668758
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
The problem will be the correct space between two words. You would end up combining words by removing the space. I don't know how you would tell whether a space between letters is inside a word or between two words. You'll just have to fix this by hand.
|
01-25-2021, 08:23 AM | #3 |
Fanatic
|
Well, essentially there are very, very, very few one-letter words like "a".
This is why I used a regex with two spaces. The only practical solution is to use pairs of letters separated by a space and preceded or followed by another space. I hoped for a nice solution. |
01-25-2021, 08:37 AM | #4 |
Resident Curmudgeon
|
That won't work either without error. By hand is your only reliable solution. But what you can do is use regex to try to find the problem words. I would go with letter space letter space letter space to find 3 letters with spaces. Then you can fix the words by hand. You could also try letter space letter space and see how that goes.
Last edited by JSWolf; 01-25-2021 at 08:40 AM. |
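For anyone who wants to try the search-only approach outside the editor, here is a minimal Python sketch. It only lists suspicious runs and changes nothing, which matches the "find with regex, fix by hand" advice. The function name `find_spaced_runs` and the sample sentence are made up for illustration, and the "3 or more letters" threshold is just the starting point suggested above.

```python
import re

# Three or more single letters, each separated by exactly one space.
SPACED_RUN = re.compile(r"\b[A-Za-z](?: [A-Za-z]){2,}\b")


def find_spaced_runs(text: str) -> list[str]:
    """Return every run of 3+ single letters separated by single spaces."""
    return SPACED_RUN.findall(text)


# Hypothetical sample text, not taken from a real book.
sample = "The capital of S w i t z e r l a n d is B e r n, not Project X."
for run in find_spaced_runs(sample):
    print(run)
```

Lowering `{2,}` to `{1,}` gives the "letter space letter" variant, at the cost of many more false hits such as "Project X a".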
01-26-2021, 08:09 PM | #5 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
But as JSWolf has stated, you have to be extremely careful about combining letters/words that shouldn't be combined. Quite often, books will have things like "Person B" + "Project X" + "time y".
Example Sentence
Let's take this as an example:
Code:
<p>A decent example of S w i t z e r l a n d that I found within a G e r m a n example.</p>
Step 1
You replace the space between letters with a temporary character, like '+' or '¬'. BUT, you only want to handle single letters that are NOT "A", "a", or "I":
Search: \b([B-HJ-Zb-z]) ([B-HJ-Zb-z])\b
Replace: \1+\2
Run it until there are 0 replacements left, since each pass only joins non-overlapping pairs. After you run this, you'll get:
Spoiler:
Step 2
Then you want to match the "A", "a", or "I" between two already connected letters:
Search: (\+\w) ([aAI]) (\w)\b
Replace: \1+\2+\3
Spoiler:
Those 2 Regexes should get you 95%+ of the way there. From there, you have to manually check/correct. (Apostrophes, accents, emphasized words that start with 'a', or other odd cases.)
Step 3
Once you've completed everything, you replace the temporary '+' with a blank. That will merge the words together:
Search: \+
Replace: ***LEAVE THIS COMPLETELY BLANK***
Code:
<p>A decent example of Switzerland that I found within a German example.</p>
Or, if you wanted to keep the emphasis, you can do something like this:
First replace "1 letter + plus sign + 1 letter" with a span:
Search: (\w)\+(\w)
Replace: <span class="emph">\1\2</span>
Spoiler:
Then tackle the dangling single letters at the end (the "+d" in Switzerland):
Search: <span class="emph">(\w+)</span>\+(\w)
Replace: <span class="emph">\1\2</span>
Spoiler:
Then keep merging the "emph spans followed by a plus sign" by running this until there are 0 replacements left:
Search: <span class="emph">(\w+)</span>\+<span class="emph">
Replace: <span class="emph">\1
Code:
<p>A decent example of <span class="emph">Switzerland</span> that I found within a <span class="emph">German</span> example.</p>
I wrote step-by-step instructions last year in "How do I change italic <i> shortcut to use <em> instead?". This will ultimately get you the final outcome you want:
Code:
<p>A decent example of <em>Switzerland</em> that I found within a <em>German</em> example.</p>
Last edited by Tex2002ans; 01-26-2021 at 08:42 PM. |
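The three steps can also be scripted. What follows is a rough Python sketch using the same two search patterns from the post, under the assumption that '+' does not occur anywhere in the text. The function names (`join_letters`, `strip_joiners`, `wrap_emphasis`) are invented for this example. Each search is repeated until it makes 0 replacements, because a single pass only joins non-overlapping pairs. The emphasis variant is a shortcut that wraps each joined run in one go rather than the span-merging sequence described above.

```python
import re

# Step 1: two single letters (anything but "A", "a", "I") around one space.
PAIR = re.compile(r"\b([B-HJ-Zb-z]) ([B-HJ-Zb-z])\b")
# Step 2: an "A", "a" or "I" between an already joined letter and one more letter.
ARTICLE = re.compile(r"(\+\w) ([aAI]) (\w)\b")
# A finished run of letters joined by '+', e.g. "G+e+r+m+a+n".
JOINED_RUN = re.compile(r"\w(?:\+\w)+")


def join_letters(text: str) -> str:
    """Steps 1 and 2: put a '+' between the letters of each spaced-out word."""
    if "+" in text:
        raise ValueError("text already contains '+'; pick another temporary character")
    for pattern, replacement in ((PAIR, r"\1+\2"), (ARTICLE, r"\1+\2+\3")):
        count = 1
        while count:  # repeat each step until it makes 0 replacements
            text, count = pattern.subn(replacement, text)
    return text


def strip_joiners(joined: str) -> str:
    """Step 3: delete the temporary '+' so the letters merge."""
    return joined.replace("+", "")


def wrap_emphasis(joined: str, tag: str = "em") -> str:
    """Alternative to step 3: merge each joined run and wrap it in a tag."""
    return JOINED_RUN.sub(
        lambda m: f"<{tag}>{m.group(0).replace('+', '')}</{tag}>", joined
    )


sample = "<p>A decent example of S w i t z e r l a n d that I found within a G e r m a n example.</p>"
joined = join_letters(sample)
print(joined)
print(strip_joiners(joined))
print(wrap_emphasis(joined))
```

Stopping after `join_letters` and reviewing the '+' runs by hand keeps the manual check in the loop. Words where every other letter is an "a" (for example "B a n a n a") are not caught by these two patterns and still need hand work.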
|
01-27-2021, 04:06 AM | #6 |
Fanatic
|
|
01-27-2021, 06:20 PM | #7 | |
Wizard
|
Quote:
And for future info, the "gap between letters" is called letterspacing. It's sometimes used as emphasis instead of bold/italics, and can be replicated using CSS:
HTML:
Code:
<p>As much mud in the streets as if the waters had but newly retired from the face of the earth, and it would not be wonderful to meet a <em>Megalosaurus</em>, forty feet long or so, waddling like an elephantine lizard up <em>Holborn Hill</em>.</p>
CSS:
Code:
em { font-style: normal; font-weight: bold; letter-spacing: .2em; }
Especially all the posts in those two threads, we went into extremely detailed discussions about differences between italics/emphasis, bold/strong, plus different methods of application. |
|
01-29-2021, 08:15 AM | #8 |
Fanatic
|
Code:
\b([A-Za-z]) ([A-Za-z]) ([A-Za-z]) ([A-Za-z])\b
I could live with a handful of 3-letter-long "escapees". I know it is called letterspacing, but using that term would have forced me to rewrite the sentence once again; I tried to use simple words. The OCR, however, inserts spaces. |
01-29-2021, 08:32 AM | #9 |
Resident Curmudgeon
|
delete post
|
01-29-2021, 08:35 AM | #10 | |
Resident Curmudgeon
|
Quote:
You cannot regex this away. You have to do it by hand because you will combine letters/words you do not want to. Use the regex for searching. But do the fixing by hand. |
|
01-29-2021, 09:19 AM | #11 |
Fanatic
|
Lucky me: where the real blanks (whitespace between words) are, the OCR inserts a double space.
Yes, handwork is needed. |
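If the double-space observation holds for the whole book, the word boundaries are already marked, and a rough Python sketch of how to exploit that could look like this. It assumes that inside spaced-out passages a single space always sits between two letters of the same word and a double space always separates words. The name `merge_double_spaced` is made up for the example.

```python
import re

# A single space whose neighbours on both sides are one-letter "words".
# [^\W\d_] means "any letter", so accented letters count as well.
LETTER_GAP = re.compile(r"(?<=\b[^\W\d_]) (?=[^\W\d_]\b)")
# Two or more spaces in a row (the real word boundaries in spaced-out text).
DOUBLE_SPACE = re.compile(r" {2,}")


def merge_double_spaced(text: str) -> str:
    """Close single-space gaps between lone letters, then collapse double spaces."""
    merged = LETTER_GAP.sub("", text)
    return DOUBLE_SPACE.sub(" ", merged)


print(merge_double_spaced("S w i t z e r l a n d  i s  i n  E u r o p e"))
```

Two one-letter words that really are separated by a single space in normal text (something like "Plan B I like") would still be glued together, so the result needs the same hand check as before.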
01-29-2021, 06:56 PM | #12 | ||||
Wizard
|
And if you give more real-life examples, then the regex can be made more robust.
I tested both steps on the examples I gave, and it works perfectly fine on any 2+ single letters (not "a", "A", or "I") next to each other. Quote:
And can I ask: Which OCR are you using? Can you share an example page or something from this specific book? I'd be interested in taking a look. Quote:
Yes, of course, different languages are going to have their own little single-letter-word quirks... Like in Spanish, you'd want to avoid 'y' (since that = "and"). But then you would just swap out the [aAI] regex with a [yY] (or equivalent). Accents, similar situation. You'll just have to make much uglier and harder-to-understand regex. Quote:
Quote:
Better/faster to do:
than:
And as usual, I've been pondering how to get Spellcheck Lists to help you solve this issue more efficiently. Instead of using a '+' or '¬', it might be better to use a period:
Code:
<p>A decent example of S.w.i.t.z.e.r.l.a.n.d that I found within a G.e.r.m.a.n example.</p>
All merged words right there in a simple list. The period will bring a few other minor issues (like "a.m." or "p.m."), but the amount of time you'll save is massive.
Last edited by Tex2002ans; 01-29-2021 at 07:13 PM. |
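Outside the editor, a similar review list can be approximated with a few lines of Python. This is only a sketch of the idea, not how the Spellcheck Lists feature works internally. It collects every run of single letters joined by periods and counts how often each occurs. The function name `list_dotted_words` is hypothetical.

```python
import re
from collections import Counter

# Single letters joined by periods, e.g. "G.e.r.m.a.n".
DOTTED_RUN = re.compile(r"\b[^\W\d_](?:\.[^\W\d_])+\b")


def list_dotted_words(text: str) -> Counter:
    """Count every period-joined run so it can be reviewed as one list."""
    return Counter(DOTTED_RUN.findall(text))


sample = "<p>A decent example of S.w.i.t.z.e.r.l.a.n.d that I found within a G.e.r.m.a.n example.</p>"
for run, count in list_dotted_words(sample).items():
    # Show the joined form, the merged word, and the number of occurrences.
    print(run, "->", run.replace(".", ""), count)
```

Abbreviations such as "a.m." show up in this list too (as "a.m"), which is the minor issue mentioned above; they are easy to spot and skip during review.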
||||
01-29-2021, 08:51 PM | #13 |
Resident Curmudgeon
|
The spellcheck lists is a very good idea.
|
01-30-2021, 03:31 AM | #14 |
frumious Bandersnatch
Posts: 7,516
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Maybe convert all periods to ¬ before dealing with the spaces (and back to periods after)? [That's what you use protecting groups for in chemistry.]
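A small Python sketch of that protect-then-restore idea, assuming it is applied to plain paragraph text only (not to tags, attributes, or filenames) and that '¬' does not already occur in the text. The joining pattern here is a simplified stand-in for illustration, not the two-step search from earlier in the thread, and the function name is made up.

```python
import re

PLACEHOLDER = "¬"  # stands in for real periods; assumed absent from the text

# A single space between two lone letters (simplified joiner for this sketch).
LETTER_GAP = re.compile(r"(?<=\b[^\W\d_]) (?=[^\W\d_]\b)")


def merge_with_protected_periods(text: str) -> str:
    """Protect real periods, join lone letters with '.', merge, then restore."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder already present; pick another character")
    protected = text.replace(".", PLACEHOLDER)   # protect the real periods
    joined = LETTER_GAP.sub(".", protected)      # "B e r n" -> "B.e.r.n"
    # This is the point where the period-joined words could be reviewed by hand.
    merged = joined.replace(".", "")             # drop the temporary periods
    return merged.replace(PLACEHOLDER, ".")      # deprotect


print(merge_with_protected_periods("See B e r n at 9 a.m. today."))
```

Doing the protect and restore inside one function makes it harder to forget the final flip back to periods.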
|
01-30-2021, 04:44 AM | #15 | |
Wizard
|
Quote:
Although I'm thinking a mass replace of . may cause the book to explode if you forget to flip back! Because . is probably commonly used outside of text (within class names, filenames, etc.):
Code:
<link href="../Styles/Style0001¬css" type="text/css" rel="stylesheet"/>
<a href="http://www¬mobileread¬com">
<img src="image¬jpg" />
Or "a gun": · ¬<(o.o<)
Didn't know the chemistry usage though. Looks like I have something new to read about.
Last edited by Tex2002ans; 01-30-2021 at 04:48 AM. |
|
Tags |
blank characters, epub 2, regular expression |