#1 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
How do you deal with soft hyphens in OCR texts?
What is the best and most efficient way to rejoin words that are split across two lines in OCR-ed text?
I use a sed script that moves the first part of the word to the start of the next line and replaces the hyphen with #. I then check the marked words for any that contain hard hyphens. Page breaks are handled. How do you solve this problem?
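The poster's sed script is not shown, so the following is only a minimal Python sketch of the same idea, under assumptions: the function name `mark_line_end_hyphens` is hypothetical, `#` is used as the marker as in the post, and page breaks are not handled here.

```python
import re

def mark_line_end_hyphens(text: str, mark: str = "#") -> str:
    """Move a word fragment that ends a line with '-' onto the next line,
    replacing the hyphen with a marker so the word can be reviewed later."""
    lines = text.split("\n")
    # The last line is never examined: there is no following line to join to.
    for i in range(len(lines) - 1):
        stripped = lines[i].rstrip()
        m = re.search(r"(\w+)-$", stripped)
        if m:
            # Cut the fragment off this line and prefix it to the next one.
            lines[i] = stripped[: m.start()].rstrip()
            lines[i + 1] = m.group(1) + mark + lines[i + 1].lstrip()
    return "\n".join(lines)

sample = "The quick exam-\nple of a well-\nknown case"
print(mark_line_end_hyphens(sample))
```

Every word containing the marker can then be searched for and checked by hand: soft hyphens become a plain join, hard hyphens get their `-` back.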
#2 |
frumious Bandersnatch
Posts: 7,543
Karma: 19001583
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Something similar to that. It is much easier in Spanish, where hard hyphens are very seldom used. Search for "-\n" and replace it with "¬" (for instance), then manually check every ¬: most will simply be deleted, and some will turn back into "-" or "- ". If I spot a common case, I bulk-replace it (for example, "Jean¬Jacques" appearing several times in the first few chapters).
Be careful with words like to-day, up-stairs, etc. They may have been written with hyphens in that book even though they usually aren't today. Whenever a suspect word appears, search the whole book for similar instances and see if there's a pattern.
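The "search the whole book for a pattern" step can be partly automated. This is a rough sketch, not the poster's actual workflow: it assumes "-\n" has already been replaced with "¬", and the function name `resolve_markers` is hypothetical. It resolves a marked word only when the book itself shows one spelling (joined or hyphenated) and not the other. Everything else keeps its marker for manual checking.

```python
import re

MARK = "¬"  # placeholder from the post; any character absent from the text works

def resolve_markers(text: str, mark: str = MARK) -> str:
    """Resolve 'left¬right' using spellings found elsewhere in the same text."""
    # Words (including hyphenated ones) as they appear away from line ends.
    known = {w.lower() for w in re.findall(r"\w+(?:-\w+)*", text)}

    def decide(m: re.Match) -> str:
        left, right = m.group(1), m.group(2)
        joined = left + right
        hyphenated = left + "-" + right
        has_joined = joined.lower() in known
        has_hyphenated = hyphenated.lower() in known
        if has_joined and not has_hyphenated:
            return joined
        if has_hyphenated and not has_joined:
            return hyphenated
        return m.group(0)  # unseen or ambiguous: leave the marker in place

    pattern = re.compile(r"(\w+)" + re.escape(mark) + r"(\w+)")
    return pattern.sub(decide, text)

book = "He came to¬day. Earlier, to-day was fine. An exam¬ple and an example. Odd¬ball."
print(resolve_markers(book))
```

The evidence is only as good as the OCR: a misrecognised word elsewhere in the book can mislead the lookup, so the remaining ¬ markers and a final proofread are still needed.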
#3 |
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
If the text is already in HTML, I use this regex and replace the matches one by one:
Search:
-</p>\s+<p>
A faster way might be to replace every match of the above search with a hyphen in one go, then go through your typical "hyphenation search" pass and fix any unneeded hyphens. I personally replace one by one so I can double-check against the PDF that no chunk of text is missing (most of my work is PDF -> EPUB).
To clean up hyphens throughout the text, I use the Spellcheck function in Sigil (Tools - Spellcheck - Spellcheck (Alt+Q)). This gives you a list of every word in the book. I then type a hyphen in the search box, look through the resulting list of hyphenated words, and remove the hyphens where needed. I do one pass with "Show All Words" checked, then a pass with it unchecked (to show only the words Sigil thinks are misspelled), and then maybe one more pass with "Show All Words" checked. In my case, these 2 or 3 passes usually catch almost all hyphenation errors in the book.
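Outside Sigil, the "faster way" and the hyphenated-word list could be approximated with a short script. This is a sketch under assumptions: the function names are hypothetical, the tag stripping is deliberately crude, and the regex is the one from the post, so it only matches a bare `<p>` without attributes.

```python
import re
from collections import Counter

def join_split_paragraphs(html: str) -> str:
    """Merge paragraphs that were split after a trailing hyphen, keeping the hyphen."""
    return re.sub(r"-</p>\s+<p>", "-", html)

def hyphenated_words(html: str) -> Counter:
    """Count every hyphenated word in the visible text, similar in spirit
    to filtering a word list by '-'."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag removal, fine for a word list
    return Counter(re.findall(r"\w+(?:-\w+)+", text))

page = "<p>This is a hyphen-</p>\n<p>ated word and a well-known one.</p>"
merged = join_split_paragraphs(page)
for word, count in hyphenated_words(merged).most_common():
    print(count, word)
```

The printed list is what you would then review by hand, removing the hyphen from entries such as "hyphen-ated" and leaving genuine compounds alone.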
#4 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
I export as RTF, open it in Word 2010, press Ctrl+F and search for "-^p". Then I manually go over each instance in the book. Yeah, it's tedious... But it takes less than 5 minutes, which is nothing compared to the rest of the digitization process. And I always proofread the final product, so the chances that one slipped by are really low.
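For plain-text exports, a similar find-and-review pass can be done without Word. The sketch below is an assumption-laden stand-in for the "-^p" search, not the poster's method: the function name `line_end_hyphens` and the `width` parameter are hypothetical. It changes nothing in the text and only reports where each line-end hyphen is, with some context.

```python
import re

def line_end_hyphens(text: str, width: int = 20):
    """Yield (line_number, context) for every hyphen directly before a line break."""
    for m in re.finditer(r"-\n", text):
        line_no = text.count("\n", 0, m.start()) + 1
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        # Show the line break as a visible symbol so the context fits on one line.
        yield line_no, text[start:end].replace("\n", "⏎")

sample = "A short exam-\nple here.\nNo more."
for line_no, context in line_end_hyphens(sample):
    print(line_no, context)
```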