|06-24-2013, 02:34 PM||#1|
Join Date: Sep 2010
Device: prs-t1, tablet, Nook Simple, assorted kindles
How do you deal with soft hyphens in OCR texts?
What is the best and most efficient way of rejoining words in an OCR-ed text which are split over two lines?
I use a sed script which prefixes the first part of the word to the next line, and replaces the hyphen with #, and then I examine these words for any that contain hard hyphens. Page breaks are handled.
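For anyone who would rather not use sed, here is a rough Python sketch of the same idea. It is not the actual sed script, just one way it could look: the function name `join_split_words` and the sample lines are made up for illustration, and page-break handling is left out.

```python
import re

def join_split_words(lines, marker="#"):
    """Move the first half of a line-broken word onto the next line.

    The line-end hyphen is replaced by `marker` so that every join can be
    reviewed afterwards (some of them will be real, hard hyphens).
    """
    out = []
    carry = ""  # word fragment waiting to be prefixed to the next line
    for line in lines:
        line = carry + line.rstrip()
        carry = ""
        # A run of non-space characters ending in a letter/digit, then "-" at end of line.
        m = re.search(r"(\S*\w)-$", line)
        if m:
            carry = m.group(1) + marker
            line = line[: m.start()].rstrip()
        out.append(line)
    if carry:  # the last line ended in a hyphen; keep the fragment visible
        out.append(carry)
    return out

sample = ["The quick brown fox jum-", "ped over the well-", "known dog."]
for joined in join_split_words(sample):
    print(joined)
```

Searching the result for `#` then gives the list of joins to check by hand.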
How do you solve this problem?
|06-24-2013, 02:44 PM||#2|
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Something similar to that. Much easier in Spanish, where hard hyphens are very seldom used. Search "-\n", replace with "¬" (for instance), then manually check all ¬: most will disappear, some will turn back into "-" or "- ". If I detect some common case, I can bulk-replace (like finding "Jean¬Jacques" several times in the first few chapters).
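A minimal sketch of that mark-then-review workflow in Python, assuming the book is a plain-text string with `\n` line endings. The helper names `mark_line_end_hyphens` and `resolve` are invented for this example.

```python
MARK = "¬"

def mark_line_end_hyphens(text):
    """Replace every hyphen + newline with a visible marker to review later."""
    return text.replace("-\n", MARK)

def resolve(text, word_with_mark, keep_hyphen):
    """Bulk-resolve one reviewed case everywhere in the text.

    keep_hyphen=True  turns 'Jean¬Jacques' into 'Jean-Jacques'
    keep_hyphen=False turns 'read¬ing' into 'reading'
    """
    fixed = word_with_mark.replace(MARK, "-" if keep_hyphen else "")
    return text.replace(word_with_mark, fixed)

text = "Jean-\nJacques was read-\ning."
text = mark_line_end_hyphens(text)
text = resolve(text, "Jean¬Jacques", keep_hyphen=True)
text = resolve(text, "read¬ing", keep_hyphen=False)
print(text)
```

Whatever still contains the marker after the bulk replacements is what remains to be checked manually.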
Be careful with words like to-day, up-stairs, etc. They might have been written with hyphens in that book, even though they usually aren't today. Whenever a suspect word appears, search the whole book for similar instances and see if there's a pattern.
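One possible way to check for such a pattern is to count how often each spelling occurs in the rest of the book. This is only a sketch: `spelling_counts` is a hypothetical helper, and it assumes plain text in which the line-end hyphens have not yet been joined (so they are not counted).

```python
import re

def spelling_counts(text, left, right):
    """Count how often a word appears hyphenated, closed up, or as two words."""
    l, r = re.escape(left), re.escape(right)
    forms = {
        "hyphenated": rf"\b{l}-{r}\b",  # to-day
        "closed": rf"\b{l}{r}\b",       # today
        "open": rf"\b{l}\s+{r}\b",      # to day
    }
    return {
        name: len(re.findall(pattern, text, flags=re.IGNORECASE))
        for name, pattern in forms.items()
    }

book = "To-day we went up-stairs. Later to-day it rained. Nobody goes there today."
print(spelling_counts(book, "to", "day"))
```

If the hyphenated form clearly dominates mid-line, the line-end cases of the same word probably want their hyphen back.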
|06-24-2013, 08:27 PM||#3|
Join Date: Jul 2012
If the text is already in HTML, I use this Regex, and replace one by one:
A faster way might be to use the above search and replace with a hyphen. Then go through your typical "hyphenation search" pass and fix any unneeded hyphens. I personally replace one by one just so I can double check with the PDF that a chunk of text is not missing (Most of my work is from PDF -> EPUB).
To clean up hyphens throughout the text, I use the Spellcheck function in Sigil (Tools - Spellcheck - Spellcheck (Alt+Q)). This will give you a list of every word in the book. I then insert a hyphen in the search box (See attached image).
Then I go through the list of words with hyphens and remove the hyphens where needed. I do one pass with "Show All Words" checked, then a pass with it unchecked (to show only the words Sigil thinks are misspelled), and then maybe one more pass with "Show All Words" checked. In my case, these 2 or 3 passes usually catch almost all hyphenation errors throughout the book.
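Outside Sigil, a similar word list can be approximated with a short script. The sketch below is not Sigil's Spellcheck, just a rough stand-in: it assumes plain text (HTML tags already stripped), and the names `hyphenated_words` and `suspicious` are made up for the example. It flags hyphenated words whose closed-up form also appears somewhere in the book.

```python
import re
from collections import Counter

def hyphenated_words(text):
    """Count every word that contains at least one internal hyphen."""
    words = re.findall(r"\b\w+(?:-\w+)+\b", text)
    return Counter(w.lower() for w in words)

def suspicious(text):
    """Hyphenated words whose closed-up form also occurs in the text."""
    plain = Counter(w.lower() for w in re.findall(r"\b\w+\b", text))
    return sorted(
        w for w in hyphenated_words(text)
        if plain[w.replace("-", "")] > 0
    )

book = "The well-known au-thor met the author of a well-known book."
print(hyphenated_words(book))
print(suspicious(book))
```

The full list corresponds roughly to the "Show All Words" pass, and the suspicious list is a quick first filter. Neither replaces checking against the source.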
|06-26-2013, 02:07 PM||#4|
Join Date: Nov 2009
Device: PW2 2014
I export as RTF, open it in Word 2010, use Ctrl+F and search for "-^p". Then I manually go over each instance throughout the entire book. Yeah, it's tedious... But it takes less than 5 minutes, which is nothing compared to the rest of the digitization process. And I always proofread the final product, so the chances that one of them slipped by are really low.