MobileRead Forums > E-Book Formats > Workshop
06-24-2013, 02:34 PM   #1
SBT
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
How do you deal with soft hyphens in OCR texts?

What is the best and most efficient way of rejoining words in an OCR-ed text that are split across two lines?

I use a sed script that moves the first part of the word onto the next line and replaces the hyphen with #. I then examine the marked words for any that contain hard hyphens. The script also handles page breaks.

How do you solve this problem?
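
For readers without a sed script to hand, the approach described above can be sketched in Python. This is not SBT's script, only a minimal illustration under stated assumptions: the names `mark_line_end_hyphens` and `marked_words` are made up for this example, `#` is assumed not to occur elsewhere in the text, and page breaks are not handled.

```python
import re

# A word fragment ending in a hyphen at the end of a line, plus any
# indentation at the start of the following line. A split across a blank
# line does not match and is left untouched.
LINE_END_HYPHEN = re.compile(r"[ \t]*(\S+)-\n[ \t]*(?=\S)")


def mark_line_end_hyphens(text, marker="#"):
    """Move each hyphen-split fragment onto the next line and mark the join."""
    # Assumption: the marker character does not already appear in the text.
    return LINE_END_HYPHEN.sub(lambda m: "\n" + m.group(1) + marker, text)


def marked_words(text, marker="#"):
    """Return every token that contains the marker, for manual review."""
    return re.findall(r"\S*" + re.escape(marker) + r"\S*", text)
```

Running `marked_words` on the output gives the list of rejoined words to check by hand: delete the marker for a soft hyphen, or turn it back into `-` for a hard one.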
06-24-2013, 02:44 PM   #2
Jellby
frumious Bandersnatch
Posts: 7,514
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
I do something similar. It is much easier in Spanish, where hard hyphens are very seldom used. I search for "-\n" and replace it with "¬" (for instance), then manually check every ¬: most will disappear, and some will turn back into "-" or "- ". If I spot a common case, I can bulk-replace it (like finding "Jean¬Jacques" several times in the first few chapters).

Be careful with words like to-day, up-stairs, etc. They might have been written with hyphens in that book, even though they usually aren't today. Whenever a suspect word appears, search the whole book for similar instances and see if there's a pattern.
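
The "search the whole book for similar instances" step can be partly automated. The sketch below is only an illustration, not Jellby's actual workflow: the function names are hypothetical, and it assumes "¬" does not already occur in the text. For each marked word it counts how often the joined form and the hyphenated form appear elsewhere in the book.

```python
import re
from collections import Counter

MARK = "¬"  # assumed not to occur anywhere in the original text

# Runs of letters, optionally joined by internal hyphens (e.g. "up-stairs").
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")
# Two runs of letters joined by the marker (e.g. "up¬stairs").
MARKED = re.compile(r"([^\W\d_]+)" + MARK + r"([^\W\d_]+)")


def mark_breaks(text):
    """Replace every hyphen followed by a line break with the marker."""
    return text.replace("-\n", MARK)


def tally_candidates(marked):
    """Map each marked word to (count of joined form, count of hyphenated form)."""
    words = Counter(w.lower() for w in WORD.findall(marked))
    report = {}
    for left, right in MARKED.findall(marked):
        joined = (left + right).lower()
        hyphenated = (left + "-" + right).lower()
        report[left + MARK + right] = (words[joined], words[hyphenated])
    return report
```

A result like `(2, 0)` suggests the book writes the word closed up, and `(0, 1)` suggests a real hyphen. `(0, 0)` means there is no other evidence in the book, so the word still needs a manual look.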
06-24-2013, 08:27 PM   #3
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
If the text is already in HTML, I use this regex and replace the matches one by one:

Search:

Code:
-</p>\s+<p>
Replace with nothing.

A faster way might be to use the above search and replace with a hyphen. Then go through your typical "hyphenation search" pass and fix any unneeded hyphens. I personally replace one by one just so I can double check with the PDF that a chunk of text is not missing (Most of my work is from PDF -> EPUB).
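
Outside an editor, the same search can be run with Python's `re` module. This is a minimal sketch using the pattern quoted above; the function names and the `context` and `keep_hyphen` parameters are made up for illustration. Note that the pattern only matches a bare `<p>`, so paragraphs with attributes would need a wider pattern.

```python
import re

# The search pattern quoted above: a paragraph that ends in a hyphen,
# followed by whitespace and the start of the next paragraph.
SPLIT_PARAGRAPH = re.compile(r"-</p>\s+<p>")


def list_split_paragraphs(html, context=30):
    """Return a short snippet around each match so it can be checked by eye."""
    snippets = []
    for m in SPLIT_PARAGRAPH.finditer(html):
        start = max(0, m.start() - context)
        end = min(len(html), m.end() + context)
        snippets.append(html[start:end])
    return snippets


def join_split_paragraphs(html, keep_hyphen=False):
    """Merge the two paragraphs, dropping the hyphen or keeping it for a later pass."""
    return SPLIT_PARAGRAPH.sub("-" if keep_hyphen else "", html)
```

`keep_hyphen=True` corresponds to the faster variant: the paragraphs are joined in bulk and the leftover hyphens are cleaned up in a later hyphenation pass.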

To clean up hyphens throughout the text, I use the Spellcheck function in Sigil (Tools - Spellcheck - Spellcheck (Alt+Q)). This will give you a list of every word in the book. I then insert a hyphen in the search box (See attached image).

Then I go through the list of hyphenated words and remove the hyphens where needed. I do one pass with "Show All Words" checked, then a pass with it unchecked (to show only the words Sigil thinks are misspelled), and then maybe one more pass with "Show All Words" checked. In my case, these 2 or 3 passes usually catch almost all hyphenation errors throughout the book.
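
A rough equivalent of that hyphen-filtered word list can be built outside Sigil. The sketch below does not use Sigil's spellchecker or any Sigil API; it is a plain-text approximation with a hypothetical function name. It lists every hyphenated word with its count, plus the count of the same word written without hyphens.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by internal hyphens (e.g. "well-known").
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def hyphenated_words(text):
    """Return sorted (word, count, count of de-hyphenated form) tuples."""
    counts = Counter(WORD.findall(text))
    rows = []
    for word, n in counts.items():
        if "-" in word:
            # Counter lookups of missing words return 0 without adding a key.
            rows.append((word, n, counts[word.replace("-", "")]))
    return sorted(rows)
```

A hyphenated word whose closed-up form also appears in the book is a good candidate for a closer look.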
Attached image: SigilSpellcheck.png
06-26-2013, 02:07 PM   #4
DSpider
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
I export as RTF, open it in Word 2010, use Ctrl+F and search for "-^p". Then I manually go over each instance throughout the entire book. Yeah, it's tedious... But it takes less than 5 minutes, which is nothing compared to the rest of the digitization process. And I always proofread the final product, so the chances that one of them slipped by are really low.
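
For plain-text exports, a search that roughly mirrors "-^p" can be sketched in Python. This is an illustration only, with a hypothetical function name, and it assumes each OCR line ends with a newline. It lists every line that ends in a hyphen together with the line that follows, so each instance can be reviewed by hand.

```python
def hyphen_line_ends(text):
    """Return (line number, line, next line) for every line ending in a hyphen."""
    lines = text.split("\n")
    hits = []
    for i, line in enumerate(lines):
        if line.rstrip().endswith("-"):
            # The last line of the text has no following line.
            nxt = lines[i + 1] if i + 1 < len(lines) else ""
            hits.append((i + 1, line, nxt))
    return hits
```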