MobileRead Forums - View Single Post - How do you deal with soft hyphens in OCR texts?

SBT · 06-24-2013, 02:34 PM

What is the best and most efficient way to rejoin words that are split across two lines in an OCR-ed text?

I use a sed script that moves the first part of each split word to the start of the next line and replaces the hyphen with #. I then check the marked words for any that contain hard hyphens. Page breaks are handled.
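
For illustration, here is a rough sketch of that kind of pass. It is written in Python rather than sed, and it is not the actual script from this post. The names `rejoin_hyphenated` and `marked_words` are made up for the example, and it assumes plain text with one OCR line per list item. Page-break handling is left out.

```python
import re

# Marker that replaces the line-end hyphen, following the '#' used in the post.
MARKER = "#"

# Matches a non-space fragment followed by a hyphen at the very end of a line.
_TRAILING = re.compile(r"(\S+)-\s*$")


def rejoin_hyphenated(lines):
    """Move each line-final hyphenated fragment onto the next line.

    The hyphen is replaced by MARKER so the joined words can be reviewed later.
    """
    out = []
    carry = ""  # fragment waiting to be prefixed to the next line
    for line in lines:
        if carry:
            line = carry + line.lstrip()
            carry = ""
        match = _TRAILING.search(line)
        if match:
            carry = match.group(1) + MARKER
            line = line[:match.start()].rstrip()
        out.append(line)
    if carry:
        # The last line ended in a hyphen; keep the fragment rather than drop it.
        out.append(carry)
    return out


def marked_words(lines):
    """List the joined words that still need a manual hard-hyphen check."""
    words = []
    for line in lines:
        words.extend(re.findall(r"\w+#\w+", line))
    return words


if __name__ == "__main__":
    sample = ["The quick brown fox jum-", "ped over a well-", "known dog."]
    joined = rejoin_hyphenated(sample)
    print("\n".join(joined))
    print(marked_words(joined))
```

After review, a marked word such as `jum#ped` would have the # deleted, while one such as `well#known` would have it turned back into a hyphen. Deciding which is which is still a manual (or dictionary-based) step that this sketch does not attempt.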

How do you solve this problem?
