06-24-2013, 02:34 PM   #1
SBT
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
How do you deal with soft hyphens in OCR texts?

What is the most efficient way to rejoin words that are split across two lines in an OCR-ed text?

I use a sed script that moves the first part of each split word to the start of the next line and replaces the hyphen with #. I then check the marked words for any that should keep a hard hyphen. The script also handles page breaks.
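
For illustration, the approach described above can be sketched in Python rather than sed. This is not the actual script: the function names and the regular expression are hypothetical, it assumes # does not otherwise occur in the text, and page-break handling is left out because the post does not say how it is done.

```python
import re

# A word fragment followed by a hyphen at the very end of a line, e.g. "exam-".
# Requiring a word character before the hyphen skips lines ending in "---".
TRAILING_FRAGMENT = re.compile(r"(\S*\w)-\s*$")


def rejoin_hyphenated(lines, marker="#"):
    """Move a line-final 'frag-' to the start of the next line as 'frag#'."""
    out = []
    carry = ""  # fragment waiting to be prefixed to the next line
    for line in lines:
        if carry:
            line = carry + marker + line.lstrip()
            carry = ""
        match = TRAILING_FRAGMENT.search(line)
        if match:
            carry = match.group(1)
            line = line[:match.start()].rstrip()
        out.append(line)
    if carry:
        # The last line ended in a hyphen with nothing to join; restore it.
        out[-1] = (out[-1] + " " + carry + "-").strip()
    return out


def marked_words(lines, marker="#"):
    """List the rejoined words so each can be checked for a hard hyphen."""
    return [word for line in lines for word in line.split() if marker in word]


if __name__ == "__main__":
    sample = ["The quick brown fox exam-", "ined the well-", "known path."]
    joined = rejoin_hyphenated(sample)
    print("\n".join(joined))
    print(marked_words(joined))
```

After review, each # would be replaced with nothing for a soft hyphen (exam#ined) or with a real hyphen for a hard one (well#known).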

How do you solve this problem?