How do you deal with soft hyphens in OCR texts?
What is the best and most efficient way of rejoining words in an OCR-ed text which are split over two lines?
I use a sed script that moves the first part of each split word to the start of the next line and replaces the line-end hyphen with #. I then check the marked words for any that should keep a hard hyphen. Page breaks are handled.
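To make the approach concrete, here is a rough Python sketch of the same idea. It is not the sed script itself. The names `rejoin`, `marked_words` and `resolve` are made up for illustration, it assumes `#` does not otherwise occur in the text, and it does not deal with page breaks.

```python
import re

MARKER = "#"  # assumption: this character does not occur elsewhere in the text

# A non-space fragment ending in "-" at the end of a line, followed by a
# line that starts with a word character.
SPLIT_WORD = re.compile(r"[ \t]*(\S+)-[ \t]*\n[ \t]*(?=\w)")
# Any whitespace-delimited token that contains the marker.
MARKED = re.compile(r"\S*" + re.escape(MARKER) + r"\S*")


def rejoin(text):
    """Move each split fragment onto the next line, replacing "-" with MARKER."""
    return SPLIT_WORD.sub(lambda m: "\n" + m.group(1) + MARKER, text)


def marked_words(text):
    """Return the rejoined tokens so they can be reviewed for hard hyphens."""
    return MARKED.findall(text)


def resolve(text, hard_hyphenated=()):
    """Restore "-" in the tokens listed in hard_hyphenated; drop MARKER elsewhere."""
    keep = set(hard_hyphenated)

    def fix(match):
        word = match.group(0)
        replacement = "-" if word in keep else ""
        return word.replace(MARKER, replacement)

    return MARKED.sub(fix, text)


if __name__ == "__main__":
    page = "This is an exam-\nple of a well-\nknown problem.\n"
    joined = rejoin(page)
    print(marked_words(joined))             # tokens to review by hand
    print(resolve(joined, {"well#known"}))  # keep the hyphen in "well-known"
```

The line count stays the same because the fragment is moved rather than the lines being merged, which keeps the output easy to compare against the scanned page.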
How do you solve this problem?