06-18-2012, 03:24 AM   #13
SBT
I've wondered what the best way is to handle words split across lines when proofing OCR texts.
I use sed to pull both halves of each split word onto one line, and then do an interactive search-and-replace in an editor to remove the soft hyphens.
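As a rough illustration of that first step, here is a minimal sketch in Python rather than sed. It is not the actual sed command used here. The function name `pull_up_split_words` is made up for the example, and it assumes the OCR output marks a split with a plain "-" at the end of the line. The hyphen is deliberately kept, so that each case can still be checked by hand afterwards (some hyphens, as in "well-known", are real).

```python
import re

# Assumption: a split word ends its line with letter + "-" and continues
# at the start of the next line. "--" (an OCR'd dash) is not matched,
# because the character before the final hyphen must be a word character.
SPLIT_WORD = re.compile(r"(\w-)\n[ \t]*(\S+)[ \t]*\n?")

def pull_up_split_words(text: str) -> str:
    """Move the tail of each split word up to the line holding its head.

    The hyphen itself is left in place for later manual review.
    """
    return SPLIT_WORD.sub(r"\1\2\n", text)

sample = "The implemen-\ntation of well-\nknown ideas.\n"
print(pull_up_split_words(sample), end="")
# The implemen-tation
# of well-known
# ideas.
```

After this pass, every candidate shows up as a hyphen inside a single word on a single line, which is what makes the interactive search-and-replace practical.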
I also use sed to automatically detect chapter headings and any subtitles, page headers, page numbers, and paragraphs.
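The detection rules aren't spelled out above, so the following is only a guess at what such a pass might look like, again in Python rather than sed. The patterns, the labels, and the `running_header` parameter are all assumptions made for the example. Subtitles and paragraph boundaries are left out, since they depend heavily on the layout of the particular book.

```python
import re

# Assumption: chapter headings look like "Chapter 12" or "CHAPTER IV".
CHAPTER = re.compile(r"^chapter\s+([0-9]+|[ivxlcdm]+)\b", re.IGNORECASE)
# Assumption: a page number is a line holding nothing but 1 to 4 digits.
PAGE_NUMBER = re.compile(r"^\d{1,4}$")

def classify_line(line, running_header=None):
    """Return a rough label for one line of OCR text."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if PAGE_NUMBER.match(stripped):
        return "page_number"
    # Assumption: the page header is a known string (e.g. the book title)
    # repeated at the top of pages.
    if running_header and stripped.lower() == running_header.lower():
        return "page_header"
    if CHAPTER.match(stripped):
        return "chapter"
    return "text"

for line in ["CHAPTER IV", "  123 ", "MOBY DICK", "It was a dark night."]:
    print(classify_line(line, running_header="Moby Dick"))
# chapter
# page_number
# page_header
# text
```

Heuristics like these will misfire on some books (a text line that happens to be only a number, for instance), so the labels are best treated as suggestions to review rather than as final markup.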