View Single Post
Old 12-10-2011, 05:01 PM   #9
SBT
Fanatic
SBT ought to be getting tired of karma fortunes by now.SBT ought to be getting tired of karma fortunes by now.SBT ought to be getting tired of karma fortunes by now.SBT ought to be getting tired of karma fortunes by now.SBT ought to be getting tired of karma fortunes by now.SBT ought to be getting tired of karma fortunes by now.SBT ought to be getting tired of karma fortunes by now.SBT ought to be getting tired of karma fortunes by now.SBT ought to be getting tired of karma fortunes by now.SBT ought to be getting tired of karma fortunes by now.SBT ought to be getting tired of karma fortunes by now.
 
SBT's Avatar
 
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
Handling words split over lines

Words which are split over lines must be rejoined.
The following function handles this, and also words split over pages and by images.
The hyphen is replaced by '#-', then you can manually inspect which hyphens should be retained.

Code:
function removehyphens {
# Usage: removehyphens [inputfile.txt]
# takes words split over two lines and prepends it at the start of the 
# next text line. Replaces the hyphen with  '#-' for manual inspection
# Removes hyphens that are probably redundant, and confirms some which are
# correct.
# If no inputfile, input is read from STDIN.
# Output written to STDOUT
[ $1 ] && inputfile=$1 || inputfile="/dev/stdin"
awk '/^ *[a-z]/ {printf("%s",hyph);sub(/^ */,"");hyph="";}\
{if (/[a-z-]- *$/) {hyph=$NF;$NF="";sub(/- *$/,"#-",hyph)};\
print;}' $inputfile |\
sed -e s/"^\(..\)#-"/"\1"/ \
-e s/"#-\(ing\|ment\)"/"\1"/ \
-e s/"\(twenty\|thirty\|forty\|fifty\|sixty\|seventy\|eighty\|ninety\)#-"/"\1-"/
}

Last edited by SBT; 12-12-2011 at 03:21 PM. Reason: properly handle double hyphens at line ends
SBT is offline   Reply With Quote