MobileRead Forums

SBT · 12-10-2011, 05:01 PM

Words that are split across lines must be rejoined.
The following function handles this, including words split across page breaks or by images.
The hyphen is replaced by '#-' so that you can manually inspect which hyphens should be retained.

Code:

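function removehyphens {
# Usage: removehyphens [inputfile.txt]
# Takes the first half of a word split over two lines and prepends it to the
# start of the next text line (the next line that begins with a lowercase
# letter). Replaces the hyphen with '#-' for manual inspection.
# Removes hyphens that are probably redundant, and confirms some which are
# correct.
# If no inputfile is given, input is read from STDIN.
# Output is written to STDOUT.
# Note: \| in the sed patterns is a GNU sed extension.
inputfile="${1:-/dev/stdin}"
# awk: hold back a trailing hyphenated fragment and emit it, marked with '#-',
# in front of the next text line. END flushes a fragment left at end of input.
awk '/^ *[a-z]/ {printf("%s",hyph); sub(/^ */,""); hyph=""}
{if (/[a-z-]- *$/) {hyph=$NF; $NF=""; sub(/- *$/,"#-",hyph)}
print}
END {if (hyph != "") print hyph}' "$inputfile" |
# sed: join after a two-character fragment, join before -ing/-ment,
# and keep the hyphen in compound numbers such as twenty-five.
sed -e 's/^\(..\)#-/\1/' \
-e 's/#-\(ing\|ment\)/\1/' \
-e 's/\(twenty\|thirty\|forty\|fifty\|sixty\|seventy\|eighty\|ninety\)#-/\1-/'
}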
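
The '#-' markers that are left in the output still need the manual pass mentioned above. Below is a minimal sketch of two helpers for that step. The names listhyphens and drophyphens are hypothetical and not part of the function above, and grep -o is assumed to be available (GNU and BSD grep both provide it).

```shell
# Hypothetical helpers for the manual inspection pass (an assumption, not
# part of removehyphens itself).

# Count each distinct word that still carries a '#-' marker.
# Reads the named files, or STDIN if none are given.
listhyphens() {
    grep -o '[[:alpha:]]*#-[[:alpha:]]*' "$@" | sort | uniq -c
}

# Drop every remaining '#-' marker, joining the two halves of the word.
# Run this only after the hyphens worth keeping were changed from '#-' to '-'.
drophyphens() {
    sed 's/#-//g' "$@"
}
```

One possible workflow: save the output of removehyphens to a file, run listhyphens on it to see which words are affected, change '#-' to '-' by hand wherever the hyphen belongs to the word, then run drophyphens to join everything else.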