Old 06-18-2012, 08:38 AM   #18
SBT
Quote:
Originally Posted by Doitsu View Post
Could you please post your sed script(s)?
Righto..
First,
Page splits:
Code:
sed 's/\f/<!-- PAGEBREAK -->\n/g' in.txt >out.txt
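To see what the page-split step does, here is a minimal sketch on a made-up two-page sample. It assumes GNU sed, which understands \f in the pattern and \n in the replacement; the function name mark_pagebreaks and the sample text are only for illustration.

```shell
# Wrap the page-split command in a function that filters stdin.
# Assumes GNU sed: \f matches a form feed, \n in the replacement is a newline.
mark_pagebreaks() {
    sed 's/\f/<!-- PAGEBREAK -->\n/g'
}

# Made-up sample: a form feed sits in front of the first line of page 2.
printf 'last line of page 1\n\fRUNNING HEADER\nfirst line of page 2\n' | mark_pagebreaks
```

The form feed becomes a marker line of its own, and the old first line of the page follows directly after it, which is what the header command relies on.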
Page headers/footers: Very often the first line on a page is the page header, so surprisingly often this works:
Code:
sed '/PAGEBREAK/{n;s/.*/<!-- PAGEHEADER: & -->/}' in.txt >out.txt
For footers, I do this:
Code:
tac in.txt|sed '/PAGEBREAK/{n;s/.*/<!-- PAGEFOOTER: & -->/}'|tac >out.txt
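A small sketch of both commands on a made-up page boundary may help. The function names tag_headers and tag_footers and the sample lines are assumptions for illustration; tac is the GNU coreutils tool that prints lines in reverse order.

```shell
# Comment out the line that follows each page-break marker.
tag_headers() {
    sed '/PAGEBREAK/{n;s/.*/<!-- PAGEHEADER: & -->/}'
}

# Same idea for footers: reverse the text so the footer comes
# right after the marker, tag it, then reverse back.
tag_footers() {
    tac | sed '/PAGEBREAK/{n;s/.*/<!-- PAGEFOOTER: & -->/}' | tac
}

# Made-up sample: a page number as footer, a running title as header.
printf '%s\n' 'body text' '- 12 -' '<!-- PAGEBREAK -->' 'MOBY DICK' 'more body text' |
    tag_headers | tag_footers
```

The header and footer lines are kept as comments instead of being deleted, so a wrongly tagged line of body text can still be recovered by hand.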
Words split over lines:
Code:
sed '/-$/{h;s/.* //;s/$/#/;
x;
s/[^ ]\+$//;
:a;
n;
/^[a-z]/!b a;
H;x;
s/\n//}' file.txt >new-file.txt
Now every split word is put at the head of the following line, and the hyphen replaced by '-#'. Do an interactive search&replace '-#' -> '-' to keep those hyphens you want, and then do a batch search & replace to remove the rest.
The first part of a split word is prepended to the next line that starts with a lower-case letter, so it works across page breaks. It also means that extraneous leading or trailing blanks will cause problems, as will hyphenated capitalized names (e.g. Karl-Otto).
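Here is a hedged walk-through of the split-word script on a made-up sample where the word is split across a page break. GNU sed is assumed; the function names join_split_words and drop_remaining_markers are hypothetical, and the second function is just one way to do the final batch replace.

```shell
# The split-word script from above, reading stdin instead of a file.
join_split_words() {
    sed '/-$/{h;s/.* //;s/$/#/;
x;
s/[^ ]\+$//;
:a;
n;
/^[a-z]/!b a;
H;x;
s/\n//}'
}

# Batch step: after the wanted hyphens have been restored by hand
# ('-#' -> '-'), drop every marker that is still left.
drop_remaining_markers() {
    sed 's/-#//g'
}

# Made-up sample: "example" is split across a page break.
printf '%s\n' 'This is an exam-' '<!-- PAGEBREAK -->' 'ple of a split word.' |
    join_split_words
```

The third output line comes out as "exam-#ple of a split word.", and the first line keeps a trailing blank where the word fragment used to be. Piping the result through drop_remaining_markers then gives "example of a split word.".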

Next, chapters. First I do a grep to check that all the headings are present and correct. E.g. assuming chapter headings are "CHAPTER 1,2,3...."
Code:
grep CHAPTER file.txt
If chapters have a title on the following line, I inspect that:
Code:
grep -A 2 CHAPTER file.txt
That gives me the two following lines, so I can see whether the title is split over more than one line. If so, I manually concatenate the title onto one line.
Having satisfied myself that all the chapters are of the form
Quote:
CHAPTER 3
An old cliché

It was a dark and stormy night....
I execute the following abomination:
Code:
sed '/CHAPTER/{s/.*/<\/p>\n<hr class="endchapter"\/>\n<h2 class="chapter">&<\/h2>/;
n;
s/.*/<h2 class="chapter_title">&<\/h2>/;
n;n;
s/\([^A-Z]*[A-Z]\)\([^ ]*\)/<p class="initial"><span class="drop">\1<\/span><span class="first">\2<\/span>/}' file.txt> new_file.txt
This closes the last paragraph, appends a horizontal rule at the end of the preceding chapter, wraps the chapter number and title in suitable tags, and inserts a drop cap à la Jellby in the first paragraph.
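To make the result concrete, here is a sketch that runs the chapter command on a four-line sample in the layout quoted above. GNU sed is assumed, the function name markup_chapters is hypothetical, and the chapter title "The Storm" is made up.

```shell
# The chapter command from above (with the class attribute properly
# quoted), reading stdin instead of a file.
markup_chapters() {
    sed '/CHAPTER/{s/.*/<\/p>\n<hr class="endchapter"\/>\n<h2 class="chapter">&<\/h2>/;
n;
s/.*/<h2 class="chapter_title">&<\/h2>/;
n;n;
s/\([^A-Z]*[A-Z]\)\([^ ]*\)/<p class="initial"><span class="drop">\1<\/span><span class="first">\2<\/span>/}'
}

# Heading, title, blank line, first line of the chapter text.
printf '%s\n' 'CHAPTER 3' 'The Storm' '' 'It was a dark and stormy night....' |
    markup_chapters
```

On this sample the heading line turns into three lines (closing tag, rule, h2), and the first text line starts with the "I" in the drop span and the rest of the first word, "t", in the first span.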

Then you can do a
Code:
sed 's/^ \{3,12\}\([^ ]\)/<\/p>\n<p>\1/'
to convert text indents to paragraphs.
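A quick sketch of the indent conversion on made-up text, assuming GNU sed and a four-space paragraph indent; the function name indent_to_paragraphs is only for illustration.

```shell
# An indent of 3 to 12 blanks at the start of a line closes the
# previous paragraph and opens a new one.
indent_to_paragraphs() {
    sed 's/^ \{3,12\}\([^ ]\)/<\/p>\n<p>\1/'
}

printf '%s\n' '    First paragraph starts here' 'and continues on this line.' \
    '    Second paragraph.' | indent_to_paragraphs
```

Every indented line produces a closing tag followed by an opening one. The very first closing tag has no partner in this small sample; in the full text it closes the paragraph opened by the chapter markup.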

If your chapter heading instead is just a centered roman numeral, you can use /^ \{20,\}[IVX]\+\.$/ instead of /CHAPTER/.
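The same pattern can also be handed to grep for the heading check, before any markup is done. A minimal sketch, assuming GNU grep (for \+) and made-up sample lines; list_roman_headings is a hypothetical name.

```shell
# List lines that consist of at least 20 blanks, a roman numeral
# built from I, V and X, and a closing period.
list_roman_headings() {
    grep '^ \{20,\}[IVX]\+\.$'
}

{
    printf '%24s\n' 'XIV.'     # 20 blanks + "XIV.": a centered heading
    printf '%s\n' '    IV. is not a heading, the indent is too small'
    printf '%s\n' 'I went home.'
} | list_roman_headings
```

Only the centered numeral is listed. The 20-blank threshold is what keeps ordinary sentences starting with "I" or "V" out, so it may need adjusting to the width of the scanned text.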

And for my final trick, I'll split this file into one xhtml file for each chapter:
Code:
 sed -e '/<h2 class="chapter"/i</body>\n</html>\n<html>\n<body>'  -e '1i<html>\n<body>' -e '$a</body>\n</html>' file.txt|\
csplit -f epub/OEBPS/ -b "%2.2d.xhtml" - '/<html>/' '{*}' \
 && rm epub/OEBPS/00.xhtml
This'll put the files 00.xhtml, 01.xhtml, ... in the directory epub/OEBPS.
00.xhtml is an empty file, and is therefore removed.
In real life, you'll want a proper header for your xhtml files, but I figured that putting in a proper header wouldn't improve the readability...
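For what it's worth, a proper header could look something like the sketch below. This is an assumption, not the header actually used here: it supposes the target is EPUB 2 with XHTML 1.1 content documents, and the function names xhtml_header and xhtml_footer are made up.

```shell
# Print an XHTML 1.1 prologue; $1 is the page title.
# Assumption: EPUB 2 style content documents.
xhtml_header() {
    cat <<EOF
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>$1</title>
</head>
<body>
EOF
}

# Matching closing tags.
xhtml_footer() {
    printf '%s\n' '</body>' '</html>'
}

xhtml_header 'Chapter 1'
xhtml_footer
```

The two functions would take the place of the bare html and body tags inserted by the sed command above.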

My own workflow differs slightly from this. These commands have been made into shell functions, and chapter headings etc. are converted into an intermediate, self-composed markup that doesn't ruin the readability of the text in the same way as html.
I also do some contorted sed&awk expressions to handle footnotes and images.

There you basically have it. Working out how the commands work is left as an exercise to the reader. Not fully automatic, but not many repetitive operations either, all done using computer stone-age tools. Try to do this with yer fancy new-fangled colour-coded GOEEY interfaces!