View Full Version : Epub creation in unix shell


SBT
12-02-2011, 04:28 PM
Being rather old-fashioned, and convinced that the current obsession with graphical user interfaces is just a passing fad, I make my epub-formatted books within the comfort of a unix bash shell. During their production I’ve gradually developed a handful of tricks and functions to speed things up, which may possibly appeal to others, though I've very probably reinvented a few wheels along the way. :o
Roughly speaking, my epub-making recipe is as follows:

Extract text from djvu file & handle formatting characters
Identify and possibly correct obvious OCR errors
Extract each page as image. Create html file with image & text side by side.
Edit this in LibreOffice; proofread, mark headings, footnotes etc.
Remove page images from file, handle book structure
Handle footnotes.
Handle words split over lines.
Generate epub file/directory structure
Split html file into chapters.
Generate toc & manifest.
zip & verify

The main audience for this post is, I suppose, people who, like me, like to tinker, poke, and generally mess around with their e-books to get them just so, but the bits which partly automate the drudgery of proofreading and editing may prove of interest to those who are (very understandably!) satisfied with Calibre and Sigil.
I thought it would be nice to create a thread to present the details; then I can present the recipe one step at a time, and with a bit of luck someone will point out how I could solve the various problems even more efficiently. The thread can then possibly be used as a source for making a nice HowTo.

The book I’m currently working on is Elisha K. Kane: The Second Grinnell Expedition, Vol. II (source: Internet Archive), so I’ll use that as a case study.
Off we go: First, extract text from the djvu-file.
Required tools: DjvuLibre.
(To use this code snippet, save it as a file, e.g. ‘epubtools.sh’. Type ‘source epubtools.sh’, and you can use ‘extracttext <djvufile>’ like any other command.)
function extracttext {
# Usage: extracttext <filename.djvu>. Outputs textfile to filename.txt
n=$(djvused $1 -e 'n') # Find total page count
f=${1%.djvu}.txt # Output file name
rm -i $f # Interactively delete existing output file
for x in $(seq $n) # foreach page
do
echo "%P $x" >> $f # write %P <pageno> to file before page content
# Get page, replace vertical tab -> %K,
# unit + group separators -> %L, single unit separator -> space,
# remove form feed, drop last line.
# Replace unit + group sep. + multiple vert. tabs with %i <pageno>,
# indicates image caption. Remove empty lines/extraneous format chars.
# Prepend %p to first line, indicating page header.
# Prepend %n to footer page number/ volume indicator.
djvutxt --pages=$x $1 |\
sed -e s/"^K"/"%K"/g \
-e s/"^_^]"/"%L"/g \
-e s/"^_"/" "/g \
-e s/"^L"/""/g \
-e \$d |\
sed -e s/"^%L\(%K\)\{2,22\}"/"%i $x "/ \
-e s/"%L%K"// \
-e s/"^[ ]*\([0-9]\{2,3\}\|[Vv][oO0][lL1].*\) *$"/"%n &"/ \
-e 1s/"^"/"%p "/ \
-e /"^$"/d \
>> $f
printf "Page: %3d/%d\r" $x $n # send progress status to STDOUT
done
echo
}


The observant reader may wonder why I extract the text page by page and don’t simply dump the entire text file at once with djvutxt *.djvu? The reason is that djvutxt doesn’t produce page breaks (\f) for blank pages, and I wish to keep a record of the djvu page number.
Djvu uses vertical tab, unit/group separator, and form feed control characters; these are transcribed to readable chars and interpreted as indicated in the script.
Why not insert html codes like <p> instead of tabs? Because while proofreading, I like to keep the file as close to pure text as possible. Instead, I use home-brewed mnemonics (I sense people shuddering), %P for page breaks, %p for page headings, etc. This is fairly unobtrusive and easy to filter with various tools.
This is not a universal tool for handling djvu-files; if for example there are no page headers, or if there are footers instead, it won’t work well. However, I suspect it’s simpler to learn sufficient shell scripting to modify the script to match different book formats than to learn how to use a monstrous everything-to-all-books function with umpteen settings and options.
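As an illustration, the footer rule (the %n substitution) can be tried on its own. A quick sketch, assuming GNU sed (the \| alternation is a GNU extension):

```shell
# Mark a bare page number (2-3 digits) or a "VOL." line as a footer with %n.
# The & in the replacement stands for the entire matched line.
echo '142' | sed 's/^[ ]*\([0-9]\{2,3\}\|[Vv][oO0][lL1].*\) *$/%n &/'
# A line of running text is left untouched:
echo 'It was a dark and stormy night.' | sed 's/^[ ]*\([0-9]\{2,3\}\|[Vv][oO0][lL1].*\) *$/%n &/'
```

The character class [oO0][lL1] is there because OCR often reads O as 0 and l as 1 in "VOL.".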
Next time: Identifying and correcting OCR errors.

SBT
12-03-2011, 03:52 PM
OCR does stumble here and there. So does human proofreading, it’s easy to miss the occasional mispnirt... However, with some careful thought, it is possible to construct search patterns which identify a decent proportion of them.
The first category of errors is the one that can be automatically corrected. Spaces before punctuation like ;:,?! can be safely removed, as can spaces after quotation marks at the start of a line or before them at the end of a line. Likewise ‘ tlie’ can confidently be replaced with ‘ the’, and ‘ m ‘ with ‘ in ‘.

function tlie_m_punctuationclean {
# Usage: tlie_m_punctuationclean <text file>.
# Autocorrects in-place some OCR errors.
sed -i -e s/" *\([:?!;,]\)"/"\1"/g \
-e s/"\(^ *\| \)\" \+"/"\1\""/ \
-e s/" \+\" \+$"/"\""/ \
-e s/"\([ ][Tt]\|^[Tt]\)lie"/"\1he"/ \
-e s/" m "/" in "/ \
$*
sed -i -e s/"\"'"/"\"\ '"/g \
-e s/"'\""/"'\ \""/g \
$*
}
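To see the first two rules in action on a made-up line (a sketch assuming GNU sed; note the character class here includes the comma mentioned above):

```shell
# Strip spaces before sentence punctuation, then fix 'tlie' -> 'the'.
echo 'Hello , world ; tlie end !' |
  sed -e 's/ *\([:?!;,]\)/\1/g' \
      -e 's/\([ ][Tt]\|^[Tt]\)lie/\1he/'
```

The first group in the second pattern keeps the leading space or start-of-line while correcting only the "lie" part.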

The second category consists of evident errors for which the correct version is not self-evident. Capital letters immediately after lower case, numbers following letters, symbols embedded in letters, and q not followed by u are typical. The following function prepends a '~' (tilde) to words which contain any such combination. This complements the hat, '^', which is used by many OCR programs to indicate failure to interpret. So afterwards, you have to search for ~'s and ^'s.

function marksuspects {
# Usage: marksuspects <text file>. Prepends a ~ in front of words that need
# correction. Edits in-place.
sed -i s/"\([^ ]*\)\([a-z][A-Z0-9]\|[A-Za-z][(){\[\]}.,;:?!][A-Za-z]\|q[^u]\)"/"~\1\2"/g $1
}
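A sketch of the marking pattern on a sample line (GNU sed again, for the \| alternation; only the two most common sub-patterns are shown):

```shell
# 'teXt' (capital after lower case) and 'qite' (q not followed by u)
# both get a ~ prepended; clean words pass through untouched.
echo 'the teXt was qite garbled' |
  sed 's/\([^ ]*\)\([a-z][A-Z0-9]\|q[^u]\)/~\1\2/g'
```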

The search patterns in these functions can also be used in editors which support regular expressions; Sigil and LibreOffice do.

Next: Combining page scans with page text.

DiapDealer
12-03-2011, 05:04 PM
as can spaces after quotation marks at the start of a line or before them at the end of a line.
“ ‘ Dang!’ the fellow said, ‘I've always put spaces in between adjacent double and single curly-quotes in my text so they’re easier to distinguish’ he continued, ‘Wouldn’t your regex eliminate that space?’ ”

:D

Jellby
12-04-2011, 06:03 AM
“ ‘ Dang!’ the fellow said, ‘I've always put spaces in between adjacent double and single curly-quotes in my text so they’re easier to distinguish’ he continued, ‘Wouldn’t your regex eliminate that space?’ ”

I always try to remove those spaces ;) I think that should be handled by font kerning (and I've modified my preferred font to add kerning pairs between quotes). In any case, those spaces should be non-breaking and thin if possible (&#8239;)

DiapDealer
12-04-2011, 08:42 AM
In any case, those spaces should be non-breaking and thin if possible (&#8239;)
Ah... good point. Thanks!

SBT
12-04-2011, 05:12 PM
Thanks for the feedback, Jellby and DiapDealer. I’ve updated the function accordingly. Good proofreading patterns are probably worth a thread of their own.

Anyhow, on to today's task:
When proofreading an OCR text, it’s a necessity to have the scanned page images side by side with the text. Of course you can open the djvu/pdf file in a viewer and the text in a separate editor, but it is a trifle tiresome to hop back and forth between them to synchronize page viewing. In a previous post (http://www.mobileread.com/forums/showthread.php?p=1533286#post1533286) I presented a script to combine the images and text in an HTML table, which then could be imported into LibreOffice and edited there.
A slightly revised version is shown below.
First, a directory is filled with the page images extracted from the djvu-file. As this is a time-consuming operation, it is delegated to a separate function. This version assumes the book has fewer than 1000 pages. It also scales down each image and clips it. The clipping is probably book-dependent; it is probably possible to extract the coordinates from the djvu-file, but finding out how is on the TODO list.
Required tools: netpbm and cjpeg.

function extractpagescans {
# Usage: extractpagescans <djvufile>.
# Creates a jpeg-file of each page, and stores it in directory "pages"
mkdir pages
n=$(djvused $1 -e 'n')
for x in $(seq $n)
do
ddjvu -format=ppm -page=$x -segment=1700x2850+200+200 $1 - |\
pnmscale 0.5 |\
cjpeg -quality 35 -smooth 50 -scale "1/2" -optimize > $(printf "pages/%3.3d.jpg" $x)
echo $x
done
}
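The %3.3d printf format is what keeps the image file names zero-padded and sortable (and is the reason for the under-1000-pages assumption); a quick check:

```shell
# Zero-pad page numbers to three digits for the pages/ directory.
printf 'pages/%3.3d.jpg\n' 7      # pages/007.jpg
printf 'pages/%3.3d.jpg\n' 123    # pages/123.jpg
```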

Then the HTML file is constructed. The text on each page is enclosed between <pre></pre> tags to preserve line breaks and other formatting.

function makeproofreadhtml {
#Usage: makeproofreadhtml <textfile>
#Creates an html file with a two-column table: page scans to the left,
#OCR text to the right.
imgdir=pages
awk -v img="$(basename $imgdir)" '
BEGIN {
# Use the %P page marker as record separator, so each record is one page
RS="%P"
charset="utf8"
# html header
print "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">\n\
<html xmlns=\"http://www.w3.org/1999/xhtml\">\n\
<head>\n\
<meta http-equiv=\"content-type\" content=\"text/html; charset="charset"\" />\n\
</head>\n\
<body>\n\
<table>"
}

{
# substitute &, <, > characters with html entities
# (in gsub's replacement a bare & means "the matched text", hence the \\&)
gsub("&","\\&amp;")
gsub("<","\\&lt;")
gsub(">","\\&gt;")
# add scan image in left column
print "<tr><td>"
printf("<img width=\"500\" src=\"%s/%3.3d.jpg\">", img, NR-1)
print "</td>"

# add text as preformatted text, preserving line breaks etc., in right column
print "<td>"
# tag the djvu file page number (at the start of each record) with %G
print "<pre>"
printf("%%G ")
print
print "</pre>"
print "</td></tr>"
}

END {
# wrap up html file
print "</table>"
print "</body>"
print "</html>"
}' $1\
> ${1%txt}html # output to html file
}
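A subtlety worth knowing when escaping HTML characters with awk's gsub: a bare & in the replacement string stands for the matched text, so a literal ampersand must be written \\& inside an awk string literal. A minimal check:

```shell
# Escape &, < and > for HTML; the \\& produces a literal & in the output.
echo 'Fish & <chips>' |
  awk '{gsub("&","\\&amp;"); gsub("<","\\&lt;"); gsub(">","\\&gt;"); print}'
```

The & pass must run first, or the ampersands introduced by the other two substitutions would be escaped a second time.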


The HTML file can then be read by LibreOffice, and you can start correcting all those ~’s.
I also change the font to italic and bold where indicated in the scans while in LibreOffice, but for other types of formatting I prefer to use %-type mnemonics.

Next: Extracting the text from this HTML-file, handling mnemonics and footnotes, and producing an XHTML-compliant file.

opitzs
12-10-2011, 12:50 AM
SBT, let me please thank you, this is a very instructive post and now I need to sit down...

SBT
12-10-2011, 02:32 PM
@opitz: Thanks for your kind words; any kind of feedback is welcome.

When you've finished proofreading in LibreOffice, or just want to return to editing in a pure text editor, you can use the following function, which is the reverse of makeproofreadhtml (which I think I'll rename txt2proof, so there'll be some consistency).

I thought it would be a good idea to read input from STDIN if no filename is given; I'll probably add that functionality to all functions where appropriate.

function proof2txt {
# Usage: proof2txt [inputfile.html].
# If no inputfile, input is read from STDIN.
# Output written to STDOUT
[ $1 ] && inputfile=$1 || inputfile="/dev/stdin"
# Handle text marked as italic/bold.
# LibreOffice inserts </I> and <I> (ditto for bold) at the end and beginning
# of italic sections that span several lines; join those first.
# Enclosing <..> tags are replaced by html-encoded < & > for italics/bold.
sed -n '1h;1!H;${g;s/<\/I>\n<I>/\n/g;s/<\/B>\n<B>/\n/g;p;}' $inputfile |\
sed s/'<\(\/\?[BI]\)>'/'\&lt;\1\&gt;'/g |\
lynx -dump -stdin |\
grep -v "^ \[[0-9]\{3,3\}\.jpg\]"
}
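The 1h;1!H;${g;...;p;} idiom slurps the whole file into sed's hold space so that the substitution can match across a line break; note that it needs the -n flag, or every line is printed twice. A small sketch of the italics join:

```shell
# </I> at the end of one line and <I> at the start of the next cancel out,
# leaving a plain line break inside one italic section.
printf 'He said</I>\n<I>quietly\n' |
  sed -n '1h;1!H;${g;s/<\/I>\n<I>/\n/g;p;}'
```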

SBT
12-10-2011, 06:01 PM
Words which are split over lines must be rejoined.
The following function handles this, and also words split over pages and by images.
The hyphen is replaced by '#-', so that you can manually inspect which hyphens should be retained.

function removehyphens {
# Usage: removehyphens [inputfile.txt]
# Takes a word split over two lines and prepends its first half to the
# start of the next text line. Replaces the hyphen with '#-' for manual inspection.
# Removes hyphens that are probably redundant, and confirms some which are
# correct.
# If no inputfile, input is read from STDIN.
# Output written to STDOUT
[ $1 ] && inputfile=$1 || inputfile="/dev/stdin"
awk '/^ *[a-z]/ {printf("%s",hyph);sub(/^ */,"");hyph="";}\
{if (/[a-z-]- *$/) {hyph=$NF;$NF="";sub(/- *$/,"#-",hyph)};\
print;}' $inputfile |\
sed -e s/"^\(..\)#-"/"\1"/ \
-e s/"#-\(ing\|ment\)"/"\1"/ \
-e s/"\(twenty\|thirty\|forty\|fifty\|sixty\|seventy\|eighty\|ninety\)#-"/"\1-"/
}
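A sketch of the whole pipeline on a two-line sample, with the function's awk and sed inlined (GNU sed assumed for the \| alternation):

```shell
# 'runn-' + 'ing' is rejoined at the start of the next line;
# the '#-ing' rule then removes the marker automatically.
printf 'it was runn-\ning fast\n' |
  awk '/^ *[a-z]/ {printf("%s",hyph);sub(/^ */,"");hyph="";}
       {if (/[a-z-]- *$/) {hyph=$NF;$NF="";sub(/- *$/,"#-",hyph)};
        print;}' |
  sed -e 's/^\(..\)#-/\1/' -e 's/#-\(ing\|ment\)/\1/'
```

The second output line comes out as "running fast"; a hyphen the rules cannot decide on stays marked with '#-' for manual inspection.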

SBT
12-11-2011, 04:09 PM
I'm too lazy to write out html tags unless I absolutely have to. Therefore I use tags to indicate document properties, and convert them to html at the end of the formatting process. The general rule is that tags consist of % at the beginning of the line, followed by a single character.

%c - chapter heading
%e - chapter end
%P <djvu file page number> - page separator
%p - page header
%w - page footer
%i <file page no.> <caption> - image
%y - First paragraph in chapter
%f - footnote
tab/8 spaces at start of line - new paragraph.
Footnote references are indicated by @ followed by an optional index number,
and can occur anywhere on the line.
I introduce other tags as I need them, so I've also used tags for subtitles, sections, vertical spacing, horizontal separators, chapter introductions etc.

Why have an end chapter tag as well as a begin chapter tag?
In most cases, this is superfluous, but in some cases there can be text or pictures before the chapter heading proper.

Here's a sample tagged document:

%P 1
MY GREAT NOVEL
BY
Long-forgotten author
%P 2
dedications, contents and stuff

%e
%P 2
%p INTRODUCTION 1

%c CHAPTER I

%y It was a dark and stormy night.
Suddenly, a voice cried out.

%i 2 A stormy illustration

%P 3
%p INTRODUCTION 2
Why this voice@ cried out,
nobody could adequately explain then and there.

%f Though it was generally agreed to
be a female voice.

%P 4
%p INTRODUCTION 3
Thus the setting for this novel should
have been set.
%e
%P 5

%c CHAPTER II
%y A glorious morning spread happiness and joy...

SBT
12-12-2011, 06:27 PM
And for my next trick:
Take footnote references and footnotes, indicated by @'s and %f's respectively, and replace them with properly referenced and back-referencing endnotes at the end of the chapter.

function zx_footnotes {
# Usage: zx_footnotes [text file]
# All @'s are replaced by links to corresponding footnote.
# All footnotes indicated by %f are converted to end-notes at the end of
# the chapter.
# Footnotes which span more than one page must be collected on a single page.
# The end-notes have links back to the original reference.
# The @'s can have a number appended to them for control purposes, but
# the numbers are not used by this function.
# If no input file is given; input is read from STDIN.
# Output is to STDOUT
[ $1 ] && inputfile=$1 || inputfile="/dev/stdin"
awk '
BEGIN {n=0;r=1;cn=1}
/@/ {sub(/@[0-9]*/, sprintf("<a name=\"R%2.2d_%3.3d\"/><a href=\"#F%2.2d_%3.3d\" class=\"footnote\">%d)</a>",cn,r,cn,r,r));r++}
/^%f/ {fn=1;n++;sub("%f","")}
/^%[eP]/ {fn=0}
/^%e/ { if (n>0) {
print "<h3 class=\"footnoteheader\">Footnotes</h3>"
print "<dl class=\"footnotelist\">"
for (i=1; i<=n;i++) {
printf(" <dt><a name=\"F%2.2d_%3.3d\"/><a href=\"#R%2.2d_%3.3d\">%d)</a></dt>",cn,i,cn,i,i)
print "<dd>",fns[i],"</dd>"
}
print "</dl>";
n=0;r=1;cn++;
delete fns;
}
$0="<hr class=\"endchapter\" />";
}
{if (fn>0) {fns[n]=fns[n]$0} else print}
' $inputfile
}

No more than 999 footnotes per chapter or 99 chapters, please.
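Those limits come straight from the %2.2d/%3.3d formats used for the anchor names, e.g.:

```shell
# Footnote 12 in chapter 3 gets the anchor name F03_012,
# hence the 99-chapter and 999-footnote ceilings.
awk 'BEGIN{printf("F%2.2d_%3.3d\n",3,12)}'
```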

Why the prefix zx_ in zx_footnotes?

It's probably a good idea to have some kind of common, fairly unique prefix to all of the functions listed here. Makes tab-completion simpler, for one thing.
zx is easy to type.
If I wanted to have everything lucid and self-explanatory, I wouldn't be using bash, sed, and awk in the first place, would I? ;)

SBT
12-13-2011, 02:02 PM
Time to convert our tagged file to xhtml:

function zx_txt2xhtml {
# Usage: zx_txt2xhtml [textfile]
# Converts a text-file with %-type tags to an xhtml file.
# The file should be run through html tidy afterwards.
# If no input file is given; input is read from STDIN.
# Output is to STDOUT
#-e '/^%q/,/^%[^q]/{s/^%q[ \t]\+/<div class="intro">/;s/%[^q]/<\/div>\n&/}' |\
[ $1 ] && inputfile=$1 || inputfile="/dev/stdin"
cat $inputfile |\
sed -e s/"^%c \(.*\)"/"<\/p>\n<hr class=\"endchapter\"\/>\n\n<h2 class=\"chapter\">\1<\/h2>"/ |\
sed -e s/"^%y[ ]\+\([^A-Z0-9]*[A-Z0-9]\)\([^ ]*\)"/"<p class=\"initial\"><span class=\"drop\">\1<\/span><span class=\"first\">\2<\/span>"/ \
-e s/"^\( \{6,8\}\|\t\)"/"<\/p>\n<p>"/ \
-e s/"#-"/"-"/g \
-e /"^%[pPiw].*"/s/".*"/"<!-- & -->"/ |\
sed /"^$"/d |\
sed -e s/"<span class=\"drop\">\(.*\)\([AL]\)<\/span><span class=\"first\">"/"<span class=\"drop\">\1\2<\/span><span class=\"after\2\">"/ \
-e 1i'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\
<html xmlns="http://www.w3.org/1999/xhtml">\
<head> \
<meta http-equiv="Content-Type" content="text/html; charset=utf8" /> \
<title></title> \n\
<link href="main.css" rel="stylesheet" type="text/css" /> </head> \
<body>' \
-e \$a"</body>\n</html>"
}
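For example, the chapter-heading rule turns a %c line into a closing paragraph, an end-of-chapter rule, and an h2 heading (GNU sed, which interprets \n in the replacement as a newline):

```shell
# A %c line becomes </p>, an <hr>, a blank line, and the <h2> heading.
echo '%c CHAPTER I' |
  sed 's/^%c \(.*\)/<\/p>\n<hr class="endchapter"\/>\n\n<h2 class="chapter">\1<\/h2>/'
```

The stray </p> at the top of the file (before any <p> is open) is one of the things the subsequent tidy pass cleans up.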

I suggest the following tidy command:
tidy -asxhtml -utf8
Personally I like to also use the -e option, and correct errors by hand. I don't trust tidy to not be overly enthusiastic in its tidiness.
All the %-tags which are not converted to html tags are enclosed in comments. No need to remove information unless you have to.

At this point we should have a nice, well-formatted xhtml file, all ready to be fed into Sigil or Calibre. Or about a dozen other epub creation tools.
Or we can bloody-mindedly finish as we started, and just make a few more bash functions to arrive at a complete epub file...