MobileRead Forums - View Single Post

SBT · 12-02-2011, 03:28 PM

Being rather old-fashioned, and convinced that the current obsession with graphical user interfaces is just a passing fad, I make my epub-formatted books within the comfort of a unix bash shell. During their production, I’ve gradually developed a handful of tricks and functions to speed things up. These may appeal to others, though I've very probably reinvented a few wheels along the way.

Roughly speaking, my epub-making recipe is as follows:

Extract text from djvu file & handle formatting characters
Identify and possibly correct obvious OCR errors
Extract each page as image. Create html file with image & text side by side.
Edit this in LibreOffice; proofread, mark headings, footnotes etc.
Remove page images from file, handle book structure
Handle footnotes.
Handle words split over lines.
Generate epub file/directory structure
Split html file into chapters.
Generate toc & manifest.
zip & verify

The main audience for this post is, I suppose, people who like me also like to tinker, poke, and generally mess around with their e-books to get them just so, but the bits which partly automate the drudgery of proofreading and editing may prove of interest to those who are (very understandably!) satisfied with Calibre and Sigil.
I thought it would be nice to create a thread to present the details; then I can present the recipe one step at a time, and with a bit of luck someone will point out how I could solve the various problems even more efficiently. The thread can then possibly be used as a source for making a nice HowTo.

The book I’m currently working on is Elisha K. Kane: The Second Grinnell Expedition, Vol. II (source: Internet Archive), so I’ll use that as a case study.
Off we go: First, extract text from the djvu-file.
Required tools: DjvuLibre.
(To use this code snippet, save it as a file, e.g. ‘epubtools.sh’. Type ‘source epubtools.sh’, and you can use ‘extracttext <djvufile>’ like any other command.)

Code:

function extracttext {
    # Usage: extracttext <filename.djvu>. Outputs text to filename.txt
    local n f x
    n=$(djvused "$1" -e 'n')    # Find total number of pages
    f=${1%.djvu}.txt            # Output file name
    [ -e "$f" ] && rm -i "$f"   # Interactively delete existing output file
    for x in $(seq "$n")        # for each page
    do
        echo "%P $x" >> "$f"    # write %P <pageno> to file before page content
        # First sed (control characters, given here in caret notation):
        #    vertical tab (^K) -> %K,
        #    unit + group separators (^_^]) -> %L,
        #    remaining unit separators (^_) -> tab,
        #    remove form feed (^L), drop last line.
        # Second sed:
        #    %L + multiple %K at line start -> %i <pageno>, indicates image caption.
        #    Remove extraneous %L%K.
        #    Prepend %n to footer page number / volume indicator.
        #    Prepend %p to first line, indicating page header.
        #    Remove empty lines.
        # Note: \| in the %n pattern is a GNU sed extension.
        djvutxt --pages="$x" "$1" |
        sed -e $'s/\v/%K/g' \
            -e $'s/\x1f\x1d/%L/g' \
            -e $'s/\x1f/\t/g' \
            -e $'s/\f//g' \
            -e '$d' |
        sed -e "s/^%L\(%K\)\{2,22\}/%i $x /" \
            -e 's/%L%K//' \
            -e 's/^[[:blank:]]*\([0-9]\{2,3\}\|[Vv][oO0][lL1].*\) *$/%n &/' \
            -e '1s/^/%p /' \
            -e '/^$/d' \
            >> "$f"
        printf "Page: %3d/%d\r" "$x" "$n" # send progress status to STDOUT
    done
    echo
}

The observant reader may wonder why I extract the text page by page instead of simply dumping the entire text at once with djvutxt *.djvu. The reason is that djvutxt doesn’t produce page breaks (\f) for blank pages, and I wish to keep a record of the djvu page number.
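As a rough illustration of the page-by-page approach, the sketch below lists the djvu pages that yield no visible text. The function name blankpages is made up for this example; the djvused and djvutxt calls are the same ones used in extracttext, and I assume a page counts as blank when nothing but whitespace comes back.

```shell
# List the djvu page numbers for which djvutxt returns no visible text.
# $1 = djvu file. (blankpages is a hypothetical helper, not part of DjvuLibre.)
blankpages() {
    n=$(djvused "$1" -e 'n')    # total number of pages
    x=1
    while [ "$x" -le "$n" ]
    do
        # Delete all whitespace (including form feed); nothing left = blank page
        if [ -z "$(djvutxt --pages="$x" "$1" | tr -d '[:space:]')" ]
        then
            echo "$x"
        fi
        x=$((x + 1))
    done
}
```

Typing ‘blankpages <djvufile>’ after sourcing the file should print one page number per line.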
The djvu text output uses vertical tab, unit separator, group separator, and form feed control characters; the script transcribes these to readable characters and interprets them as indicated in its comments.
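To try the transcription without a djvu file at hand, here is a small sketch that runs a made-up line through the same kind of substitutions. The function name showctrl and the sample string are my own inventions for illustration; real djvutxt output will look different.

```shell
# Transcribe control characters to readable mnemonics (illustrative only).
showctrl() {
    vt=$(printf '\013')     # vertical tab,    ^K
    gs=$(printf '\035')     # group separator, ^]
    us=$(printf '\037')     # unit separator,  ^_
    ff=$(printf '\014')     # form feed,       ^L
    tab=$(printf '\t')
    sed -e "s/$vt/%K/g" \
        -e "s/$us$gs/%L/g" \
        -e "s/$us/$tab/g" \
        -e "s/$ff//g"
}

# Hypothetical sample: text, unit + group separator, vertical tab, text, form feed
printf 'Some text\037\035\013more\014\n' | showctrl
```

The sample line should come out as ‘Some text%L%Kmore’. The control characters are built with printf rather than typed in directly, so the snippet survives copying and pasting.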
Why not insert html codes like <p> instead of tabs? Because while proofreading, I like to keep the file as close to pure text as possible. Instead, I use home-brewed mnemonics (I sense people shuddering), %P for page breaks, %p for page headings, etc. This is fairly unobtrusive and easy to filter with various tools.
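As an example of how easily the mnemonics can be filtered, the following sketch prints the body text of a single djvu page from the marked-up file, skipping the header, footer, and caption lines. The name showpage is hypothetical, and I assume the file was produced by extracttext, so every page starts with a ‘%P <pageno>’ line.

```shell
# Print the body text of one djvu page from a marked-up text file.
# $1 = page number, $2 = text file produced by extracttext
showpage() {
    awk -v p="$1" '
        $1 == "%P" { inpage = ($2 == p); next }  # page marker: are we on page p?
        inpage && !/^%[pni]/ { print }           # skip %p, %n and %i lines
    ' "$2"
}
```

With this sourced, ‘showpage 12 <textfile>’ shows what is left of page 12 once the markers are gone.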
This is not a universal tool for handling djvu-files; if for example there are no page headers, or if there are footers instead, it won’t work well. However, I suspect it’s simpler to learn sufficient shell scripting to modify the script to match different book formats than to learn how to use a monstrous everything-to-all-books function with umpteen settings and options.
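To give an idea of how small such a modification can be, here is a sketch for a book whose running title sits at the bottom of the page. It marks the last non-empty line with %p instead of the first. The function markfooter is purely hypothetical and works on plain text from standard input; it is not a drop-in replacement for the sed chain above, which would also need its ‘drop last line’ step reconsidered.

```shell
# Variant for books with the running title at the bottom of the page:
# remove empty lines, then prepend %p to the last remaining line.
markfooter() {
    sed -e '/^$/d' | sed -e '$s/^/%p /'
}

# Made-up page text: one body line, blank lines, and a footer title
printf 'body text\n\nRUNNING TITLE\n\n' | markfooter
```

The only real change from the header case is the sed address: ‘$’ (last line) instead of ‘1’ (first line).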
Next time: Identifying and correcting OCR errors.

Thread: Epub creation in unix shell (post #1)