12-02-2011, 03:28 PM | #1 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
Epub creation in unix shell
Being rather old-fashioned, and convinced that the current obsession with graphical user interfaces is just a passing fad, I make my epub-formatted books within the comfort of a unix bash shell. During their production, I've gradually developed a handful of tricks and functions to speed things up that may appeal to others, though I've very probably reinvented a few wheels along the way.
Roughly speaking, my epub-making recipe is as follows: extract the text from the djvu scan, find and correct OCR errors, proofread the text against the page scans, mark it up with lightweight tags (footnotes included), convert to xhtml, and finally package everything as an epub.
I thought it would be nice to create a thread to present the details; then I can present the recipe one step at a time, and with a bit of luck someone will point out how I could solve the various problems even more efficiently. The thread can then possibly be used as a source for making a nice HowTo. The book I'm currently working on is Elisha K. Kane: The Second Grinnell Expedition, Vol. II (source: Internet Archive), so I'll use that as a case study. Off we go: First, extract text from the djvu-file. Required tools: DjVuLibre. (To use this code snippet, save it as a file, e.g. 'epubtools.sh'. Type 'source epubtools.sh', and you can use 'extracttext <djvufile>' like any other command.) Code:
function extracttext {
# Usage: extracttext <filename.djvu>. Outputs text file to filename.txt
# NB: ^K, ^_, ^], ^L below are literal control characters
# (type them with Ctrl-V Ctrl-K, etc.)
  n=$(djvused $1 -e 'n')            # Find total number of pages
  f=${1%.djvu}.txt                  # Output file name
  rm -i $f                          # Interactively delete existing output file
  for x in $(seq $n)                # for each page
  do
    echo "%P $x" >> $f              # write %P <pageno> to file before page content
    # Get page; replace vertical tab -> %K, unit separator -> whitespace,
    # unit + group separators -> %L; remove form feed; drop last line.
    # Replace unit + group sep. + multiple vert. tabs with %i <pageno>,
    # indicating an image caption. Remove empty lines/extraneous format chars.
    # Prepend %p to first line, indicating page header.
    # Prepend %n to footer page number/volume indicator.
    djvutxt --pages=$x $1 |\
    sed -e s/"^K"/"%K"/g \
        -e s/"^_^]"/"%L"/g \
        -e s/"^_"/" "/g \
        -e s/"^L"/""/g \
        -e \$d |\
    sed -e s/"^%L\(%K\)\{2,22\}"/"%i $x "/ \
        -e s/"%L%K"// \
        -e s/"^[ ]*\([0-9]\{2,3\}\|[Vv][oO0][lL1].*\) *$"/"%n &"/ \
        -e 1s/"^"/"%p "/ \
        -e /"^$"/d \
        >> $f
    printf "Page: %3d/%d\r" $x $n   # send progress status to STDOUT
  done
  echo
}
djvu uses vertical tab, unit/group separator, and form feed control characters; these are transcribed to readable chars, and interpreted as indicated in the script. Why not insert html codes like <p> instead of tabs? Because while proofreading, I like to keep the file as close to pure text as possible. Instead, I use home-brewed mnemonics (I sense people shuddering): %P for page breaks, %p for page headings, etc. This is fairly unobtrusive and easy to filter with various tools. This is not a universal tool for handling djvu-files; if, for example, there are no page headers, or if there are footers instead, it won't work well. However, I suspect it's simpler to learn sufficient shell scripting to modify the script to match different book formats than to learn how to use a monstrous everything-to-all-books function with umpteen settings and options. Next time: Identifying and correcting OCR errors. |
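Once a book is converted, the mnemonics make quick sanity checks easy with ordinary unix tools. A minimal sketch (the sample text below is made up, not from the Kane book):

```shell
# Made-up sample in the %-mnemonic format produced by extracttext
printf '%s\n' '%P 1' '%p THE SECOND GRINNELL EXPEDITION.' 'Body text.' \
              '%P 2' '%p THE SEC0ND GRINNELL EXPEDITI0N.' 'More text.' > sample.txt

grep -c '^%P' sample.txt                # number of pages extracted
grep '^%p' sample.txt | sort | uniq -c  # headers; OCR slips (0 for O) stand out
```

Since most pages share the same running header, sorting and counting the %p lines makes misrecognized headers jump out immediately.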
12-03-2011, 02:52 PM | #2 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
Finding and correcting OCR errors
OCR does stumble here and there. So does human proofreading; it's easy to miss the occasional mispnirt... However, with some careful thought, it is possible to construct search patterns which identify a decent proportion of them.
The first category of errors is the one that can be automatically corrected. Spaces before punctuation like ;:,?! can be safely removed, as can spaces after quotation marks at the start of a line or before them at the end of a line. Likewise ‘ tlie’ can confidently be replaced with ‘ the’, and ‘ m ’ with ‘ in ’. Code:
function tlie_m_punctuationclean {
# Usage: tlie_m_punctuationclean <text file>.
# Autocorrects in-place some OCR errors.
  sed -i -e s/" *\([:?!;]\)"/"\1"/g \
         -e s/"\(^ *\| \)\" \+"/"\1\""/ \
         -e s/" \+\" \+$"/"\""/ \
         -e s/"\([ ][Tt]\|^[Tt]\)lie"/"\1he"/ \
         -e s/" m "/" in "/ \
         $*
  sed -i -e s/"\"'"/"\"\ '"/g \
         -e s/"'\""/"'\ \""/g \
         $*
}
Code:
function marksuspects {
# Usage: marksuspects <text file>. Prepends a ~ in front of words that need
# correction. Edits in-place.
  sed -i s/"\([^ ]*\)\([a-z][A-Z0-9]\|[A-Za-z][(){\[\]}.,;:?!][A-Za-z]\|q[^u]\)"/"~\1\2"/g $1
}
Next: Combining page scans with page text.
Last edited by SBT; 12-04-2011 at 03:02 PM. Reason: Updated tlie_m_punctuationclean acc. to sug. frm. DiapDealer & Jellby |
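To see what marksuspects flags, here is a small made-up sample: 'qick' trips the q-not-followed-by-u rule, and 'theEnd' the lowercase-glued-to-uppercase rule, while the genuine OCR error 'tbe' sails past (that one is for the auto-correct patterns or the human eye):

```shell
# Made-up sample line with two suspects
printf 'He said tbe word qick and theEnd.\n' > sus.txt
# Same sed as in marksuspects, applied to the sample file
sed -i s/"\([^ ]*\)\([a-z][A-Z0-9]\|[A-Za-z][(){\[\]}.,;:?!][A-Za-z]\|q[^u]\)"/"~\1\2"/g sus.txt
cat sus.txt
# He said tbe word ~qick and ~theEnd.
```

Searching for ~ in the editor then walks you through the suspects one by one.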
12-03-2011, 04:04 PM | #3 | |
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
|
|
12-04-2011, 05:03 AM | #4 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
I always try to remove those spaces that I think should be handled by font kerning (and I've modified my preferred font to add kerning pairs between quotes). In any case, those spaces should be non-breaking and thin if possible (&#8239;, the narrow no-break space)
|
12-04-2011, 07:42 AM | #5 | |
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
Last edited by DiapDealer; 12-04-2011 at 08:21 AM. |
|
12-04-2011, 04:12 PM | #6 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
Page scans and OCR text side by side
Thanks for feedback, Jellby and DiapDealer. I've updated the function accordingly. Good proofreading patterns are probably worth a thread of their own.
Anyhow, on to today's task: When proofreading an OCR text, it's a necessity to have the scanned page images side by side with the text. Of course you can open the djvu/pdf file in a viewer and the text in a separate editor, but it is a trifle tiresome to hop back and forth between them to keep the page views synchronized. In a previous post I presented a script to combine the images and text in an HTML table, which could then be imported into LibreOffice and edited there. A slightly revised version is shown below. First, a directory is filled with the page images extracted from the djvu-file. As this is a time-consuming operation, it is devolved to a separate function. This version assumes the book is less than 1000 pages. It also scales down the image, and clips it. The clipping is probably book-dependent, and the coordinates are probably possible to extract from the djvu-file, but finding out how is on the TODO list. Required tools: netpbm and cjpeg. Code:
function extractpagescans {
# Usage: extractpagescans <djvufile>.
# Creates a jpeg-file of each page, and stores it in directory "pages"
  mkdir pages
  n=$(djvused $1 -e 'n')
  for x in $(seq $n)
  do
    ddjvu -format=ppm -page=$x -segment=1700x2850+200+200 $1 - |\
    pnmscale 0.5 |\
    cjpeg -quality 35 -smooth 50 -scale "1/2" -optimize \
      > $(printf "pages/%3.3d.jpg" $x)
    echo $x
  done
}
Code:
function makeproofreadhtml {
# Usage: makeproofreadhtml <textfile>
# Creates a html file with a two-column table, page scans to the left,
# OCR text to the right.
  imgdir=pages
  awk -v img="$(basename $imgdir)" '
  BEGIN {
    # Set the %P (new page) tag as record separator
    RS="%P"
    charset="utf8"
    # html header
    print "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">\n\
<html xmlns=\"http://www.w3.org/1999/xhtml\">\n\
<head>\n\
<meta http-equiv=\"content-type\" content=\"text/html; charset="charset"\" />\n\
</head>\n\
<body>\n\
<table>"
  }
  {
    # substitute &,<,> characters with html entities
    # (\\& is a literal ampersand in awk replacement text;
    # the forum software had unescaped these entities)
    gsub("&","\\&amp;")
    gsub("<","\\&lt;")
    gsub(">","\\&gt;")
    # add scan image in left column
    print "<tr><td>"
    printf("<img width=\"500\" src=\"%s/%3.3d.jpg\">", img, NR-1)
    print "</td>"
    # add text as preformatted text, preserving line breaks etc.,
    # in right column; %G marks the cell header
    print "<td>"
    print "<pre>"
    printf("%%G ")
    print
    print "</pre>"
    print "</td></tr>"
  }
  END {
    # wrap up html file
    print "</table>"
    print "</body>"
    print "</html>"
  }' $1 \
  > ${1%txt}html   # output to html file
}
I also change the font to italic and bold where indicated in the scans while in LibreOffice, but for other types of formatting I prefer to use %-type mnemonics. Next: Extracting the text from this HTML-file, handling mnemonics and footnotes, and producing an XHTML-compliant file. Last edited by SBT; 12-04-2011 at 04:12 PM. Reason: typo |
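A small detail worth noting: both functions rely on printf's %3.3d to zero-pad the page number, so the jpeg filenames sort numerically and agree with the NR-1 lookup when the table is built:

```shell
# Zero-padded filenames sort correctly even past page 99
for x in 1 2 10 100
do
  printf "pages/%3.3d.jpg\n" $x
done
# pages/001.jpg
# pages/002.jpg
# pages/010.jpg
# pages/100.jpg
```

Without the padding, a plain alphabetic sort would put pages/10.jpg before pages/2.jpg.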
12-09-2011, 11:50 PM | #7 |
Avid Reader
Posts: 161
Karma: 36472
Join Date: Sep 2008
Location: Look for rain, hail and snow...
Device: PRS-505, PRS-600, PRS T1, Kobo Glo
|
SBT, let me please thank you, this is a very instructive post and now I need to sit down...
|
12-10-2011, 01:32 PM | #8 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
convert from proof-reading html back to text
@opitz: Thanks for your kind words; any kind of feedback is welcome.
When you've finished proofreading in LibreOffice, or just want to return to editing in a pure text editor, you can use the following function, which is the reverse of makeproofread (which I think I'll rename txt2proof, so there'll be some consistency). I thought it would be a good idea to read input from STDIN if no filename is given; I'll probably add that functionality to all functions where appropriate. Code:
function proof2txt {
# Usage: proof2txt [inputfile.html].
# If no inputfile, input is read from STDIN.
# Output written to STDOUT
  [ $1 ] && inputfile=$1 || inputfile="/dev/stdin"
  # Handle text marked as italic/bold.
  # LibreOffice inserts </I> and <I> (ditto for bold) at the end and beginning
  # of italic sections that span several lines.
  # Enclosing <..> tags are replaced by html-encoded &lt; and &gt; for
  # italics/bold, so lynx passes them through as plain text.
  # (-n so that only the final p prints the joined buffer)
  sed -n '1h;1!H;${g;s/<\/I>\n<I>/\n/g;s/<\/B>\n<B>/\n/g;p;}' $inputfile |\
  sed s/'<\(\/\?[BI]\)>'/'\&lt;\1\&gt;'/g |\
  lynx -dump -stdin |\
  grep -v "^ \[[0-9]\{3,3\}\.jpg\]"
}
Last edited by SBT; 12-10-2011 at 01:36 PM. |
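The first sed above is a classic hold-space trick: slurp the whole file into the hold space, then substitute across line boundaries. A stripped-down illustration with made-up input (note the -n, without which sed's autoprint would emit every line a second time):

```shell
# Two lines the way LibreOffice exports a spanning italic run
printf '%s\n' 'x <I>long</I>' '<I>italic</I> y' |
sed -n '1h;1!H;${g;s/<\/I>\n<I>/\n/g;p;}'
# x <I>long
# italic</I> y
```

The close/reopen pair straddling the newline disappears, leaving one italic run that spans both lines.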
12-10-2011, 05:01 PM | #9 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
Handling words split over lines
Words which are split over lines must be rejoined.
The following function handles this, and also words split over pages and by images. The hyphen is replaced by '#-', so that you can manually inspect which hyphens should be retained. Code:
function removehyphens {
# Usage: removehyphens [inputfile.txt]
# Takes words split over two lines and prepends them at the start of the
# next text line. Replaces the hyphen with '#-' for manual inspection.
# Removes hyphens that are probably redundant, and confirms some which are
# correct.
# If no inputfile, input is read from STDIN.
# Output written to STDOUT
  [ $1 ] && inputfile=$1 || inputfile="/dev/stdin"
  awk '/^ *[a-z]/ {printf("%s",hyph);sub(/^ */,"");hyph="";}\
       {if (/[a-z-]- *$/) {hyph=$NF;$NF="";sub(/- *$/,"#-",hyph)};\
        print;}' $inputfile |\
  sed -e s/"^\(..\)#-"/"\1"/ \
      -e s/"#-\(ing\|ment\)"/"\1"/ \
      -e s/"\(twenty\|thirty\|forty\|fifty\|sixty\|seventy\|eighty\|ninety\)#-"/"\1-"/
}
Last edited by SBT; 12-12-2011 at 03:21 PM. Reason: properly handle double hyphens at line ends |
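A quick illustration of the rejoining logic with a made-up two-line sample; the hyphenated half is carried down to the next line and marked with '#-' for inspection:

```shell
# 'expe-' is moved down and glued onto 'dition' on the following line
printf '%s\n' 'the expe-' 'dition sailed north' |
awk '/^ *[a-z]/ {printf("%s",hyph);sub(/^ */,"");hyph="";}
     {if (/[a-z-]- *$/) {hyph=$NF;$NF="";sub(/- *$/,"#-",hyph)};print;}' |
tail -n 1
# expe#-dition sailed north
```

The trailing sed stage then decides: '#-' before common suffixes is dropped outright, while hyphens after number words like 'twenty' are confirmed as real.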
12-11-2011, 03:09 PM | #10 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
Mnemonics (or do I mean tags?)
I'm too lazy to write out html tags unless I absolutely have to. Therefore I use short tags to indicate document properties, and convert them to html at the end of the formatting process. The general rule is that a tag consists of a % at the beginning of the line, followed by a single character.
Why have an end chapter tag (%e) as well as a begin chapter tag (%c)? In most cases this is superfluous, but sometimes there can be text or pictures before the chapter heading proper. Here's a sample tagged document: Code:
%P 1
MY GREAT NOVEL
BY
Long-forgotten author
%P 2
dedications, contents and stuff
%e
%P 2
%p INTRODUCTION 1
%c CHAPTER I
%y It was a dark and stormy night. Suddenly, a voice cried out.
%i 2 A stormy illustration
%P 3
%p INTRODUCTION 2
Why this voice@ cried out, nobody could adequately explain then and there.
%f Though it was generally agreed to be a female voice.
%P 4
%p INTRODUCTION 3
Thus the setting for this novel should have been set.
%e
%P 5
%c CHAPTER II
%y A glorious morning spread happiness and joy... |
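Since every tag is a % plus one character at the start of a line, a quick tag census catches mistyped tags before conversion. A minimal sketch on a made-up mini-document:

```shell
# Made-up mini-document; count tag usage to spot typos in the tags
printf '%s\n' '%P 1' '%c CHAPTER I' '%y Text.' '%e' > tagged.txt
grep -o '^%.' tagged.txt | LC_ALL=C sort | uniq -c
```

Any tag with a suspiciously low count, or a character you never use, is worth a second look.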
12-12-2011, 05:27 PM | #11 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
Handling footnotes
And for my next trick:
Take footnote references and footnotes, indicated by @'s and %f's respectively, and replace by properly referenced and back-referencing endnotes at the end of the chapter. Code:
function zx_footnotes {
# Usage: zx_footnotes [text file]
# All @'s are replaced by links to the corresponding footnote.
# All footnotes indicated by %f are converted to end-notes at the end of
# the chapter.
# Footnotes which span more than one page must be collected on a single page.
# The end-notes have links back to the original reference.
# The @'s can have a number appended to them for control purposes, but
# they are not used by this function.
# If no input file is given, input is read from STDIN.
# Output is to STDOUT
  [ $1 ] && inputfile=$1 || inputfile="/dev/stdin"
  awk '
  BEGIN {n=0;r=1;cn=1}
  /@/   {sub(/@[0-9]*/, sprintf("<a name=\"R%2.2d_%3.3d\"/><a href=\"#F%2.2d_%3.3d\" class=\"footnote\">%d)</a>",cn,r,cn,r,r));r++}
  /^%f/ {fn=1;n++;sub("%f","")}
  /^%[eP]/ {fn=0}
  /^%e/ {
    if (n>0) {
      print "<h3 class=\"footnoteheader\">Footnotes</h3>"
      print "<dl class=\"footnotelist\">"
      for (i=1; i<=n; i++) {
        printf("  <dt><a name=\"F%2.2d_%3.3d\"/><a href=\"#R%2.2d_%3.3d\">%d)</a></dt>",cn,i,cn,i,i)
        print "<dd>",fns[i],"</dd>"
      }
      print "</dl>"
      n=0;r=1;cn++
      delete fns
    }
    $0="<hr class=\"endchapter\" />"
  }
  {if (fn>0) {fns[n]=fns[n]$0} else print}
  ' $inputfile
}
Why the prefix zx_ in zx_footnotes?
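The sprintf format packs chapter and footnote number into fixed-width ids, so reference 3 in chapter 1 becomes R01_003 (and its note F01_003), keeping every anchor unique across the whole book. A one-liner shows the naming scheme:

```shell
# chapter 1, footnote 3 -> fixed-width anchor name
awk 'BEGIN{cn=1; r=3; printf("R%2.2d_%3.3d\n", cn, r)}'
# R01_003
```

%2.2d allows 99 chapters and %3.3d 999 notes per chapter, which should be plenty for most books.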
|
12-13-2011, 01:02 PM | #12 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
convert to xhtml
Time to convert our tagged file to xhtml:
Code:
function zx_txt2xhtml {
# Usage: zx_txt2xhtml [textfile]
# Converts a text-file with %-type tags to an xhtml file.
# The file should be run through html tidy afterwards.
# If no input file is given, input is read from STDIN.
# Output is to STDOUT
# (disabled) intro-paragraph handling:
#-e '/^%q/,/^%[^q]/{s/^%q[ \t]\+/<div class="intro">/;s/%[^q]/<\/div>\n&/}' |\
  [ $1 ] && inputfile=$1 || inputfile="/dev/stdin"
  cat $inputfile |\
  sed -e s/"^%c \(.*\)"/"<\/p>\n<hr class=\"endchapter\"\/>\n\n<h2 class=\"chapter\">\1<\/h2>"/ |\
  sed -e s/"^%y[ ]\+\([^A-Z0-9]*[A-Z0-9]\)\([^ ]*\)"/"<p class=\"initial\"><span class=\"drop\">\1<\/span><span class=\"first\">\2<\/span>"/ \
      -e s/"^\( \{6,8\}\|\t\)"/"<\/p>\n<p>"/ \
      -e s/"#-"/"-"/g \
      -e /"^%[pPiw].*"/s/".*"/"<!-- & -->"/ |\
  sed /"^$"/d |\
  sed -e s/"<span class=\"drop\">\(.*\)\([AL]\)<\/span><span class=\"first\">"/"<span class=\"drop\">\1\2<\/span><span class=\"after\2\">"/ \
      -e 1i'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\
<html xmlns="http://www.w3.org/1999/xhtml">\
<head>\
<meta http-equiv="Content-Type" content="text/html; charset=utf8" />\
<title></title>\
<link href="main.css" rel="stylesheet" type="text/css" />\
</head>\
<body>' \
      -e \$a"</body>\n</html>"
}
Code:
tidy -asxhtml -utf8
All the %-tags which are not converted to html tags are enclosed in comments. No need to remove information unless you have to. At this point we should have a nice, well-formatted xhtml file, all ready to be fed into Sigil or Calibre. Or about a dozen other epub creation tools. Or we can bloody-mindedly finish as we started, and just make a few more bash functions to arrive at a complete epub file... |
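As a quick check of the %c rule, feeding a single tagged line through the first sed stage shows the chapter markup it emits (GNU sed, which expands \n in the replacement text; the input line is made up):

```shell
# One %c line -> close previous chapter, horizontal rule, new heading
printf '%%c CHAPTER I\n' |
sed -e s/"^%c \(.*\)"/"<\/p>\n<hr class=\"endchapter\"\/>\n\n<h2 class=\"chapter\">\1<\/h2>"/
# </p>
# <hr class="endchapter"/>
#
# <h2 class="chapter">CHAPTER I</h2>
```

The stray </p> before the first chapter is one of the things the final pass through tidy cleans up.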
Tags |
command line, shell |
|