Thanks for the feedback, Jellby and DiapDealer. I’ve updated the function accordingly. Good proofreading patterns are probably worth a thread of their own.
Anyhow, on to today’s task:
When proofreading an OCR text, it is essential to have the scanned page images side by side with the text. Of course you can open the djvu/pdf file in a viewer and the text in a separate editor, but it is a trifle tiresome to hop back and forth between them to keep the pages in sync. In a previous
post I presented a script to combine the images and text in an HTML table, which then could be imported into LibreOffice and edited there.
A slightly revised version is shown below.
First, a directory is filled with the page images extracted from the djvu file. As this is a time-consuming operation, it is delegated to a separate function. This version assumes the book has fewer than 1000 pages. It also scales the image down and clips it. The clipping is probably book-dependent; it should be possible to extract the coordinates from the djvu file, but finding out how is still on the TODO list.
Required tools: djvulibre (for djvused and ddjvu), netpbm and cjpeg.
Code:
function extractpagescans {
    # Usage: extractpagescans <djvufile>
    # Creates a jpeg file for each page and stores it in the directory "pages"
    mkdir -p pages
    n=$(djvused "$1" -e 'n')    # number of pages in the djvu file
    for x in $(seq "$n")
    do
        ddjvu -format=ppm -page="$x" -segment=1700x2850+200+200 "$1" - |
            pnmscale 0.5 |
            cjpeg -quality 35 -smooth 50 -scale "1/2" -optimize \
            > "$(printf "pages/%3.3d.jpg" "$x")"
        echo "$x"
    done
}
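Since the function silently depends on several external programs, a small guard like the following can save a half-finished run. This is my own sketch, not part of the original function; `checktools` is a hypothetical helper name.

```shell
# Hypothetical helper: verify that every named tool is on PATH
# before starting a long extraction run.
checktools() {
    local missing=0
    for tool in "$@"
    do
        if ! command -v "$tool" >/dev/null 2>&1
        then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return $missing
}
# e.g.: checktools djvused ddjvu pnmscale cjpeg || exit 1
```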
Then the HTML file is constructed. The text on each page is enclosed between <pre></pre> tags to preserve line breaks and other formatting.
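Since the OCR text may contain &, < and > characters, these have to be turned into HTML entities before the text goes inside the table. The order matters: the ampersand must be replaced first, or the & in the freshly inserted entities would be escaped a second time. A standalone sketch of the substitution:

```shell
# Demo of HTML-escaping text with awk.  In the replacement string,
# "\\&" yields a literal "&" (a bare "&" would re-insert the match).
echo 'x < y && y > z' |
awk '{ gsub("&", "\\&amp;"); gsub("<", "\\&lt;"); gsub(">", "\\&gt;"); print }'
# -> x &lt; y &amp;&amp; y &gt; z
```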
Code:
function makeproofreadhtml {
#Usage: makeproofreadhtml <textfile>
#creates a html file with a two-column table, page scans to the left, OCR text to the right.
imgdir=pages
awk -v img="$(basename $imgdir)" '
BEGIN {
# Use the %P (new page) mnemonic as record separator
RS="%P"
charset="utf8"
# html header
print "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">\n\
<html xmlns=\"http://www.w3.org/1999/xhtml\">\n\
<head>\n\
<meta http-equiv=\"content-type\" content=\"text/html; charset="charset"\" />\n\
</head>\n\
<body>\n\
<table>"
}
{
# substitute &, <, > characters with html entities (& first, so the
# entities inserted for < and > are not escaped again)
gsub("&", "\\&amp;")
gsub("<", "\\&lt;")
gsub(">", "\\&gt;")
# add scan image in left column
print "<tr><td>"
printf("<img width=\"500\" src=\"%s/%3.3d.jpg\" alt=\"\" />", img, NR-1)
print "</td>"
# add text as preformatted text, preserving line breaks etc., in right column
print "<td>"
# prefix the page text with the %G (page number) mnemonic
print "<pre>"
printf("%%G ")
print
print "</pre>"
print "</td></tr>"
}
END {
# wrap up html file
print "</table>"
print "</body>"
print "</html>"
}' "$1" \
> "${1%txt}html" # output to html file
}
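To see how the %P record separator chops the text into page-sized records (one table row each), here is a minimal standalone demo with made-up sample text:

```shell
# Each %P in the input starts a new awk record.
printf 'first page%%Psecond page%%Pthird page' |
awk 'BEGIN { RS="%P" } { printf("record %d: %s\n", NR, $0) }'
# -> record 1: first page
#    record 2: second page
#    record 3: third page
```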
The HTML file can then be read by LibreOffice, and you can start correcting all those ~’s.
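For a rough progress check, something like the following one-liner (my own sketch, assuming the same %P page markers; swap the here-document for your real text file) counts the tildes left on each page:

```shell
# Count leftover ~ characters per page.  gsub() returns the number of
# substitutions, so replacing ~ with itself simply counts occurrences.
awk 'BEGIN { RS="%P" }
{ n = gsub(/~/, "~"); if (n > 0) printf("page %d: %d tilde(s)\n", NR, n) }' <<'EOF'
so~e text%Pa clean page%Pw~rd and w~rd
EOF
# -> page 1: 1 tilde(s)
#    page 3: 2 tilde(s)
```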
While in LibreOffice I also apply italics and bold where the scans indicate them, but for other types of formatting I prefer to use %-type mnemonics.
Next: Extracting the text from this HTML-file, handling mnemonics and footnotes, and producing an XHTML-compliant file.