12-04-2011, 04:12 PM   #6
SBT
Page scans and OCR text side by side

Thanks for the feedback, Jellby and DiapDealer. I’ve updated the function accordingly. Good proofreading patterns are probably worthy of a thread of their own.

Anyhow, on to today’s task:
When proofreading an OCR text, you need the scanned page images side by side with the text. Of course you can open the djvu/pdf file in a viewer and the text in a separate editor, but it is a trifle tiresome to hop back and forth between them to keep the pages in sync. In a previous post I presented a script that combines the images and text in an HTML table, which can then be imported into LibreOffice and edited there.
A slightly revised version is shown below.
First, a directory is filled with the page images extracted from the djvu file. As this is a time-consuming operation, it is delegated to a separate function. This version assumes the book has fewer than 1000 pages, since the file names are zero-padded to three digits. It also clips each page image and scales it down. The clipping region is probably book-dependent; the coordinates can probably be extracted from the djvu file, but finding out how is on the TODO list.
Required tools: DjVuLibre (djvused, ddjvu), netpbm (pnmscale) and cjpeg.
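Before starting, it may be worth checking that the tools are actually installed. This is only a minimal sketch: the function name checktools is made up for this example, and the list of commands is simply the set used by the functions in this post.

```shell
checktools() {
   # Usage: checktools <command>...
   # Prints each command that is not found on PATH; returns 1 if any is missing.
   missing=0
   for tool in "$@"
   do
      if ! command -v "$tool" >/dev/null 2>&1
      then
         echo "missing: $tool"
         missing=1
      fi
   done
   return $missing
}

# Commands used by the functions in this post
checktools djvused ddjvu pnmscale cjpeg awk || echo "install the missing tools first"
```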
Code:
function extractpagescans {
# Usage: extractpagescans <djvufile>
# Creates a jpeg file of each page and stores it in the directory "pages"
   mkdir -p pages
   # number of pages in the document
   n=$(djvused "$1" -e 'n')
   for x in $(seq "$n")
   do
      # render the clipped page as ppm, halve it twice, and compress to jpeg
      ddjvu -format=ppm -page="$x" -segment=1700x2850+200+200 "$1" - | pnmscale 0.5 | cjpeg -quality 35 -smooth 50 -scale "1/2" -optimize > "$(printf "pages/%3.3d.jpg" "$x")"
      echo "$x"
   done
}
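The three-digit file names are what limit the function to 999 pages. A sketch of how the padding width could be derived from the page count instead; pagename is a hypothetical helper, not part of the script above, and the printf format in the awk script of the HTML step would have to use the same width.

```shell
pagename() {
   # Usage: pagename <page> <total pages>
   # Prints pages/<page>.jpg, zero-padded to the number of digits in <total pages>.
   width=${#2}
   printf "pages/%0${width}d.jpg\n" "$1"
}

pagename 7 350     # pages/007.jpg
pagename 7 1200    # pages/0007.jpg
```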
Then the HTML file is constructed. The text on each page is enclosed between <pre></pre> tags to preserve line breaks and other formatting.
Code:
function makeproofreadhtml {
# Usage: makeproofreadhtml <textfile>
# Creates an HTML file with a two-column table: page scans to the left, OCR text to the right.
   imgdir=pages
   awk -v img="$(basename "$imgdir")" '
      BEGIN {
         # Use the %P page-break marker as record separator
         # (a multi-character RS needs an awk that supports it, such as gawk)
         RS="%P"
         charset="utf-8"
         # html header 
         print "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">\n\
         <html xmlns=\"http://www.w3.org/1999/xhtml\">\n\
         <head>\n\
         <meta http-equiv=\"content-type\" content=\"text/html; charset="charset"\" />\n\
         </head>\n\
         <body>\n\
         <table>"
      }
      
      {
         # replace &, < and > with HTML entities; a bare & in the replacement
         # would mean "the matched text", so it is escaped as \\&
         gsub("&","\\&amp;")
         gsub("<","\\&lt;")
         gsub(">","\\&gt;")
         # add scan image in left column
         print "<tr><td>"
         printf("<img width=\"500\" src=\"%s/%3.3d.jpg\">", img, NR-1)
         print "</td>"
      
         # add text as preformatted text, preserving line breaks etc., in right column
         print "<td>"
         # the page text starts with a literal %G marker (%% prints a single %)
         print "<pre>"
         printf("%%G ")
         print
         print "</pre>"
         print "</td></tr>"
      }
      
      END {
         # wrap up html file
         print "</table>"
         print "</body>"
         print "</html>"
      }' "$1" \
   > "${1%txt}html" # output to html file
}
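One detail of the escaping step is easy to get wrong: in the replacement string of awk's gsub, a bare & stands for the matched text, so a literal ampersand has to be written as \\&. A small stand-alone sketch (escapehtml is a name made up for this example) that can be used to check how your awk behaves:

```shell
escapehtml() {
   # Reads text on stdin and writes it with &, < and > replaced by HTML entities.
   awk '{
      gsub(/&/, "\\&amp;")   # must come first, or the entities below get re-escaped
      gsub(/</, "\\&lt;")
      gsub(/>/, "\\&gt;")
      print
   }'
}

printf 'if a < b && c > d\n' | escapehtml
# prints: if a &lt; b &amp;&amp; c &gt; d
```

If the output shows something like "<lt;" instead of "&lt;", the ampersand in the replacement was not escaped.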
The HTML file can then be read by LibreOffice, and you can start correcting all those ~’s.
I also change the font to italic and bold where indicated in the scans while in LibreOffice, but for other types of formatting I prefer to use %-type mnemonics.

Next: extracting the text from this HTML file, handling mnemonics and footnotes, and producing an XHTML-compliant file.

Last edited by SBT; 12-04-2011 at 04:12 PM. Reason: typo