Thanks for the feedback, Jellby and DiapDealer. I’ve updated the function accordingly. Good proofreading patterns are probably worth a thread of their own.
Anyhow, on to today’s task:
When proofreading an OCR text, it is essential to have the scanned page images side by side with the text. Of course you can open the djvu/pdf file in a viewer and the text in a separate editor, but it is a trifle tiresome to hop back and forth between them to keep the pages in sync. In a previous
post I presented a script to combine the images and text in an HTML table, which then could be imported into LibreOffice and edited there.
A slightly revised version is shown below.
First, a directory is filled with the page images extracted from the djvu file. As this is a time-consuming operation, it is delegated to a separate function. This version assumes the book has fewer than 1000 pages. It also scales the image down and clips it. The clipping is probably book-dependent; it should be possible to extract the coordinates from the djvu file, but finding out how is still on the TODO list.
Required tools: djvulibre (for djvused and ddjvu), netpbm and cjpeg.
Code:
function extractpagescans {
    # Usage: extractpagescans <djvufile>
    # Creates a jpeg file for each page and stores it in the directory "pages"
    mkdir -p pages
    n=$(djvused "$1" -e 'n')    # number of pages in the djvu file
    for x in $(seq "$n")
    do
        ddjvu -format=ppm -page="$x" -segment=1700x2850+200+200 "$1" - |
            pnmscale 0.5 |
            cjpeg -quality 35 -smooth 50 -scale "1/2" -optimize \
            > "$(printf "pages/%3.3d.jpg" "$x")"
        echo "$x"
    done
}
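Since the function silently depends on several external programs, a small guard like the following can save a half-finished run. This is my own sketch, not part of the original function; `checktools` is a hypothetical helper name.

```shell
# Hypothetical helper: verify that every named tool is on PATH
# before starting a long extraction run.
checktools() {
    local missing=0
    for tool in "$@"
    do
        if ! command -v "$tool" >/dev/null 2>&1
        then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return $missing
}
# e.g.: checktools djvused ddjvu pnmscale cjpeg || exit 1
```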
Then the HTML file is constructed. The text on each page is enclosed between <pre></pre> tags to preserve line breaks and other formatting.
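Since the OCR text may contain &, < and > characters, these have to be turned into HTML entities before the text goes inside the table. The order matters: the ampersand must be replaced first, or the & in the freshly inserted entities would be escaped a second time. A standalone sketch of the substitution:

```shell
# Demo of HTML-escaping text with awk.  In the replacement string,
# "\\&" yields a literal "&" (a bare "&" would re-insert the match).
echo 'x < y && y > z' |
awk '{ gsub("&", "\\&amp;"); gsub("<", "\\&lt;"); gsub(">", "\\&gt;"); print }'
# -> x &lt; y &amp;&amp; y &gt; z
```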
Code:
function makeproofreadhtml {
#Usage: makeproofreadhtml <textfile>
#creates a html file with a two-column table, page scans to the left, OCR text to the right.
imgdir=pages
awk -v img="$(basename $imgdir)" '
BEGIN {
# Use the %P (new page) mnemonic as record separator
RS="%P"
charset="utf8"
# html header
print "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">\n\
<html xmlns=\"http://www.w3.org/1999/xhtml\">\n\
<head>\n\
<meta http-equiv=\"content-type\" content=\"text/html; charset="charset"\" />\n\
</head>\n\
<body>\n\
<table>"
}
{
# substitute &, <, > characters with html entities (& first, so the
# entities inserted for < and > are not escaped again)
gsub("&", "\\&amp;")
gsub("<", "\\&lt;")
gsub(">", "\\&gt;")
# add scan image in left column
print "<tr><td>"
printf("<img width=\"500\" src=\"%s/%3.3d.jpg\" alt=\"\" />", img, NR-1)
print "</td>"
# add text as preformatted text, preserving line breaks etc., in right column
print "<td>"
# prefix the page text with the %G (page number) mnemonic
print "<pre>"
printf("%%G ")
print
print "</pre>"
print "</td></tr>"
}
END {
# wrap up html file
print "</table>"
print "</body>"
print "</html>"
}' "$1" \
> "${1%txt}html" # output to html file
}
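To see how the %P record separator chops the text into page-sized records (one table row each), here is a minimal standalone demo with made-up sample text:

```shell
# Each %P in the input starts a new awk record.
printf 'first page%%Psecond page%%Pthird page' |
awk 'BEGIN { RS="%P" } { printf("record %d: %s\n", NR, $0) }'
# -> record 1: first page
#    record 2: second page
#    record 3: third page
```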
The HTML file can then be read by LibreOffice, and you can start correcting all those ~’s.
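For a rough progress check, something like the following one-liner (my own sketch, assuming the same %P page markers; swap the here-document for your real text file) counts the tildes left on each page:

```shell
# Count leftover ~ characters per page.  gsub() returns the number of
# substitutions, so replacing ~ with itself simply counts occurrences.
awk 'BEGIN { RS="%P" }
{ n = gsub(/~/, "~"); if (n > 0) printf("page %d: %d tilde(s)\n", NR, n) }' <<'EOF'
so~e text%Pa clean page%Pw~rd and w~rd
EOF
# -> page 1: 1 tilde(s)
#    page 3: 2 tilde(s)
```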
While in LibreOffice I also apply italics and bold where the scans indicate them, but for other types of formatting I prefer to use %-type mnemonics.
Next: Extracting the text from this HTML-file, handling mnemonics and footnotes, and producing an XHTML-compliant file.