Old 12-03-2011, 02:52 PM   #2
SBT
Finding and correcting OCR errors

OCR does stumble here and there. So does human proofreading; it’s easy to miss the occasional mispnirt... However, with some careful thought, it is possible to construct search patterns that catch a decent proportion of the errors.
The first category of errors is the one that can be corrected automatically. Spaces before punctuation like ;:,?! can be safely removed, as can spaces after quotation marks at the start of a line or before them at the end of a line. Likewise, ‘ tlie’ can confidently be replaced with ‘ the’, and ‘ m ’ with ‘ in ’.
Code:
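function tlie_m_punctuationclean {
# Usage: tlie_m_punctuationclean <text file>...
# Autocorrects some OCR errors in place. Needs GNU sed (-i, \|, \+).
# Rules, in order: drop spaces before punctuation; drop spaces after a
# quote at the start of a line; drop spaces around a quote at the end of
# a line; tlie -> the (so tlien -> then as well); " m " -> " in ".
sed -i -e 's/  *\([:?!;,]\)/\1/g' \
-e 's/\(^ *\|   \)" \+/\1"/' \
-e 's/ \+" *$/"/' \
-e 's/\(^\|[[:blank:]]\)\([Tt]\)lie/\1\2he/g' \
-e 's/ m / in /g' \
"$@"
# Put a space between adjacent double and single quotes.
sed -i -e "s/\"'/\" '/g" \
-e "s/'\"/' \"/g" \
"$@"
}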
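Since the function edits files in place, it can be reassuring to preview the changes first. Here is a minimal sketch of that idea, assuming GNU sed and diff. The name preview_ocr_fixes is made up for this example, and it only covers three of the rules described above (spaces before punctuation, tlie, and m):

```shell
# Hypothetical helper: show what the autocorrections would change
# without modifying the file. Assumes GNU sed (for \| in the pattern).
preview_ocr_fixes() {
    # Run the rules on a copy of the text and diff it against the original.
    # diff exits non-zero when it finds differences, hence the "|| true".
    sed -e 's/  *\([:?!;,]\)/\1/g' \
        -e 's/\(^\|[[:blank:]]\)\([Tt]\)lie/\1\2he/g' \
        -e 's/ m / in /g' \
        "$1" | diff "$1" - || true
}
```

Lines starting with < show the original text, lines starting with > show what the rules would produce.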
The second category consists of evident errors whose correct version is not self-evident. Typical examples are a capital letter immediately after a lower-case letter, a digit following a letter, punctuation embedded between letters, and a q not followed by u. The following function prepends a ‘~’ (tilde) to every word that contains such a combination. This complements the hat, ‘^’, which many OCR programs use to indicate a failure to interpret. Afterwards, you search for the ~’s and ^’s and correct each one by hand.
Code:
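function marksuspects {
# Usage: marksuspects <text file>. Prepends a ~ to words that need
#    correction. Edits in-place.
# Note: ] comes first inside the bracket expression so that it is taken
#    literally; a backslash does not escape it there.
sed -i 's/\([^ ]*\)\([a-z][A-Z0-9]\|[A-Za-z][][(){}.,;:?!][A-Za-z]\|q[^u]\)/~\1\2/g' "$1"
}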
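Once the marks are in place, the remaining work is finding them. A small sketch of one way to do that from the command line, assuming a grep that supports -o (GNU and BSD grep both do). The name list_suspects is invented for this example:

```shell
# Hypothetical helper: list every word carrying a ~ or ^ mark,
# each prefixed with its line number, so it can be checked by hand.
list_suspects() {
    # -n adds the line number, -o prints only the matching word.
    # Inside the brackets ^ is not first, so it is a literal hat.
    grep -n -o '[^ ]*[~^][^ ]*' "$1"
}
```

Running list_suspects on a marked file gives a worklist to walk through alongside the page scans.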
The search patterns in these functions can also be used in editors which support regular expressions; Sigil and LibreOffice do.
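One thing to watch for: sed's basic syntax writes grouping and alternation with backslashes, as \( \) and \|, whereas editor search boxes generally expect the plain ( ) and | form. As a rough sketch, the suspect pattern can be written in that extended form and tried out with grep -E before pasting it into an editor. The variable and function names are made up for this example, and a given editor may still need small tweaks, such as escaping the brackets inside the character class:

```shell
# The three suspect patterns in extended (unescaped) regex syntax.
# ] is placed first in the bracket expression so it is taken literally.
suspect_re='[a-z][A-Z0-9]|[A-Za-z][][(){}.,;:?!][A-Za-z]|q[^u]'

# Hypothetical helper: print, with line numbers, the lines containing a suspect.
show_suspect_lines() {
    # LC_ALL=C keeps ranges like [a-z] limited to plain ASCII letters.
    LC_ALL=C grep -n -E "$suspect_re" "$1"
}
```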

Next: Combining page scans with page text.

Last edited by SBT; 12-04-2011 at 03:02 PM. Reason: Updated tlie_m_punctuationclean acc. to sug. frm. DiapDealer & Jellby