MobileRead Forums - View Single Post

SBT · 12-03-2011, 02:52 PM

OCR does stumble here and there. So does human proofreading; it’s easy to miss the occasional mispnirt... However, with some careful thought, it is possible to construct search patterns which identify a decent proportion of these errors.
The first category consists of errors that can be corrected automatically. Spaces before punctuation like ;:,?! can be safely removed, as can spaces after quotation marks at the start of a line or before them at the end of a line. Likewise ‘ tlie’ can confidently be replaced with ‘ the’, and ‘ m ’ with ‘ in ’.

Code:

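function tlie_m_punctuationclean {
# Usage: tlie_m_punctuationclean <text file>.
# Autocorrects in-place some OCR errors. Needs GNU sed (-i, \| and \+).
# First pass: spaces before punctuation, spaces around quotation marks
# at line start/end, 'tlie' -> 'the', ' m ' -> ' in '.
sed -i -e s/"  *\([:?!;]\)"/"\1"/g \
-e s/"\(^ *\|   \)\" \+"/"\1\""/ \
-e s/" \+\" \+$"/"\""/ \
-e s/"\([[:blank:]][Tt]\|^[Tt]\)lie"/"\1he"/g \
-e s/" m "/" in "/g \
"$@"
# Second pass: put a space between adjacent double and single quotes.
sed -i -e s/"\"'"/"\"\ '"/g \
-e s/"'\""/"'\ \""/g \
"$@"
}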
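
Since sed -i overwrites the file, it can be reassuring to look at the changes first. Here is a minimal sketch of a preview, not part of the function above: ocr_fix_stream and ocr_fix_preview are names I made up for illustration, only three of the substitutions are included, and GNU sed is assumed.

```shell
# Hypothetical helpers (names are illustrative, not from the post).
# Assumes GNU sed, which understands \| in basic regular expressions.

# Apply a few of the automatic fixes to stdin and write the result to
# stdout, leaving any file on disk untouched.
ocr_fix_stream() {
  sed -e 's/  *\([:?!;]\)/\1/g' \
      -e 's/\([[:blank:]][Tt]\|^[Tt]\)lie/\1he/g' \
      -e 's/ m / in /g'
}

# Show a unified diff between a file and its would-be corrected version.
# diff exits with status 1 when the two differ; that is expected here.
ocr_fix_preview() {
  ocr_fix_stream < "$1" | diff -u "$1" -
}
```

Running ocr_fix_preview on a text file prints the lines that would change; if the diff looks sane, run the in-place function.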

The second category consists of evident errors where the correct version is not self-evident. Capital letters immediately after lower case, numbers following letters, symbols embedded in letters, and q not followed by u are typical. The following function prepends a ‘~’ (tilde) to every word which contains any such combination. This complements the caret, ‘^’, which many OCR programs use to indicate failure to interpret. So afterwards, you have to search for ~’s and ^’s and correct those words by hand.

Code:

function marksuspects {
# Usage: marksuspects <text file>. Prepends a ~ in front of words that need
#    correction. Edits in-place.
# Note: ] must come first inside the bracket expression to be taken literally.
sed -i s/"\([^ ]*\)\([a-z][A-Z0-9]\|[A-Za-z][][(){}.,;:?!][A-Za-z]\|q[^u]\)"/"~\1\2"/g "$1"
}
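
To work through the marked words afterwards, something along these lines should do. It is only a sketch: list_suspects is a name I made up, and it assumes a grep that supports -o (GNU grep does).

```shell
# Hypothetical helper (name is illustrative, not from the post).
# Print every word that starts with ~ or ^, prefixed by its line number.
# -n adds the line number, -o prints only the matching part of the line.
list_suspects() {
  grep -n -o '[~^][^ ]*' "$1"
}
```

Each output line has the form linenumber:word, so you can jump straight to the spot in your editor.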

The search patterns in these functions can also be used in editors which support regular expressions; Sigil and LibreOffice do.
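
Most editors use a regex dialect where grouping and alternation are written without backslashes, i.e. ( ) and | instead of \( \) and \|. As a rough illustration, here is the suspect pattern in that style, run through grep -E so it only searches and changes nothing. find_suspects is a made-up name, and the handling of ] inside brackets may differ in a given editor, so treat it as a starting point.

```shell
# Hypothetical helper (name is illustrative, not from the post).
# List lines containing a suspect combination, using extended regex syntax:
#   lower case followed by a capital or digit,
#   a symbol sandwiched between two letters,
#   or q followed by something other than u.
find_suspects() {
  grep -n -E '[a-z][A-Z0-9]|[A-Za-z][][(){}.,;:?!][A-Za-z]|q[^u]' "$1"
}
```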

Next: Combining page scans with page text.

Post #2: Finding and correcting OCR errors
Last edited by SBT; 12-04-2011 at 03:02 PM. Reason: Updated tlie_m_punctuationclean acc. to sug. frm. DiapDealer & Jellby