These are the things I normally look for in Tesseract output:
'fi' and 'fl' often get mixed up. I still haven't found the best pattern for this, but
/fi[oaie]/ definitely catches a few dodgy ones.
/[a-zA-Z][0-9]/ : A letter followed by a digit is pretty dodgy, though normally this only happens with 0 and 1.
/[a-z][A-Z]/ : A lower-case letter followed by an upper-case letter.
/ [,.?!:]/ : Whitespace before punctuation.
/\b[bcdfhjklmnopqrstuvwxyz]\b/ : Single-letter words, excluding 'a' and the letters that appear in 'e.g.' and 'i.e.' (e, g, i).
/\b\(tl\|nr\|rr\)/ : Impossible beginnings to English words. Any linguists care to expand?
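For anyone who wants to run all of these in one pass, here is a rough Python sketch, not a polished tool. The function name find_suspects and the labels are my own invention for illustration. The patterns are the ones listed above, with the last one rewritten from the escaped \( \| \) form into Python's (?:...|...) syntax, which I am assuming is equivalent for this purpose.

```python
import re

# The patterns from the list above, translated to Python's `re` syntax.
# Labels are hypothetical names chosen for this sketch.
CHECKS = [
    ("fi before a vowel (possible fi/fl mix-up)", re.compile(r"fi[oaie]")),
    ("letter followed by digit", re.compile(r"[a-zA-Z][0-9]")),
    ("lower-case followed by upper-case", re.compile(r"[a-z][A-Z]")),
    ("space before punctuation", re.compile(r" [,.?!:]")),
    ("suspicious single-letter word", re.compile(r"\b[bcdfhjklmnopqrstuvwxyz]\b")),
    ("impossible word beginning", re.compile(r"\b(?:tl|nr|rr)")),
]


def find_suspects(text):
    """Return (line_number, label, matched_text) for every pattern hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS:
            for match in pattern.finditer(line):
                hits.append((lineno, label, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = "tlie cat sat on the fioor , he11o"
    for lineno, label, matched in find_suspects(sample):
        print(f"line {lineno}: {label}: {matched!r}")
```

Treat the hits as candidates for a manual look rather than certain errors: legitimate words such as "fiat" or names such as "McDonald" will be flagged too.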
Last edited by SBT; 09-19-2015 at 05:23 PM.