MobileRead Forums - View Single Post

SBT · 09-19-2015, 05:21 PM

These are the things I normally look for in tesseract output:
'fi' and 'fl' often get mixed up. I still haven't found the best pattern for this, but /fi[oaie]/ definitely catches a few dodgy ones.
/[a-zA-Z][0-9]/ : a letter followed by a digit is pretty dodgy, though normally this only happens with 0 and 1.
/[a-z][A-Z]/ : a lower-case letter followed by an upper-case one.
/ [,.?!:]/ : Whitespace before punctuation.
/\b[bcdfhjklmnopqrstuvwxyz]\b/ : Single-letter words, excluding 'a' and the letters e, g and i (so 'e.g.' and 'i.e.' are not flagged).
/\b\(tl\|nr\|rr\)/ : Impossible beginnings to English words (this one uses grep/sed-style escaping; in Perl-style syntax it is /\b(tl|nr|rr)/). Any linguists care to expand?
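
For anyone who wants to run all of these in one pass, here is a rough sketch in Python. It is not how I actually do it, just one way it could be done: the function name `suspicious` and the labels are made up for illustration, and the last pattern is rewritten in Python's `re` syntax (unescaped parentheses and pipe).

```python
import re

# The patterns from the list above, in Python `re` syntax.
# The labels are hypothetical shorthand, not part of the original list.
CHECKS = [
    ("fi/fl confusion", re.compile(r"fi[oaie]")),
    ("letter then digit", re.compile(r"[a-zA-Z][0-9]")),
    ("lower then upper", re.compile(r"[a-z][A-Z]")),
    ("space before punctuation", re.compile(r" [,.?!:]")),
    ("odd single-letter word", re.compile(r"\b[bcdfhjklmnopqrstuvwxyz]\b")),
    ("impossible word start", re.compile(r"\b(tl|nr|rr)")),
]


def suspicious(text):
    """Return a list of (line_number, label, matched_text) for every hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS:
            for match in pattern.finditer(line):
                hits.append((lineno, label, match.group(0)))
    return hits


if __name__ == "__main__":
    # Made-up sample line with a few typical OCR slips.
    sample = "The fiower was c1ose to tlie waIl , he said."
    for hit in suspicious(sample):
        print(hit)
```

Expect false positives (e.g. /fi[oaie]/ also hits 'field'), so treat the output as a list of places to eyeball rather than a list of errors.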
