06-17-2013, 07:51 AM   #20
SBT
Proofreading regex

What it does:
Flags typical OCR errors (at least the ones Tesseract produces):
  • Capital letter immediately after lower case letter
  • Digit after letter
  • Lower case after full stop
  • Space before colon/semicolon

Regexp:
Code:
([a-z][A-Z]|[a-zA-Z][0-9]|\. *[a-z]| [;:])
Faults:
False positives, especially abbreviations that are followed by a lower-case letter (for example "e.g. the").
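
To see what the pattern flags in practice, here is a minimal Python sketch. The regexp itself is the one from this post, unchanged; the helper name flag_suspects and the sample text are made up for illustration and are not part of any tool.

```python
import re

# The pattern from the post, unchanged. Python's re module accepts
# this Perl-style syntax as written.
OCR_SUSPECT = re.compile(r"([a-z][A-Z]|[a-zA-Z][0-9]|\. *[a-z]| [;:])")


def flag_suspects(text):
    """Return (line_number, column, matched_text) for each suspicious spot.

    Line numbers and columns are 1-based. (Hypothetical helper, for
    illustration only.)
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for m in OCR_SUSPECT.finditer(line):
            hits.append((line_no, m.start() + 1, m.group(0)))
    return hits


# Invented sample: one error of each kind, plus an abbreviation that
# triggers the false positive listed under "Faults".
sample = (
    "The quick brOwn fox.\n"
    "He saw l0 birds. then left ;\n"
    "See e.g. the appendix."
)

for hit in flag_suspects(sample):
    print(hit)
```

The last sample line contains no OCR error, but "e.g. the" is still flagged twice, which is the abbreviation problem mentioned above.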

Regex variant:
Perl and similar. Put a \ before each (, ) and | to get a "regular" (grep/sed-style basic) regexp.
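
The backslash rule can be applied mechanically. A rough Python sketch follows; the function name to_basic_syntax is made up, and the conversion is deliberately naive: it escapes every (, ) and |, which is fine for this pattern but would be wrong for one that has those characters inside [...] or already escaped. Note also that \| in basic regexps is a GNU extension, so the result is meant for GNU grep/sed and may not work elsewhere.

```python
import re

# The Perl-style pattern from the post.
PERL_STYLE = r"([a-z][A-Z]|[a-zA-Z][0-9]|\. *[a-z]| [;:])"


def to_basic_syntax(pattern):
    """Prefix each (, ) and | with a backslash.

    Naive on purpose: assumes none of these characters appear inside a
    bracket expression or are already escaped. (Hypothetical helper.)
    """
    return re.sub(r"([()|])", r"\\\1", pattern)


print(to_basic_syntax(PERL_STYLE))
```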