What it does:
Flags typical OCR errors (at least those that Tesseract makes):
- Capital letter immediately after a lower-case letter
- Digit immediately after a letter
- Lower-case letter after a full stop (with or without spaces in between)
- Space before a colon or semicolon
Regexp:
([a-z][A-Z]|[a-zA-Z][0-9]|\. *[a-z]| [;:])
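The following is a minimal sketch of how the pattern might be applied, assuming Python and its standard `re` module (which accepts this Perl-style syntax unchanged). The names `OCR_SUSPECT`, `flag_suspects` and the sample sentence are made up for illustration and are not part of the original snippet.

```python
import re

# The pattern from above, unchanged.
OCR_SUSPECT = re.compile(r"([a-z][A-Z]|[a-zA-Z][0-9]|\. *[a-z]| [;:])")

def flag_suspects(text):
    """Return (line_number, column, matched_text) for every suspicious spot.

    Lines are checked one at a time, so a full stop at the end of one line
    followed by a lower-case letter on the next line is not reported.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for m in OCR_SUSPECT.finditer(line):
            hits.append((line_no, m.start() + 1, m.group(0)))
    return hits

# Hypothetical OCR output containing one instance of each error type.
sample = "The quIck brown f0x. jumped ; over"
for hit in flag_suspects(sample):
    print(hit)
# (1, 6, 'uI')
# (1, 17, 'f0')
# (1, 20, '. j')
# (1, 28, ' ;')
```

Columns are 1-based so they line up with what most editors display.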
Faults:
False positives, especially abbreviations that are followed by a lower-case letter (for example "etc. and").
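To get a feel for the false positives, here is a small sketch, again assuming Python's `re` module. The example sentences and the helper name `matches` are illustrative only. Besides abbreviations, the same pattern also trips on dots inside host names and on ordinary letter-digit combinations.

```python
import re

OCR_SUSPECT = re.compile(r"([a-z][A-Z]|[a-zA-Z][0-9]|\. *[a-z]| [;:])")

def matches(text):
    """Return the list of substrings the pattern flags in text."""
    return [m.group(0) for m in OCR_SUSPECT.finditer(text)]

# Correctly spelled text that the pattern still flags.
clean_examples = [
    "Use a spreadsheet, e.g. a ledger.",  # abbreviation followed by lower case
    "See www.example.org for details.",   # dots inside a host name
    "Print it on A4 paper.",              # letter followed by a digit
]
for text in clean_examples:
    print(text, "->", matches(text))
# Use a spreadsheet, e.g. a ledger. -> ['.g', '. a']
# See www.example.org for details. -> ['.e', '.o']
# Print it on A4 paper. -> ['A4']
```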
Regex variant:
Perl and similar (PCRE, extended regexps as in grep -E). To get a "regular" (basic) regexp, as used by plain grep or sed, add \ before (, ) and |.
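The escaping rule can be sketched as a one-line substitution. This assumes Python, and the helper name `to_basic_regexp` is hypothetical. It is deliberately naive: it works for this pattern, but would mangle a pattern that contains already-escaped parentheses or these characters inside a bracket expression. Note also that `\|` in basic regexps is a GNU extension (GNU grep and sed accept it), so other tools may not.

```python
import re

PERL_STYLE = r"([a-z][A-Z]|[a-zA-Z][0-9]|\. *[a-z]| [;:])"

def to_basic_regexp(pattern):
    """Naively put a backslash before every (, ) and | in pattern."""
    return re.sub(r"([()|])", r"\\\1", pattern)

print(to_basic_regexp(PERL_STYLE))
# \([a-z][A-Z]\|[a-zA-Z][0-9]\|\. *[a-z]\| [;:]\)
```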