MobileRead Forums - View Single Post

DiapDealer · 06-14-2012, 12:50 PM

Quote:

Originally Posted by ElMiko

The scenario I'm trying to catch is instances in which OCR software interpreted a ” as a ’. My guess is that the appropriate regex would be something like ’\b(?!\p{Ll}). I could also probably add a negative lookbehind to exclude common instances of the ’ functioning as a (plural) possessive or to denote an omitted character (maybe something like (?<!s|in)). Mostly my question was academic: just a way for me to get a better understanding of how and why regex behaves the way it does.

It could be tough to differentiate possessive apostrophes or contractions from closing single quotes with any accuracy. But you might be able to narrow it down enough that inspecting each occurrence is feasible.

A lot of times (but certainly not always) in a closing quote situation, the previous character is going to be punctuation of some kind. Quotes within quotes will probably foul things up, though.
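As a rough illustration of that "previous character is punctuation" idea, here is a minimal Python sketch. It is only a heuristic for building a list to inspect by hand, not a reliable fix. The pattern, the set of punctuation characters, the function name find_candidates, and the sample sentence are all my own assumptions for illustration. Python's built-in re module is used, which does not support \p{Ll}, so the test is written with a lookbehind on punctuation instead.

```python
import re

# Assumed heuristic: a ’ (U+2019) that directly follows sentence punctuation
# and is followed by whitespace or the end of the text is a candidate for a
# misread ” (U+201D). The punctuation set is a guess; adjust it to the book.
CANDIDATE = re.compile(r"(?<=[.,!?;:\u2026])\u2019(?=\s|$)")

def find_candidates(text, context=20):
    """Return (offset, snippet) pairs so each hit can be checked by eye."""
    hits = []
    for m in CANDIDATE.finditer(text):
        start = max(0, m.start() - context)
        end = min(len(text), m.end() + context)
        hits.append((m.start(), text[start:end]))
    return hits

# Made-up sample: only the ’ after "know," should be flagged; the
# contractions and the plural possessive follow letters, so they are skipped.
sample = "“I don’t know,’ she said. “It’s the boys’ room.”"
for offset, snippet in find_candidates(sample):
    print(offset, repr(snippet))
```

Legitimate closing single quotes (quotes within quotes) will match this too, so every hit still needs to be looked at before replacing anything.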