Old 07-06-2012, 02:26 AM   #1
ElMiko
Evangelist
Posts: 473
Karma: 65460
Join Date: Jun 2011
Device: Kindle
Matching words without using repetition operators

Often, I find that OCR software omits the final punctuation mark between the last letter of a sentence and the closing quote:

Code:
eg. “My job is exhausting” Tom said laboriously.
To find these, I run a regex search for every letter followed immediately by a closing quote. Unfortunately, this also matches cases where a single word is set off in quotation marks:

Code:
eg. Please define the words “trustworthy” and “gullible”.
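To make the problem concrete, here is a minimal Python sketch of that naive search. It assumes Python's built-in re module, which has no \p{L}, so [^\W\d_] stands in as a rough equivalent; the names LETTER, NAIVE and naive_hits are made up for illustration.

```python
import re

# Assumption: [^\W\d_] approximates \p{L}, which Python's re does not support.
LETTER = r"[^\W\d_]"

# A closing quote immediately preceded by a letter (fixed-width lookbehind).
NAIVE = re.compile(rf"(?<={LETTER})”")


def naive_hits(text):
    """Return the index of every closing quote that directly follows a letter."""
    return [m.start() for m in NAIVE.finditer(text)]


missing = "“My job is exhausting” Tom said laboriously."
isolated = "Please define the words “trustworthy” and “gullible”."

print(naive_hits(missing))   # the genuine omission
print(naive_hits(isolated))  # both isolated words are reported as well
```

The second call reports both quoted words, which are exactly the false positives described above.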
I'm hoping I can slightly reduce the number of false positives by excluding instances in which the closing quote is preceded by a single word, which is itself immediately preceded by a single open-quote. My idea was:

Code:
(?<!“[\p{L}]+)(?<=\p{L})”
However, it looks like variable-length repetition (such as +) is not allowed within lookbehind expressions. Does anyone have any ideas?
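One possible way around the restriction, sketched in Python under the assumption that the matches can be filtered in code rather than in the pattern itself: match the whole word before the closing quote, optionally capture an opening quote directly in front of it, and keep only the matches where that capture is empty. Python's built-in re module also rejects the variable-length lookbehind, which the first function checks. [^\W\d_] is a rough stand-in for \p{L}, and LETTER, CANDIDATE, lookbehind_rejected and suspect_quotes are hypothetical names.

```python
import re

# Assumption: [^\W\d_] approximates \p{L}, which Python's re does not support.
LETTER = r"[^\W\d_]"


def lookbehind_rejected():
    """Check whether this engine refuses the variable-length lookbehind."""
    try:
        re.compile(rf"(?<!“{LETTER}+)(?<={LETTER})”")
    except re.error:
        return True
    return False


# Match the run of letters before a closing quote; group 1 captures an
# opening quote if one sits directly in front of that run.
CANDIDATE = re.compile(rf"(“?)({LETTER}+)”")


def suspect_quotes(text):
    """Return indexes of closing quotes after a letter, skipping “word”."""
    hits = []
    for m in CANDIDATE.finditer(text):
        if not m.group(1):            # no opening quote right before the word
            hits.append(m.end() - 1)  # index of the closing quote
    return hits


missing = "“My job is exhausting” Tom said laboriously."
isolated = "Please define the words “trustworthy” and “gullible”."

print(suspect_quotes(missing))
print(suspect_quotes(isolated))
```

With the two sample sentences from this post, the first should yield one position and the second none. Like the original idea, this only reduces false positives: a one-word line of dialogue such as “No” is skipped too.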