07-06-2012, 02:26 AM | #1 |
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
Matching words without using repetition operators
Often, I find that OCR software omits the final punctuation mark between the last letter of a sentence and the closing quotation mark:
Code:
eg. “My job is exhausting” Tom said laboriously.
Code:
eg. Please define the words “trustworthy” and “gullible”.
Code:
(?<!“[\p{L}]+)(?<=\p{L})”
|
07-06-2012, 04:31 AM | #2 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
You could try a two-step (or three-step) process:
1. Replace all single-word quotations with something that prevents a match in the next step. Something like (“[^ ]+)” and replace with \1¬”.
2. Do your normal search for unpunctuated quotes.
3. Remove all ¬.
Anyway, you shouldn't do a global search and replace: there may be cases of multiple quoted words without punctuation, or single-word speeches:
'What do you mean with "I don't know"?' he said.
'Weren't you listening?'
'No.' |
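For anyone scripting this outside the editor, here is a minimal Python sketch of the three steps above. The step 1 pattern is the one from the post; the "normal search" in step 2 is not given in the thread, so the pattern used here (a closing ” directly after a letter) is only an assumption, as is the function name. It lists candidates for review rather than replacing anything, in line with the warning against a global replace.

```python
import re

MARKER = "¬"  # placeholder from the post; assumed not to occur in the book text

def find_unpunctuated_quotes(text):
    """Return (hits, restored_text); nothing is replaced automatically."""
    # Step 1: shield single-word quotations, e.g. “gullible” becomes “gullible¬”.
    shielded = re.sub(r"(“[^ ]+)”", r"\1" + MARKER + "”", text)
    # Step 2: the "normal search" (an assumed pattern): a quotation whose
    # closing ” directly follows a letter. The marker is not a letter, so
    # shielded quotations are skipped.
    hits = re.findall(r"“[^“”]*[^\W\d_]”", shielded)
    # Step 3: remove every marker again.
    restored = shielded.replace(MARKER, "")
    return hits, restored

sample = ("“My job is exhausting” Tom said laboriously. "
          "Please define the words “trustworthy” and “gullible”.")
hits, restored = find_unpunctuated_quotes(sample)
print(hits)
```

On the two example sentences from the first post, only the “My job is exhausting” quotation is reported, and the restored text is identical to the input.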
07-06-2012, 04:38 AM | #3 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
I believe this regex might also help in catching some of these:
Code:
(“[^ ”]+ [^”,]+)(”)
Code:
\1,\2
|
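A small Python sketch of how this search and replace pair behaves, in case it helps to test it outside the editor. The two patterns are copied from the post; the function name is made up for illustration, and Python's re module is assumed to treat these patterns the same way the editor's regex engine does.

```python
import re

# Search and replace patterns copied from the post above.
SEARCH = re.compile(r"(“[^ ”]+ [^”,]+)(”)")
REPLACE = r"\1,\2"

def add_comma(text):
    """Insert a comma before each closing quote matched by SEARCH."""
    return SEARCH.sub(REPLACE, text)

print(add_comma("“My job is exhausting” Tom said laboriously."))
print(add_comma("Please define the words “trustworthy” and “gullible”."))
```

Because the pattern requires a space after the first quoted word, single-word quotations such as “trustworthy” are left alone. It does still match quotations that already end in '!', '?' or '.', so each hit should be checked before replacing.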
07-06-2012, 04:40 AM | #4 |
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
ahhhhh, interesting. Never even occurred to me to break it up like that. Obviously, I'm still holding out hope for something that can be done in a single search, but failing that, your solution will work nicely. Thanks, Jellby!
@Tex2002ans - Thanks for the input! I'm not quite following all the pieces of your search, though... particularly the highlighted part below.
Code:
(“[^ ”]+\s[^”,]+)(”)
Last edited by ElMiko; 07-06-2012 at 04:54 AM. |
07-06-2012, 06:56 AM | #5 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Actually I made a slight mistake in that regex. Here is a better version:
Code:
(“[^ ”]+\s[^”]+[^,!?.])(”)
The right quotation is NEEDED in the green section, [^”]+. This means that after the first word, it will continue to grab everything UP TO the right double quotation.
The characters in blue, [^,!?.], check the last character before the closing quotation and are there to say "if the quotation ends with this character, it is valid, so skip over this." In this case, it says if the last character is a ',', '!', '?', or '.', the quote is valid.
The orange section, (”), just grabs the right quotation and makes it easy to do a Search and Replace.
Code:
“My job is exhausting. My job is very exhausting! Did I mention that my job is extremely exhausting” Tom said laboriously.
Code:
(“[^ ”]+\s[^”]+[^g])(”)
Last edited by Tex2002ans; 07-06-2012 at 07:13 AM. |
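A quick way to check the revised search is a short Python sketch like the one below. It assumes the slashes in the posted character class were meant as escapes, so the class is written here as [^,!?.]; the function name is hypothetical.

```python
import re

# Revised search pattern, with the punctuation class written as [^,!?.]
# (an assumption about what the slashes in the post were meant to do).
SEARCH = re.compile(r"(“[^ ”]+\s[^”]+[^,!?.])(”)")

def unpunctuated(text):
    """Return every quotation whose last character before ” is not , ! ? or ."""
    return [m.group(0) for m in SEARCH.finditer(text)]

sample = ("“My job is exhausting. My job is very exhausting! Did I mention "
          "that my job is extremely exhausting” Tom said laboriously.")
print(unpunctuated(sample))
print(unpunctuated("“Stop it now!” he shouted."))
```

The long example quotation is reported as a single hit because its last character is a letter, while a quotation that already ends in '!' produces no match.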