Go Back   MobileRead Forums > E-Book Software > Sigil

Old 07-06-2012, 02:26 AM   #1
ElMiko
Addict
Posts: 299
Karma: 56788
Join Date: Jun 2011
Device: Kindle
Matching words without using repetition operators

Often, I find that OCR software omits the final punctuation mark between the last letter of a sentence and the closing quote:

Code:
eg. “My job is exhausting” Tom said laboriously.
What I basically do is a regex search for all instances of a letter followed immediately by a closing quote. Unfortunately, this matches instances where a single word is being isolated by quotation marks:

Code:
eg. Please define the words “trustworthy” and “gullible”.
I'm hoping I can slightly reduce the number of false positives by excluding instances in which the closing quote is preceded by a single word, which is itself immediately preceded by a single open-quote. My idea was:

Code:
(?<!“[\p{L}]+)(?<=\p{L})”
However, it looks like repetition operators such as + are not allowed within lookbehind expressions (the lookbehind has to have a fixed length). Does anyone have any ideas?
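To make the problem concrete outside of Sigil, here is a minimal sketch in Python. It assumes Python's built-in `re` module, which has no `\p{L}`, so `[^\W\d_]` stands in for "any letter"; the variable names are only illustrative. It shows the plain search hitting both the wanted case and the isolated words, and the variable-length lookbehind being rejected.

```python
import re

# Stand-in for \p{L}: Python's built-in re does not support \p{L},
# so "word character that is not a digit or underscore" is used instead.
LETTER = r"[^\W\d_]"

# The plain search: a closing quote immediately preceded by a letter.
naive = re.compile(rf"(?<={LETTER})”")

missing = "“My job is exhausting” Tom said laboriously."
isolated = "Please define the words “trustworthy” and “gullible”."

print(len(naive.findall(missing)))   # the case we want to catch
print(len(naive.findall(isolated)))  # the false positives

# The attempted lookbehind contains "+", so its length is not fixed
# and the pattern is rejected when it is compiled.
try:
    re.compile(rf"(?<!“{LETTER}+)(?<={LETTER})”")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

print(lookbehind_ok)
```

Sigil's own regex engine may word the error differently, but the restriction on the lookbehind is the same kind of limitation.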
Old 07-06-2012, 04:31 AM   #2
Jellby
frumious Bandersnatch
Posts: 6,197
Karma: 4800739
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
You could try a two-step (or three-step) process:

1. Replace all single-word quotations with something that prevents a match in the next step. Something like (“[^ ]+)” and replace with \1¬”.
2. Do your normal search for unpunctuated quotes.
3. Remove all ¬
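The three steps above can be sketched in Python as follows. This is only an illustration, not something Sigil runs: the function name `three_step` and the `punct` parameter are made up for the example, `[^\W\d_]` stands in for `\p{L}`, and the marker ¬ is assumed not to occur anywhere in the book.

```python
import re

MARK = "¬"  # temporary marker; assumed to be absent from the text


def three_step(text, punct=","):
    """Insert `punct` before closing quotes that follow a letter,
    skipping single-word quotations (hypothetical helper)."""
    # Step 1: shield single-word quotations by putting the marker
    # between the word and its closing quote.
    shielded = re.sub(r"(“[^ ]+)”", r"\1" + MARK + "”", text)
    # Step 2: the normal search for a letter followed by a closing quote.
    # Shielded quotes no longer match because the marker is not a letter.
    fixed = re.sub(r"(?<=[^\W\d_])”", punct + "”", shielded)
    # Step 3: remove all markers again.
    return fixed.replace(MARK, "")


sample = ("“My job is exhausting” Tom said. "
          "Define “trustworthy” and “gullible”.")
print(three_step(sample))
```

The sketch replaces everything in one go only to keep it short; in practice each hit from step 2 is better reviewed by hand.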

Anyway, you shouldn't do a global search and replace, since there may be cases of multiple quoted words that correctly have no punctuation, or single-word speeches:

'What do you mean with "I don't know"?' he said. 'Weren't you listening?'
'No.'
Old 07-06-2012, 04:38 AM   #3
Tex2002ans
Fanatic
Posts: 509
Karma: 392101
Join Date: Jul 2012
Device: Nook
I believe this regex might also help in catching some of these:

Code:
(“[^ ”]+ [^”,]+)(”)
then you can Replace with:

Code:
\1,\2
Where the comma is the punctuation you want to insert before the right double quotation mark.
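For anyone who wants to try this search and replace outside Sigil, a small Python sketch follows. The regex and the replacement are the ones given above; the function name `insert_comma` and the sample sentences are just for illustration.

```python
import re

# Same search as above: opening quote, first word, a space,
# then the rest of the quotation (no commas), then the closing quote.
pattern = re.compile(r"(“[^ ”]+ [^”,]+)(”)")


def insert_comma(text):
    # Same replacement as above: group 1, a comma, then the closing quote.
    return pattern.sub(r"\1,\2", text)


print(insert_comma("“My job is exhausting” Tom said laboriously."))
print(insert_comma("Please define the words “trustworthy” and “gullible”."))
print(insert_comma("“It is late,” she said."))
```

Single-word quotations and quotations that already end in a comma are left alone. Note that `[^”,]+` excludes commas everywhere after the first space, so a quotation with a comma in the middle is skipped as well.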
Old 07-06-2012, 04:40 AM   #4
ElMiko
Addict
Posts: 299
Karma: 56788
Join Date: Jun 2011
Device: Kindle
ahhhhh, interesting. Never even occurred to me to break it up like that. Obviously, I'm still holding out hope for something that can be done in a single search, but failing that, your solution will work nicely. Thanks, Jellby!
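On the hope for a single search: one possible workaround, sketched here in Python rather than in Sigil's Find box, is to let an alternation swallow whole single-word quotations first, so the second alternative never sees their closing quote. This technique was not proposed in the thread; `[^\W\d_]` stands in for `\p{L}`, and the name `suspicious_quotes` is hypothetical.

```python
import re

LETTER = r"[^\W\d_]"  # stand-in for \p{L} in Python's built-in re

# Alternative 1 consumes a complete single-word quotation.
# Alternative 2 is the normal search and captures the closing quote in group 1.
single_pass = re.compile(rf"“{LETTER}+”|(?<={LETTER})(”)")


def suspicious_quotes(text):
    """Return the positions of closing quotes that follow a letter
    but do not close a single-word quotation."""
    return [m.start(1) for m in single_pass.finditer(text) if m.group(1)]


text = ("Define “trustworthy” and “gullible”. "
        "“My job is exhausting” Tom said.")
hits = suspicious_quotes(text)
print(hits)
```

Only matches with group 1 set are real hits, which needs a bit of code around the regex, so this is a scripting approach and not a drop-in Find expression.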

@Tex2002ans - Thanks for the input! I'm not quite following all the pieces of your search, though... particularly the highlighted part below.

Code:
(“[^ ”]+\s[^”,]+)(”)

Last edited by ElMiko; 07-06-2012 at 04:54 AM.
Old 07-06-2012, 06:56 AM   #5
Tex2002ans
Fanatic
Posts: 509
Karma: 392101
Join Date: Jul 2012
Device: Nook
A single word in between quotes will have ZERO spaces, while a quotation with multiple words will have AT LEAST ONE space.

Actually I made a slight mistake in that regex. Here is a better version:

Code:
(“[^ ”]+\s[^”]+[^,!?.])(”)
The first section, (“[^ ”]+\s, will grab the left quotation mark + the first word + a space. Since a single word in quotations does not include a space, this section makes the regex match only when there are TWO or more words between the quotes.

The right quotation mark is NEEDED inside the second section, [^”]+. This means that after the first word, it will continue to grab everything UP TO the right double quotation mark.

The third section, [^,!?.], checks the last character before the closing quote. It says "if the quotation ends with one of these characters, it is valid, so skip over this."

In this case, it says that if the last character is a ',', '!', '?', or '.', the quote is valid and is not matched.

The last section, (”), just grabs the right quotation mark and makes it easy to do a Search and Replace.

Code:
“My job is exhausting. My job is very exhausting! Did I mention that my job is extremely exhausting” Tom said laboriously.
If I wanted to say every quotation which ends with a 'g' is valid, I can use the regex:

Code:
(“[^ ”]+\s[^”]+[^g])(”)
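The same idea can be written as a commented pattern with a configurable set of "valid" ending characters. This is a rough Python sketch, not a Sigil feature: `build_pattern` and `valid_endings` are names invented for the example, and the default set ,!?. is only an assumption about which endings count as valid.

```python
import re


def build_pattern(valid_endings=",!?."):
    """Build the multi-word-quotation search with a chosen set of
    ending characters that count as valid (hypothetical helper)."""
    last_char = "[^" + re.escape(valid_endings) + "]"
    return re.compile(
        rf"""
        (               # group 1: everything before the closing quote
          “[^ ”]+\s     # opening quote, first word, one whitespace character
          [^”]+         # rest of the quotation, without the closing quote
          {last_char}   # last character, which must not be a valid ending
        )
        (”)             # group 2: the closing quote itself
        """,
        re.VERBOSE,
    )


default = build_pattern()
print(default.sub(r"\1.\2", "“My job is exhausting” Tom said."))
print(default.search("“It is late.” Tom said."))

# Treating a final 'g' as valid, as in the regex above.
ends_in_g = build_pattern("g")
print(ends_in_g.search("“My job is exhausting” Tom said."))
```

With the default set, the first sentence gets a full stop inserted and the already punctuated one is not matched; with "g" as the only valid ending, the first sentence is no longer matched either.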

Last edited by Tex2002ans; 07-06-2012 at 07:13 AM.