01-22-2025, 08:03 AM   #1
ElMiko
Addict
Posts: 326
Karma: 56788
Join Date: Jun 2011
Device: Kindle
[RegEx] How to match a string occurring somewhere between quotation marks

I am trying to find a word, but only when it occurs between quotation marks. For example, I want to find the word "were", but only when it occurs in dialogue (the punctuation has been smartened, by the way). As in:

Code:
“That’s where were going!”
Normally I'd try to use a combination of lookaheads and lookbehinds to find it, as in:

Code:
(?<=“.*?)\bwere\b(?=.*?”)
But because lookaheads/lookbehinds need a defined character length (I think), the above won't actually work...

Is there a way to match *and* isolate the "were"? Or is matching the whole dialogue string the best one can hope for (e.g. “.*?\bwere\b.*?”)?

Last edited by ElMiko; 01-22-2025 at 08:18 AM.
01-22-2025, 09:37 AM   #2
KevinH
Sigil Developer
Posts: 8,304
Karma: 5568878
Join Date: Nov 2009
Device: many
Why use lookaheads and lookbehinds at all, given that smart quotes make the start and end of a quotation directional?

Have you tried something simpler like:

“[^”]*\s(were)[,;!?.\s][^“”]*”

So it looks for an opening quote, then any number of characters that are not a closing quote, followed by a whitespace character and then the word in question. The word must be followed by either whitespace or a punctuation mark, then any number of characters that are neither an opening nor a closing quote, and finally a closing quote.

Give a version of that a try.

We use the same approach when using regex to find the next opening or closing tag, with < and > in place of the smart quotes.
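For anyone who wants to try the pattern outside Sigil, here is a minimal Python sketch. It assumes the built-in re module rather than Sigil's own search engine, and the sample sentence and variable names are made up for illustration. The overall match is the whole quotation, but the parenthesised group still gives the position of the word on its own.

```python
import re

# KevinH's pattern, unchanged. Group 1 is the word itself.
DIALOGUE_WERE = re.compile(r"“[^”]*\s(were)[,;!?.\s][^“”]*”")

# Made-up sample text for illustration.
text = "He frowned. “That’s where were going!” They were late."

m = DIALOGUE_WERE.search(text)
whole_match = m.group(0)      # the entire quotation
start, end = m.span(1)        # offsets of just the captured word
isolated = text[start:end]

# Splice a replacement in at the captured span only.
fixed = text[:start] + "we’re" + text[end:]
print(whole_match)
print(isolated)
print(fixed)
```

The "were" in the narration after the closing quote is left alone, because the pattern only matches between an opening and a closing quote.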

Last edited by KevinH; 01-22-2025 at 10:45 AM.
01-22-2025, 09:46 AM   #3
DiapDealer
Grand Sorcerer
Posts: 28,173
Karma: 201721072
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
Originally Posted by ElMiko View Post
Normally I'd try to use a combination of lookaheads and lookbehinds to find it, as in:

Code:
(?<=“.*?)\bwere\b(?=.*?”)
But because lookaheads/lookbehinds need a defined character length (I think), the above won't actually work...
Lookaheads don't require a defined width, neither in the built-in re module nor in the Barnett regex module included with Sigil's bundled Python. The Barnett regex module has the added advantage of allowing variable-width lookbehinds as well. Just use "import regex as re" to work with existing code.

But you're right that the built-in re module does not allow variable-width lookbehinds.
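The difference is easy to check for yourself. This is a small sketch using only Python's built-in re module (it says nothing about Sigil's own search engine), and the helper name compiles is just something made up for the example.

```python
import re

def compiles(pattern):
    # True if the built-in re module accepts the pattern.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Variable-width lookahead: accepted.
lookahead_ok = compiles(r"\bwere\b(?=.*?”)")

# Variable-width lookbehind: rejected by the built-in module.
lookbehind_ok = compiles(r"(?<=“.*?)\bwere\b")

# Fixed-width lookbehind: accepted.
fixed_lookbehind_ok = compiles(r"(?<=“)were")

print(lookahead_ok, lookbehind_ok, fixed_lookbehind_ok)
```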

Last edited by DiapDealer; 01-22-2025 at 09:49 AM.
01-22-2025, 03:06 PM   #4
ElMiko
Addict
Quote:
Originally Posted by KevinH View Post
Have you tried something simpler like:

“[^”]*\s(were)[,;!?.\s][^“”]*”
So when I try this, I get the whole dialogue returned, as in:

“That’s where were going!”

rather than just the "were". This is a similar result to my clunkier

Code:
“.*?\bwere\b.*?”
which, as you note, also takes advantage of the directional nature of the quotation marks.

@DiapDealer — Ahhhh.... see, that's why it pays to hedge one's statements when one doesn't know what the heck he's talking about! That's really helpful.

Full disclosure, I'm still using an ancient version of Sigil (0.7.2). How can I—or can I even—use the Barnett regex module with it?

Otherwise, I've modified my search to:

Code:
\bwere\b(?=[^“]*?”)
Which seems to mostly work....
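A rough way to see why the lookahead-only version mostly works: it accepts a "were" whenever the next double quotation mark after it is a closing one. The sketch below uses Python's built-in re module and a made-up sample sentence. It assumes the quotes in the text are balanced, which is probably where the "mostly" comes from.

```python
import re

# The lookahead-only search from the post above.
LOOKAHEAD_WERE = re.compile(r"\bwere\b(?=[^“]*?”)")

# Made-up sample: one "were" before the dialogue, one inside, one after.
text = "They were late. “That’s where were going!” she said. We were lost."

hits = [m.start() for m in LOOKAHEAD_WERE.finditer(text)]
inside_dialogue = text.index("were going")
print(hits, inside_dialogue)
```

Only the "were" inside the quotation is reported. The first one is followed by an opening quote before any closing quote, and the last one has no closing quote after it at all.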
01-22-2025, 03:14 PM   #5
JSWolf
Resident Curmudgeon
Posts: 77,845
Karma: 142032074
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Why not upgrade your Sigil to the latest version?
01-22-2025, 03:23 PM   #6
DiapDealer
Grand Sorcerer
Oh, my bad. I was thinking of the bundled Python available for plugins in newer versions of Sigil. You are definitely limited to standard PCRE regex when using Sigil's search and replace. So lookaheads can be variable-width, but not lookbehinds. Sorry.

You should still be able to get Kevin's search working, though. Why do you need to capture JUST the "were"? What's the end goal? Are you looking to replace "were" with something else, including nothing (deleting)?

If the lookbehind is what's holding you back, try refactoring it with \K instead. \K tells the engine to pretend the match starts immediately after the \K: everything matched before it is still required, but is left out of the reported match.

Last edited by DiapDealer; 01-22-2025 at 03:42 PM.
01-22-2025, 03:42 PM   #7
ElMiko
Addict
Quote:
Originally Posted by DiapDealer View Post
Oh, my bad. I was thinking of the bundled Python available for plugins in newer versions of Sigil. You are definitely limited to standard PCRE regex when using Sigil's search and replace. So lookaheads can be variable-width, but not lookbehinds. Sorry.

You should still be able to get Kevin's search working, though. Why do you need to capture JUST the "were"? What's the end goal? Are you looking to replace "were" with something else, including nothing (deleting)?
Yeah, I knew that there were things the newer toy was going to do a heck of a lot better than the old, but I just couldn't get the WYSIWYG functionality working the way I wanted in the newer "plug-in" version, and—to respond to JSWolf—since I do A LOT of formatting in the WYSIWYG editor, I'm stuck with Old Faithful.

In response to your question, basically I'm trying to isolate instances of "were" that ought to have been "we're". In other words, instances in which the apostrophe denoting a contraction has not been captured by the original OCR.

This mostly occurs in dialogue (rather than narrative), so I'm trying to quickly review the instances of the word and replace it if appropriate.

What I've got now (following your revelation about the lookahead) is:
Code:
(?<!\b[Tt]hey |\b[Ww]e |\b[Tt]he[rs]e |\b[Oo]thers |\b[Pp]eople |\b[Ss]ome |\b[Ss]he |\b[Yy]ou |\bit )\b([Ww])ere\b(?=[^“]*?”)
with the replace value as:
Code:
\1e’re
I suspect there will be some tail cases where this doesn't work, but it's already a darn sight better than it was: 23 matches now (clunky, but do-able) vs 187 matches then (pure nightmare).

Thanks, guys!
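For anyone wanting to run the same search and replace from a script, here is a hedged Python sketch. One assumption to flag: Python's built-in re module requires every lookbehind to have a single fixed width, so the one big negative lookbehind is split into a chain of separate ones, which should be equivalent (none of the listed words may come directly before the match). The names SUBJECTS and WERE_FIX and the sample sentence are made up for the example.

```python
import re

# The words from the original lookbehind, one alternative each.
SUBJECTS = [
    r"[Tt]hey", r"[Ww]e", r"[Tt]he[rs]e", r"[Oo]thers", r"[Pp]eople",
    r"[Ss]ome", r"[Ss]he", r"[Yy]ou", r"it",
]

# One fixed-width negative lookbehind per word, chained together.
guards = "".join(rf"(?<!\b{s} )" for s in SUBJECTS)

# Same word match and lookahead as in the post; group 1 keeps the case of W/w.
WERE_FIX = re.compile(guards + r"\b([Ww])ere\b(?=[^“]*?”)")

# Made-up sample text for illustration.
text = "“We were late, so were leaving. Were done!” The others were gone."

fixed = WERE_FIX.sub(r"\1e’re", text)
print(fixed)
```

"We were" and "others were" are skipped by the guards, while the two bare instances inside the quotation are changed to "we’re" and "We’re".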

Last edited by ElMiko; 01-22-2025 at 03:47 PM.
01-22-2025, 03:53 PM   #8
DiapDealer
Grand Sorcerer
Try something like:

Code:
“[^”]*\b\Kwere\b(?=.*?”)
As mentioned above, \K restarts the match from that point (basically forgetting what came before). There are very few places where it can't be used to get around the lack of variable-width lookbehinds.
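To my knowledge Python's built-in re module has no \K, so the sketch below only imitates it: the part of the pattern that sat before \K goes into a capture group, and the end of that group is treated as the start of the match. This is an approximation written for illustration, not what Sigil does internally, and the name EMULATED_K and the sample texts are made up.

```python
import re

# The pattern above, with the part before \K wrapped in group 1 instead.
EMULATED_K = re.compile(r"(“[^”]*\b)were\b(?=.*?”)")

def isolate(text):
    # Spans of just the word, i.e. what \K would report as the match.
    return [(m.end(1), m.end()) for m in EMULATED_K.finditer(text)]

text = "They were late. “That’s where were going!” she said."
spans = isolate(text)
words = [text[s:e] for s, e in spans]

# With the greedy [^”]* only the last "were" in a quotation is hit per pass.
twice = "“We were sure were going,” he said."
last_only = [s for s, _ in isolate(twice)]
print(words, last_only)
```

One thing the emulation shows: because the prefix is greedy and each match has to start at an opening quote, a quotation containing several instances of the word yields only the last one per pass, so a second run may be needed to reach the earlier ones.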
01-22-2025, 03:56 PM   #9
ElMiko
Addict
Oh...
My...
Sexy...

Well that's a new tool for the toolbox. Amazing. Thank you!
01-22-2025, 03:58 PM   #10
DiapDealer
Grand Sorcerer
No problem. I almost forgot about it, actually. It used to be one of my go-tos.