#1 |
|
Fanatic
Posts: 516
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
|
Indefinite length lookbehind
I'm trying to find instances of the following string:
Code:
said. “
I want to exclude matches where the string is preceded by a word that is itself preceded by a closing curly quotation mark, e.g.:
Code:
” Jack said. “
Code:
(?<!”\s\w+?\s)said\. “
I was under the impression that Sigil 2.4.2's regex engine natively allows indefinite-length lookbehinds. What am I doing wrong? Is there a different syntax that needs to be used for indefinite-length lookbehinds?
Last edited by ElMiko; 05-17-2025 at 06:15 AM.
#2 |
|
A Hairy Wizard
Posts: 3,413
Karma: 20212733
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
Try giving it a batch of characters to choose from, enclosed in square brackets?
Code:
(?<!”[\s\w\.]+)said\. “
Edit: You might even try replacing the \. in the negative lookbehind pattern with \p{P}, which catches any punctuation instead of just periods.
Code:
(?<!”[\s\w\p{P}]+)said\. “
Last edited by Turtle91; 05-17-2025 at 08:17 AM.
#3 |
|
Fanatic
Posts: 516
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
|
@Turtle91 - I never cried as a child, and I started reciting sonnets in the natal ward.
Unfortunately, this syntax doesn't work either. As with my attempt, the element that breaks it is the quantifier "+", which is basically the bit of the search that is supposed to make it indefinite in length!
The problem I'm trying to solve is that the OCR misread many commas as periods, resulting in text like:
Code:
He turned as Charles said. “Howdy!"
Code:
“Let's go,” Charles said. “I think I'm done here.”
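A side note for anyone doing this cleanup outside Sigil: one way to avoid the lookbehind altogether is to match the blocking context in an optional capture group and only rewrite the period when that group did not take part in the match. This is a rough sketch using Python's standard `re` module, not Sigil's PCRE2 engine; the function name `fix_ocr_periods` is made up for illustration, and the sample strings are adapted from the examples above.

```python
import re

# Optional group 1 captures the context that should block the fix:
# a closing curly quote, whitespace, one word, whitespace.
PATTERN = re.compile(r"(”\s\w+\s)?said\. “")

def fix_ocr_periods(text: str) -> str:
    """Turn 'said. “' into 'said, “' unless preceded by '” <word> '."""
    def repl(match):
        if match.group(1) is not None:
            # Preceded by a closing quote and a word: leave untouched.
            return match.group(0)
        return "said, “"
    return PATTERN.sub(repl, text)

print(fix_ocr_periods("He turned as Charles said. “Howdy!”"))
print(fix_ocr_periods("“Let's go,” Charles said. “I think I'm done here.”"))
```

The same idea should carry over to a find-and-replace with a capture group in other engines, but that is an assumption rather than something tested in Sigil.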
#4 |
|
Fanatic
Posts: 516
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
|
Hmmm, I found this...
But with all respect to the author, I can't make heads or tails of the explanation, much less how to apply it to anything other than matching the letter "X". Just as importantly, I can't even get it to match the letter "X" in any given Sigil file...
Last edited by ElMiko; 05-17-2025 at 09:07 AM.
#5 |
|
Sigil Developer
Posts: 9,126
Karma: 6404930
Join Date: Nov 2009
Device: many
|
See what the PCRE2 maintainer had to say when he implemented this in 2023:
https://github.com/PCRE2Project/pcre2/issues/269
It seems the PCRE2 approach requires a bounded maximum range in the lookbehind (e.g. {1,10}) rather than an open-ended +.
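To illustrate why a bounded range is tractable where an open-ended + is not: a negative lookbehind with a bounded range can be spelled out as one fixed-width lookbehind per allowed length. The sketch below does that with Python's standard `re` module (which only accepts fixed-width lookbehinds, and is not the PCRE2 engine Sigil uses). The helper name `bounded_not_preceded` and the cap of 10 word characters are assumptions for illustration; the sample lines are the test case posted later in this thread.

```python
import re

def bounded_not_preceded(max_word_len: int) -> str:
    # Builds one fixed-width negative lookbehind per word length,
    # from 1 up to max_word_len, and chains them together.
    return "".join(
        rf"(?<!”\s\w{{{n}}}\s)" for n in range(1, max_word_len + 1)
    )

pattern = re.compile(bounded_not_preceded(10) + r"said\. “")

sample = [
    "<p>Lorem “ipsum dolor” Jack said. “</p>",
    "<p>Lorem ipsum dolor said. “</p>",
    "<p>Dolor amet said. “</p>",
]

# Keep only the lines where the pattern finds a match.
hits = [line for line in sample if pattern.search(line)]
print(hits)
```

Only the second and third lines should come back, since the first has a closing quote and a single word directly before "said".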
#6 | |
|
Fanatic
Posts: 516
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
|
Quote:
Code:
(?<!”\s\w{1,10}\s)said\. “
#7 |
|
Sigil Developer
Posts: 9,126
Karma: 6404930
Join Date: Nov 2009
Device: many
|
What does the exact error message say? If you hover the mouse over the Find field or the valid-regex symbol, does it show the exact error message?
Try a character range rather than a word range. Did that change the error?
#8 | |
|
Sigil Developer
Posts: 9,126
Karma: 6404930
Join Date: Nov 2009
Device: many
|
I checked the pcre2 source for changes and saw this:
Quote:
Are you using an assertion properly? A more specific error message might help if you can get one. |
#9 |
|
Grand Sorcerer
Posts: 5,763
Karma: 24088559
Join Date: Dec 2010
Device: Kindle PW2
|
@ElMiko A long time ago I created a throw-away Sigil regex tester validation plugin that should theoretically work for your regex.
After the installation you'll find the plugin under Plugins > Validation > RegexTester. (You'll need to select the "regex" engine.)
In my test case:
Code:
<p>Lorem “ipsum dolor” Jack said. “</p>
<p>Lorem ipsum dolor said. “</p>
<p>Dolor amet said. “</p>
Code:
(?<!”\s\w+\s)said\. “
#10 | |
|
Fanatic
Posts: 516
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
|
Quote:
I've tried several permutations of the regex:
Code:
\w \S \u \l \D [a-z] .
@Doitsu - Yeah, I don't know what's going on.
#11 |
|
Well trained by Cats
Posts: 31,342
Karma: 62025226
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
\S is not \s
\S is not a space char |
#12 |
|
Fanatic
Posts: 516
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
|
I know that, theducks. That's why I used it. Non-space, followed by min/max range, followed by space. But even if it were a mistake, the point is that the STRUCTURE is being interpreted as invalid.
But also, even if it had been a mistake, it wouldn't explain why the other variants aren't working either.
Last edited by ElMiko; 05-17-2025 at 08:26 PM.
#13 |
|
Sigil Developer
Posts: 9,126
Karma: 6404930
Join Date: Nov 2009
Device: many
|
It is our long holiday weekend in Canada, but I finally got some time to test things in Sigil on my only laptop up here at my cottage. It is running a pre-release version of the forthcoming Sigil v2.50.
I decided to test the example cited in one of the issues posted at PCRE2 in the link I posted earlier. In my xhtml file I have:
Code:
<p> 0xxxy </p>
Code:
(?<=0x{1,6})y
Would you please try this test with your Sigil 2.4.2 and let me know if you get the same thing? Perhaps there was a bug in PCRE2 10.44 that got fixed in PCRE2 10.45, which is in the upcoming release of Sigil. I will try Doitsu's test next.
Last edited by KevinH; 05-17-2025 at 09:29 PM.
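For comparison only, here is how the same 0xxxy test behaves in an engine with no variable-length lookbehind support at all. This is a small sketch with Python's standard `re` module, which is an assumption of convenience and says nothing about how PCRE2 10.44 or 10.45 behave: the bounded form is rejected at compile time, while an alternation of fixed-width lookbehinds (one per repetition count from 1 to 6) is accepted and finds the y.

```python
import re

text = "<p> 0xxxy </p>"

# Python's `re` rejects the variable-width form at compile time.
try:
    re.compile(r"(?<=0x{1,6})y")
    variable_width_ok = True
except re.error as err:
    variable_width_ok = False
    print("rejected:", err)

# Equivalent bounded form: one fixed-width lookbehind per repetition count.
alternatives = "|".join(rf"(?<=0x{{{n}}})" for n in range(1, 7))
expanded = re.compile(rf"(?:{alternatives})y")

match = expanded.search(text)
print(match.start() if match else None)
```

If an engine reports an error for the bounded form too, the exact message is usually the quickest clue as to which construct it objects to.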
#14 |
|
Sigil Developer
Posts: 9,126
Karma: 6404930
Join Date: Nov 2009
Device: many
|
Okay, I tested the following in Sigil 2.50 (pre-release):
The xhtml file was:
Code:
<p>Lorem “ipsum dolor” Jack said. “</p>
<p>Lorem ipsum dolor said. “</p>
<p>Dolor amet said. “</p>
<p> 0xxxy </p>
Code:
(?<!”\s\w{1,6}\s)said\. “
So as far as I can tell with these examples, all is working. But again, this version of Sigil has a newer version of PCRE2 (10.45) than the one that came with Sigil 2.4.2 (10.44), so since you are seeing something different I would guess that there was a PCRE2 bug in 10.44 that got fixed.
If it is any help, we are hoping to do final updates of the translations this week and will try to make a full release by next weekend if both of us can work it into our schedules. If you desperately need something immediately, I can generate a CI build of current Sigil master (it will be missing translations in most languages) and make a link available to you.
But please test your Sigil 2.4.2 build and let us know if it fails these very specific tests (i.e. if there was a PCRE2 bug).
Last edited by KevinH; 05-17-2025 at 09:37 PM.
#15 |
|
Fanatic
Posts: 516
Karma: 65460
Join Date: Jun 2011
Device: Kindle Voyage, Boox Go 7
|
No rush on my end; I can put a pin in this one until the next release. Thanks, as always, KevinH!