Today, 04:21 AM   #1
ElMiko
Addict
Posts: 391
Karma: 65460
Join Date: Jun 2011
Device: Kindle
Indefinite length lookbehind

I'm trying to find instances of the following string

Code:
said. “
(where the quotation mark is a curly opening quotation mark)

I want to exclude matches where the string is preceded by a word, preceded by a closing curly quotation mark. e.g.

Code:
” Jack said. “
My solution for this would be a regex search like:

Code:
(?<!”\s\w+?\s)said\. “
However, as soon as I add the quantifier "+", I get the invalid-regex error message.

I was under the impression that 2.4.2's regex natively allows for indefinite length lookbehinds. What am I doing wrong? Is there a different syntax that needs to be used for indefinite length lookbehinds?

Last edited by ElMiko; Today at 05:15 AM.
Today, 06:58 AM   #2
Turtle91
A Hairy Wizard
Posts: 3,313
Karma: 20171571
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
Try giving it a batch of characters to choose from enclosed with square brackets?
Code:
(?<!”[\s\w\.]+)said\. “
Although I have a feeling that there is a much simpler way to accomplish what you are trying to do instead of this very narrow regex. What is your end goal??

Edit:
You might even try tokenizing the \. in the negative look behind pattern to catch any punctuation \p{P} instead of just periods.
Code:
(?<!”[\s\w\p{P}]+)said\. “
Please note: I’m replying on my phone after spending a very sleepless night with a grandson who is sprouting some molars … please test it and forgive any errors…

Last edited by Turtle91; Today at 07:17 AM.
Today, 07:30 AM   #3
ElMiko
Addict
@Turtle91 - I never cried as a child, and I started reciting sonnets in the natal ward.

Unfortunately, this syntax doesn't work either. As with my attempt, the element that breaks it is the quantifier "+"—basically, the bit of the search that is supposed to be making it indefinite in length!

The problem I'm trying to solve is that the OCR misread many commas as periods, resulting in text like:

Code:
He turned as Charles said. “Howdy!"
However, if I just search for "said\. “", I'll get false positives such as

Code:
“Let's go,”  Charles said. “I think I'm done here.”
Hence the structure of my search and the negative lookbehind.
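For this kind of OCR cleanup, one way to sidestep variable-length lookbehind altogether is to let the pattern optionally consume the closing quote and the speaker's name, then skip any hit where that optional group took part. Below is a minimal sketch using Python's standard `re` module rather than Sigil's PCRE2 engine. `PATTERN` and `fix_misread_periods` are hypothetical names, and the sketch assumes exactly one word sits between the closing quote and "said".

```python
import re

# Group 1 is optional: it matches a closing curly quote, whitespace, one word,
# and whitespace directly before 'said. “'. When it participates, the hit is a
# normal attribution such as: ” Charles said. “
# \s+ (rather than \s) tolerates stray double spaces in the OCR text.
PATTERN = re.compile(r'(”\s+\w+\s+)?said\. “')


def fix_misread_periods(text):
    """Turn 'said. “' into 'said, “' unless it follows '” <word> '."""
    def repl(match):
        if match.group(1) is not None:
            return match.group(0)  # preceded by ” + word: leave untouched
        return 'said, “'           # likely an OCR-misread comma
    return PATTERN.sub(repl, text)


print(fix_misread_periods('He turned as Charles said. “Howdy!”'))
print(fix_misread_periods('“Let\'s go,” Charles said. “I think I\'m done here.”'))
```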
Today, 08:04 AM   #4
ElMiko
Addict
Hmmm, I found this...

But with all respect to the author, I can't make heads or tails of the explanation... much less how to apply it to anything other than matching the letter "X"...

Just as importantly, I can't even get it to match the letter "X" in any given Sigil file...

Last edited by ElMiko; Today at 08:07 AM.
Today, 09:32 AM   #5
KevinH
Sigil Developer
Posts: 8,485
Karma: 5703586
Join Date: Nov 2009
Device: many
See what the PCRE2 maintainer had to say when he implemented this in 2023 here:

https://github.com/PCRE2Project/pcre2/issues/269

It seems the PCRE2 approach requires a backwards max range and not a +
Today, 10:29 AM   #6
ElMiko
Addict
Quote:
Originally Posted by KevinH View Post
See what the PCRE2 maintainer had to say when he implemented this in 2023 here:

https://github.com/PCRE2Project/pcre2/issues/269

It seems the PCRE2 approach requires a backwards max range and not a +
Afraid this doesn't seem to be working either. I tried:

Code:
(?<!”\s\w{1,10}\s)said\. “
and still get the "Regex Valid?" error message.
Today, 10:43 AM   #7
KevinH
Sigil Developer
What does the exact error message say? If you mouse over the Find field or the valid-regex symbol, does it show the exact error message?
Try a character range instead of a word range. Did that change the error?
Today, 11:14 AM   #8
KevinH
Sigil Developer
I checked the pcre2 source for changes and saw this:

Quote:
Version 10.43 16-February-2024
------------------------------

There are quite a lot of changes in this release (see ChangeLog and Git log for
a list). Those that are not bugfixes or code tidies
...

* Added support for limited-length variable-length lookbehind assertions, with
a default maximum length of 255 characters (same as Perl) but with a function
to adjust the limit.
And the version of PCRE2 in Sigil 2.4.2 is version 10.44 so support for limited length lookbehind assertions should be in Sigil 2.4.2.

Are you using an assertion properly? A more specific error message might help if you can get one.
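To make the idea of a limited-length lookbehind concrete: a negative lookbehind over `\w{1,N}` is logically the same as N separate fixed-width negative lookbehinds, one per word length. The sketch below builds that expansion in Python, whose standard `re` module only accepts fixed-width lookbehinds. `bounded_negative_lookbehind` is a hypothetical helper. This illustrates the equivalence only; it is not a description of how PCRE2 implements the feature.

```python
import re


def bounded_negative_lookbehind(max_word_len):
    """Build the expanded pattern from fixed-width lookbehinds only."""
    # One guard per possible word length; every guard has a fixed width.
    guards = ''.join(
        r'(?<!”\s\w{%d}\s)' % n for n in range(1, max_word_len + 1)
    )
    return re.compile(guards + r'said\. “')


pattern = bounded_negative_lookbehind(10)
samples = [
    '” Jack said. “',                # closing quote + word: excluded
    'He turned as Charles said. “',  # no closing quote: matched
]
results = [bool(pattern.search(s)) for s in samples]
print(results)
```

Because the range is bounded, a word longer than the chosen maximum is not excluded, so the limit should be at least as long as the longest speaker name expected in the text.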
Today, 11:58 AM   #9
Doitsu
Grand Sorcerer
Posts: 5,683
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
@ElMiko A long time ago I created a throw-away Sigil regex tester validation plugin that should theoretically work for your regex.
After the installation you'll find the plugin under Plugins > Validation > RegexTester. (You'll need to select the "regex" engine.)

In my test case:

Code:
<p>Lorem “ipsum dolor” Jack said. “</p>
<p>Lorem ipsum dolor said. “</p>
<p>Dolor amet said. “</p>
it only selected the second and third paragraphs when I searched for:
Code:
(?<!”\s\w+\s)said\. “
For more information on the alternative regex Python module see the official documentation.
Today, 04:44 PM   #10
ElMiko
Addict
Quote:
Originally Posted by KevinH View Post
What does the exact error message say? If you mouse over the Find field or the valid-regex symbol, does it show the exact error message?
Try a character range instead of a word range. Did that change the error?
Please see the attached screengrab

I've tried several permutations of the regex:

Code:
\w
\S
\u
\l
\D
[a-z]
.
All show the same error, and return no matches.

@Doitsu - Yeah, I don't know what's going on.
Attached image: SigilRegexError.jpg (56.0 KB)
Today, 05:51 PM   #11
theducks
Well trained by Cats
Posts: 30,911
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
\S is not \s
\S is not a space char
Today, 07:21 PM   #12
ElMiko
Addict
I know that, theducks. That's why I used it. Non-space, followed by min/max range, followed by space. But even if it were a mistake, the point is that the STRUCTURE is being interpreted as invalid.

But also, even if it had been a mistake it wouldn't explain why the other variants aren't working either.

Last edited by ElMiko; Today at 07:26 PM.
Today, 08:16 PM   #13
KevinH
Sigil Developer
It is our long holiday weekend in Canada, but I finally got some time to test things in Sigil on my only laptop up here at my cottage. It is a pre-release version of the forthcoming Sigil v2.5.0.

I decided to test the example cited by one of the issues posted at PCRE2 in that link I posted earlier.

In my xhtml file I have:

Code:
<p> 0xxxy </p>
And as my find I have:

Code:
(?<=0x{1,6})y
And I get no error and it works to find the y in the text.
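The same test can be approximated outside Sigil. Python's standard `re` module rejects `(?<=0x{1,6})` because it only allows fixed-width lookbehinds, so the sketch below expands the bounded range into one lookbehind per length. `bounded_lookbehind_y` is a hypothetical name, and the behaviour shown is that of Python's engine, not of PCRE2 10.44 or 10.45.

```python
import re


def bounded_lookbehind_y(max_x):
    """Match a 'y' preceded by '0' and between 1 and max_x 'x' characters."""
    # (?<=0x{1,N}) rewritten as an alternation of fixed-width lookbehinds.
    alternatives = '|'.join(r'(?<=0x{%d})' % n for n in range(1, max_x + 1))
    return re.compile('(?:' + alternatives + ')y')


pattern = bounded_lookbehind_y(6)
text = '<p> 0xxxy </p>'
match = pattern.search(text)
print(match.start() if match else None)
```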

Would you please try this test with your Sigil 2.4.2 and let me know if you get the same thing? Perhaps there was a bug in PCRE2 10.44 that got fixed in PCRE2 10.45 which is in the upcoming release of Sigil.

I will try Doitsu's test next.

Last edited by KevinH; Today at 08:29 PM.
Today, 08:28 PM   #14
KevinH
Sigil Developer
Okay, I tested the following in Sigil 2.5.0 (pre-release):

The xhtml file was:

Code:
  <p>Lorem “ipsum dolor” Jack said. “</p>

  <p>Lorem ipsum dolor said. “</p>

  <p>Dolor amet said. “</p>

  <p> 0xxxy </p>
I had to modify the search string to use a finite (bounded) quantifier:

Code:
(?<!”\s\w{1,6}\s)said\. “
When I run it, Find stops only on the second and third paragraphs.

So as far as I can tell with these examples, all is working.

But again this version of Sigil has a newer version of PCRE2 (10.45) than the version that came in Sigil 2.4.2 (10.44), so since you are seeing something different I would guess that there was a PCRE2 bug in 10.44 that got fixed.

If it is any help, we are hoping to do final updates of the translations this week and will try to make a full release by next weekend if both of us can work it into our schedules.

If you desperately need something immediately, I can generate a CI build of current Sigil master (it will be missing translations in most languages) and make a link available to you.

But please test your Sigil 2.4.2 build and let us know if it fails these very specific tests (i.e., if there was a PCRE2 bug).

Last edited by KevinH; Today at 08:37 PM.
All times are GMT -4. The time now is 09:36 PM.

