Old 06-23-2025, 12:11 AM   #1
jwes
Connoisseur
Posts: 74
Karma: 10
Join Date: Jul 2023
Device: none
Problem with .* in regex search

In Sigil 2.5.2 on Windows, if I regex search for .* without the Text option, Find matches only the first line of each file. With the Text option, the first click on Find matches from the cursor location to the end of the line, but the next click reports that the search was not found. Count always shows no matches.
Old 06-23-2025, 01:54 AM   #2
DNSB
Bibliophagist
Posts: 46,181
Karma: 168983734
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Hmmm.... using '.*' as the search string would search the selected files one line at a time. From what you say, it would appear that this is correctly searching each file and returning the first match. As for using the text checkbox, the only time I would use that is to make sure I am not searching inside <...>.

. matches any character (except for line terminators)

* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
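
To make those two definitions concrete, here is a small sketch in Python. One assumption to flag: Python's re module is not the PCRE2 engine that Sigil uses, so this only illustrates the general meaning of . and * and does not reproduce Sigil's search.

```python
import re

text = "first line\nsecond line"

# Without DOTALL, "." stops at the line terminator, so the greedy ".*"
# grabs exactly the first line.
first_line = re.search(r".*", text).group()

# With DOTALL, "." also matches "\n", so the same pattern swallows everything.
whole = re.search(r".*", text, re.DOTALL).group()

# "*" allows zero repetitions, so ".*" also succeeds with an empty match
# when the search starts right in front of a newline.
empty = re.search(r".*", "\nabc").group()

print(repr(first_line), repr(whole), repr(empty))
```

The last case matters for this thread: an empty string is a perfectly valid match for .*.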
Old 06-23-2025, 09:06 AM   #3
KevinH
Sigil Developer
Posts: 8,761
Karma: 5706256
Join Date: Nov 2009
Device: many
Are you searching in Regex mode?
Under regex options, what exact options are checked? Do you have Dot All set? Do you have Minimal matches set?

Your first behaviour is consistent with no Minimal match and no Dot All.

The second find test will always start where the first one ends.

And the Text checkbox will always search from the cursor, skipping past any tags. That way it only detects text outside of tags. Its results depend on the starting point, on which tags exist and where they are located, and of course on the Regex Options that you set.


In my testing, it all works as expected given the regex options.

Please provide a sample html, and all your exact find settings indicating what you get versus what you expect to see, so we can recreate what you are seeing.
Old 06-23-2025, 04:43 PM   #4
jwes
Here is what I get from saving and exporting the search:
[search_entries]
1\Name=Unnamed Search
1\Find=.*
1\Replace=
1\Controls=RX DN AH
size=1

What I expect is that each click on Find matches the next line. What I get is a match on the first line of each HTML file and nothing more.
It happens on every epub I've tried. I will attach one.
Attached Files
File Type: epub hilaire-belloc_the-path-to-rome.epub (18.56 MB, 9 views)
Old 06-23-2025, 07:19 PM   #5
KevinH
Yes, this must be something the new pcre2 module changed.

I will look into it.

In the interim use:
.+

which will find all non-empty lines

Or turn on the Minimal Match flag and the DotAll flag in Regex options, then search for the following:

.*\n

it will return each line (empty or not) and its ending line feed.
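
As a rough illustration of that second workaround, here is a Python sketch. Assumptions: Python's re has no separate Minimal Match flag, so the lazy quantifier *? stands in for it, and re.DOTALL stands in for Sigil's DotAll option. The sample text is made up.

```python
import re

text = "ab\n\ncd\n"

# Lazy ".*?" stops at the first newline it can reach, even though DOTALL
# would let "." run across newlines.
lines = re.findall(r".*?\n", text, re.DOTALL)

print(lines)
```

Each match includes its trailing line feed, and the empty middle line is returned as a bare "\n".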

Somehow PCRE2 is not advancing its internal search position when the search string is ".*", because a zero-length string is also a match in this case.

Very strange.
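
The effect described above can be imitated with a naive "find next" loop. This is only a sketch in Python's re, not Sigil's or PCRE2's actual code, and the helper name find_all_naive is hypothetical. It shows how a zero-length match leaves the search offset where it was.

```python
import re

def find_all_naive(pattern, text, limit=10):
    """Repeated 'Find Next': each search starts where the previous match ended."""
    rx = re.compile(pattern)
    pos, hits = 0, []
    for _ in range(limit):  # guard so a stuck search cannot loop forever
        m = rx.search(text, pos)
        if m is None:
            break
        hits.append((m.start(), m.end(), m.group()))
        pos = m.end()  # an empty match leaves pos unchanged
    return hits

text = "ab\ncd\n"
stuck = find_all_naive(r".*", text, limit=4)  # first line, then empty matches at offset 2
ok = find_all_naive(r".+", text)              # ".+" cannot match empty, so it moves on

print(stuck)
print(ok)
```

With .* the loop finds "ab" and then keeps returning an empty match in front of the first newline. With .+ it steps over the newline and finds "cd".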

Last edited by KevinH; 06-23-2025 at 08:23 PM.
Old 06-23-2025, 09:35 PM   #6
KevinH
I am not sure now whether this is a bug in PCRE2 or in Sigil.

Sigil's internal PCRE2 always runs in what PCRE2 calls multi-line mode, and if you turn off DOTALL you effectively confine each match to a single line, because . cannot go past the end-of-line (newline) character when DOTALL is off.

To do what you want in regex multi-line mode (the default for Sigil search), and to make sure you get the full text of each line, including the last line even if it does not end in a newline, the following regex works just fine in PCRE2 multi-line mode:

^.*\n?
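
For anyone who wants to see that pattern in action outside Sigil, here is a minimal Python sketch. Assumptions: re.MULTILINE stands in for PCRE2's multi-line mode, and the sample text (which deliberately lacks a final newline) is made up. Python and PCRE2 can differ in edge cases, such as where ^ matches after a trailing newline, so treat it as an illustration only.

```python
import re

text = "ab\n\ncd"  # note: the last line has no trailing newline

# "^" anchors at the start of each line, ".*" takes the line's text,
# and "\n?" picks up the line ending when there is one.
lines = re.findall(r"^.*\n?", text, re.MULTILINE)

print(lines)
```

The empty middle line comes back as "\n" and the unterminated last line as "cd".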

But please be aware that newlines can occur inside tags themselves, not just in the text between them. So searching line by line is possible, but not a good idea in general when processing multi-line text like XHTML.

Typically when in multi-line mode you set DOTALL to true so that newline characters are treated just like any other character when matching.

I am still not sure. The behaviour is strange but given Sigil's PCRE2 is hard coded to multi-line mode, it appears to be more a limitation than a bug.

I will keep digging.
Old 06-24-2025, 02:08 AM   #7
DNSB
I think the behaviour of Sigil's regex in returning the first line of each file is correct if the global flag is not set. If I recall correctly—which may be iffy given the number of RegEx flavours I've played with—if global is not set, the search will return the first instance, if it is set, it will return all instances. So in this case, returning the first line of each file would be correct if global is not set.

Last edited by DNSB; 06-24-2025 at 02:10 AM.
Old 06-24-2025, 02:25 AM   #8
jwes
(?<=\n).* and .*(?=\n) also don't work, though (?<=\n).* finds the second line of each file. (.*)\n works well enough for my purpose. I am editing OCRed text where tables were converted to plain text, and I wanted to put <tr> </tr> around each line.
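
The same (.*)\n idea, sketched as a one-off Python script for anyone who prefers to do the wrapping outside Sigil. Assumptions: the sample rows are made up, and the \1 back-reference is Python's re.sub syntax, which I have not checked against Sigil's Replace field.

```python
import re

# Hypothetical OCR output: one table row per line, cells not yet split.
ocr_rows = "Rome 1901 12\nToul 1902 7\n"

# Capture the text of each line and put it back between <tr> and </tr>,
# keeping the line feed after the closing tag.
wrapped = re.sub(r"(.*)\n", r"<tr>\1</tr>\n", ocr_rows)

print(wrapped)
```

Only lines that end in a newline are wrapped, so an unterminated last line would be left untouched.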
Old 06-24-2025, 04:08 PM   #9
DNSB
I have some OCRed files to look at starting at some time in the next week. The author recovered her rights to the books but the publisher no longer has any of her submitted files or, perhaps, does not want to offer any help. Due to a computer crash years back, she no longer has the files either. What she has is scanned copies of the pages stored as a multi-page TIFF file for each section which I am going to clean up and convert into ePubs. The scans look clean so I'm hopeful that it's not going to be a total PITA.

Ah well, it'll pay for my ebook addiction for a few months.
Old 06-24-2025, 04:56 PM   #10
KevinH
Okay, I have spent a long time looking at this, and it seems the problem is related to internal PCRE2 settings/optimizations that apply when DOTSTAR (.*) is used to start a regular expression search pattern.

It appears you can turn off this behaviour by prefacing that expression as follows:

(*NOTEMPTY)

which tells PCRE2 not to return any empty matches.

So please try the following:

(*NOTEMPTY).*

as your search pattern. It should give you what you want.
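
Python's re has no NOTEMPTY option, so the sketch below only emulates the idea by discarding empty matches and retrying one character later. The helper name iter_nonempty is hypothetical. This is a rough approximation that gives the expected result for .* but may differ from PCRE2 for patterns with alternatives, where PCRE2 can try another way to match at the same position.

```python
import re

def iter_nonempty(pattern, text):
    """Yield successive non-empty matches, resuming where the last match ended."""
    rx = re.compile(pattern)
    pos = 0
    while pos <= len(text):
        m = rx.search(text, pos)
        if m is None:
            return
        if m.end() == m.start():
            pos = m.end() + 1  # empty match: discard it and retry one character later
            continue
        yield m
        pos = m.end()

lines = [m.group() for m in iter_nonempty(r".*", "ab\n\ncd\n")]
print(lines)
```

With empty matches rejected, .* behaves like .+ here and returns only the non-empty lines.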
Old 06-24-2025, 05:09 PM   #11
KevinH
I found a way to add PCRE2_NOTEMPTY as an option flag for the pcre2 match routine and hard code it in the Sigil code.

In that way Sigil search would default to what you expected (and what most people would) from the beginning.

I just need to make sure that this change does not end up breaking anything else.

Last edited by KevinH; 06-24-2025 at 05:23 PM.
Old 06-24-2025, 05:32 PM   #12
KevinH
Quote:
Originally Posted by DNSB View Post
I think the behaviour of Sigil's regex in returning the first line of each file is correct if the global flag is not set. If I recall correctly—which may be iffy given the number of RegEx flavours I've played with—if global is not set, the search will return the first instance, if it is set, it will return all instances. So in this case, returning the first line of each file would be correct if global is not set.
The global flag was needed in early regex implementations to make them find and replace all occurrences. With PCRE2 the engine is different: as long as you keep calling match with updated offsets, the search and replace will continue.
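
A minimal sketch of that "keep calling match with updated offsets" loop, written with Python's re rather than PCRE2's C API. The helper replace_all and the sample strings are hypothetical, and the sketch assumes a pattern that never matches an empty string.

```python
import re

def replace_all(pattern, repl, text):
    """Replace every match by searching again from the end of the previous match."""
    rx = re.compile(pattern)
    out, pos = [], 0
    while True:
        m = rx.search(text, pos)
        if m is None:
            break
        if m.end() == m.start():
            # The offset would never advance; see the empty-match discussion above.
            raise ValueError("pattern matched an empty string")
        out.append(text[pos:m.start()])  # text between the matches is kept as is
        out.append(m.expand(repl))       # the replacement for this match
        pos = m.end()                    # updated offset for the next search
    out.append(text[pos:])
    return "".join(out)

result = replace_all(r"colou?r", "COLOR", "color and colour")
print(result)
```

No global flag is involved: the caller's loop and the moving offset are what make the replacement cover every occurrence.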