#1
Connoisseur
Posts: 74
Karma: 10
Join Date: Jul 2023
Device: none
Problem with .* in regex search
In Sigil 2.5.2 on Windows, if I regex search for .* without the text option, I find the first line in each file. With the text option, the first time I click find, I find from the cursor location to the end of the line, but the next time I get search not found, though count always shows no matches.

#2
Bibliophagist
Posts: 46,181
Karma: 168983734
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Hmmm.... using '.*' as the search string would search the selected files one line at a time. From what you say, it would appear that this is correctly searching each file and returning the first match. As for using the text checkbox, the only time I would use that is to make sure I am not searching inside <...>.
. matches any character (except for line terminators)
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)

#3
Sigil Developer
Posts: 8,761
Karma: 5706256
Join Date: Nov 2009
Device: many
Are you searching in Regex mode?
Under regex options, what exact options are checked? Do you have Dot All set? Do you have Minimal matches set?
Your first behaviour is consistent with no Minimal match and no Dot All. The second find test will always start where the first one ends. And the text box will always search from the cursor, skipping past any tags. That way it only detects text outside of tags. Its results depend on the starting point, the tags that exist and where they are located, and of course the Regex Options that you set.
In my testing, it all works as expected given the regex options. Please provide a sample html and all your exact find settings, indicating what you get versus what you expect to see, so we can recreate what you are seeing.

#4
Connoisseur
Quote:
[search_entries]
1\Name=Unnamed Search
1\Find=.*
1\Replace=
1\Controls=RX DN AH
size=1
What I expect is to find the next line each time I click on find. What I get is I find the first line in each html file. It happens on every epub I've tried. I will attach one.

#5
Sigil Developer
Yes, this must be something the new PCRE2 module changed. I will look into it.
In the interim use:
.+
which will find all non-empty lines.
Or turn on the Minimal Match flag and the DotAll flag in Regex options, then search for the following:
.*\n
It will return each line (empty or not) and its ending line feed.
Somehow PCRE is not advancing its internal search position when the search string is ".*" because a zero length string is also a match for this case. Very strange.
Last edited by KevinH; 06-23-2025 at 08:23 PM.
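To illustrate why a zero-length match matters here, the following is a minimal sketch using Python's built-in re module, not Sigil's PCRE2 engine, so it only shows the general idea and does not reproduce the stuck search position described above. The sample text is made up for the example.

```python
import re

# Hypothetical sample text, not taken from the attached epub.
text = "first line\nsecond line"

# .* may match zero characters, so alongside each line it also
# reports empty matches (for example at the end of a line).
star_matches = re.findall(r".*", text)

# .+ requires at least one character, so only non-empty lines come back.
plus_matches = re.findall(r".+", text)

print(star_matches)
print(plus_matches)
```

The empty strings in the first list are the zero-length matches that a search loop has to step past explicitly. The second pattern never produces them.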

#6
Sigil Developer
I am not sure now that this is a bug in PCRE2 or Sigil.
Sigil's internal PCRE2 always runs in what PCRE2 calls multi-line mode, and if you turn off DOTALL then you are forcing it into single line mode, as it cannot go past the end of line (newline) char when DOTALL is off.
To do what you want in regex multi-line mode (the default for Sigil search), and to make sure you get the full text of each line, including the last line even if it does not end in a newline, the following regex works just fine in PCRE2 multi-line mode:
^.*\n?
But please be aware that newlines can be inside of tags themselves, not just in the text in between. So searching line by line is possible but not a good idea in general when processing multi-line text like xhtml. Typically when in multi-line mode you set DOTALL to be true so that newline characters can be treated just like any other character when matching.
I am still not sure. The behaviour is strange, but given Sigil's PCRE2 is hard coded to multi-line mode, it appears to be more a limitation than a bug. I will keep digging.
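As a rough check of the ^.*\n? idea, here is a small sketch in Python's re module with the MULTILINE flag. This is an assumption-laden stand-in for PCRE2's multi-line mode: the two engines differ in details, such as how ^ behaves after a trailing newline, so the made-up sample text deliberately does not end in one.

```python
import re

# Hypothetical sample: a line, an empty line, and a last line
# that has no trailing newline.
text = "abc\n\ndef"

# ^ anchors at the start of every line under MULTILINE,
# .* takes the rest of the line, \n? takes the newline if present.
lines = re.findall(r"^.*\n?", text, flags=re.MULTILINE)

print(lines)
```

Each match is one full line including its line ending, so joining the matches gives back the original text.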

#7
Bibliophagist
I think the behaviour of Sigil's regex in returning the first line of each file is correct if the global flag is not set. If I recall correctly—which may be iffy given the number of RegEx flavours I've played with—if global is not set, the search will return the first instance, if it is set, it will return all instances. So in this case, returning the first line of each file would be correct if global is not set.
Last edited by DNSB; 06-24-2025 at 02:10 AM.

#8
Connoisseur
(?<=\n).* and .*(?=\n) also don't work, though (?<=\n).* finds the second line of each file. (.*)\n works well enough for my purpose. I am editing OCRed text where tables were converted as plain text and wanted to put <tr> </tr> around each line.
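For anyone doing the same table clean-up outside Sigil, this is a minimal sketch of the (.*)\n approach in Python's re module. The sample rows are invented, and the replacement syntax shown is Python's, which is not necessarily what Sigil's Replace field expects.

```python
import re

# Hypothetical OCRed table rows, one row per line.
text = "Apples 3 1.20\nPears 5 2.75\n"

# Capture everything up to each newline and wrap it in <tr> </tr>,
# keeping the newline so the line structure is preserved.
wrapped = re.sub(r"(.*)\n", r"<tr>\1</tr>\n", text)

print(wrapped)
```

Because the pattern requires a newline, a final line with no line ending would be left unwrapped, and an empty line would turn into an empty row.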

#9
Bibliophagist
I have some OCRed files to look at starting at some time in the next week. The author recovered her rights to the books but the publisher no longer has any of her submitted files or, perhaps, does not want to offer any help. Due to a computer crash years back, she no longer has the files either. What she has is scanned copies of the pages stored as a multi-page TIFF file for each section which I am going to clean up and convert into ePubs. The scans look clean so I'm hopeful that it's not going to be a total PITA.
Ah well, it'll pay for my ebook addiction for a few months.

#10
Sigil Developer
Okay, I have spent a long time looking at this, and it seems the problem is related to internal PCRE2 settings/optimizations that happen when DOTSTAR is used to start a regular expression search pattern.
It appears you can turn off this behaviour by prefacing that expression as follows:
(*NOTEMPTY)
which tells PCRE2 not to return any empty matches. So please try the following:
(*NOTEMPTY).*
as your search pattern. It should give you what you want.
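Python's re module has no (*NOTEMPTY) verb, so the sketch below only approximates the effect by discarding zero-length matches after the fact. The helper name find_non_empty is made up for this example. Note the difference in approach: PCRE2 rejects the empty match inside the engine and keeps looking for a non-empty one, whereas this simply filters results. For a pattern like .* the outcome should be the same.

```python
import re

def find_non_empty(pattern, text):
    """Return all matches of pattern in text, skipping zero-length ones."""
    return [m.group(0) for m in re.finditer(pattern, text) if m.group(0)]

# Hypothetical sample with an empty line in the middle.
sample = "one\n\ntwo"

print(find_non_empty(r".*", sample))
```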

#11
Sigil Developer
I found a way to add PCRE2_NOTEMPTY as an option flag for the pcre2 match routine and hard code it in the Sigil code.
In that way Sigil search would default to what you expected (and what most people would) from the beginning. I just need to make sure that this change does not end up breaking anything else.
Last edited by KevinH; 06-24-2025 at 05:23 PM.

#12
Sigil Developer
Quote: