Regex help: Find instances spanning several paragraphs

Vanguard3000 · 05-09-2023, 05:49 PM

I've come across this on several occasions and was curious if there's a good regex for it. A non-specific typical example would be to select an entire <div> or <blockquote>, which could contain any number of paragraphs, such as:

Code:

<div class="foo">
<p>aaa</p>
<p>bbb</p>
[...]
<p>ccc</p>
</div>

Currently, I'd need to do something like:

Code:

<div class="foo">    <p>(.*?)</p>    </div>

then:

Code:

<div class="foo">    <p>(.*?)</p>    <p>(.*?)</p>    </div>

and so on. Ideally I'd like to do:

Code:

<div class="foo">(.*?)</div>

but it doesn't work, I assume due to too much whitespace, returns, etc.

So, I feel like one of the regex settings ought to allow searches to skip whitespace or whatever and this is probably an easy fix, but I'm not sure what it might be.

Also, for the record, I do have the TagMechanic plugin which helps in most situations like this, but in some cases it would be nice for me to be able to iterate through all instances with a regular F&R process.

The_book · 05-10-2023, 08:14 AM

Sigil's regex has a 'dot all' option that lets the dot match all characters, including '\n'. With this option you can search across lines.
But using `<div class="foo">(.*?)</div>` is not a good idea, because there are situations like `div ... div ... /div ... /div`, where the match ends at the first closing tag rather than the one that belongs to the outer div.
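
Editor's sketch: the two points above can be reproduced outside Sigil. This uses Python's `re` module as a stand-in, on the assumption that its `re.DOTALL` flag behaves like Sigil's "dot all" option for this pattern; the sample markup and variable names are made up for illustration.

```python
import re

html = """<div class="foo">
<p>aaa</p>
<p>bbb</p>
<p>ccc</p>
</div>"""

pattern = r'<div class="foo">(.*?)</div>'

# Without DOTALL, '.' cannot cross a newline, so the multi-line block is not found.
no_dotall = re.search(pattern, html)

# With DOTALL, '.' also matches '\n', so the whole block is matched.
with_dotall = re.search(pattern, html, re.DOTALL)

# Nested case: the lazy match stops at the FIRST </div>,
# which closes the inner div, not the outer "foo" div.
nested = '<div class="foo">\n<div>\n<p>inner</p>\n</div>\n<p>outer</p>\n</div>'
nested_match = re.search(pattern, nested, re.DOTALL)

print(no_dotall)                      # no match object
print(with_dotall.group(0) == html)   # whole block matched
print(nested_match.group(0))          # cut off before <p>outer</p>
```

In the nested case the match is well-formed text but the wrong block, which is why stepping through hits one at a time is the safer way to use this pattern.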

Turtle91 · 05-10-2023, 08:21 AM

Yes, the dot all option works nicely in sigil.

Find: <div class="foo">\s*(.*?)\s*</div>

I add the \s* to get any space between the div tags and the paragraphs...I don't want them replicated in addition to whatever spacing I add in the Replace: line.

As The_book mentioned, be aware of nested divs...this will capture anything inside your foo div.
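
Editor's sketch: the effect of the extra `\s*` can be seen by comparing what the capture group holds with and without it. Python's `re` is used here as an approximation of Sigil's engine, and the replacement with `<blockquote>` is only an illustrative choice, not something taken from the post.

```python
import re

html = '<div class="foo">\n  <p>aaa</p>\n  <p>bbb</p>\n</div>'

# Without \s*, the group keeps the leading and trailing whitespace.
loose = re.search(r'<div class="foo">(.*?)</div>', html, re.DOTALL)

# With \s* on both sides, that whitespace is matched outside the group.
trimmed = re.search(r'<div class="foo">\s*(.*?)\s*</div>', html, re.DOTALL)

print(repr(loose.group(1)))
print(repr(trimmed.group(1)))

# The replacement can then add its own spacing without doubling it up.
result = re.sub(
    r'<div class="foo">\s*(.*?)\s*</div>',
    r'<blockquote>\n\1\n</blockquote>',
    html,
    flags=re.DOTALL,
)
print(result)
```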

Vanguard3000 · 05-10-2023, 11:13 AM

Thanks, the dot all option seems to be what I was looking for. I think I had those regex settings largely the way they were by trial and error after getting some weirdness finding two <i></i> blocks in the same paragraph (though that's probably a minimal match issue).

As I mentioned, with regexes encompassing larger potential hits like this I typically iterate through them one at a time since as you say there is a lot of room for error. So between the hammer (TagMechanic) and the scalpel (dotall search) I think I should have my bases covered.

Thanks to you both!
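
Editor's sketch: the "two <i></i> blocks in the same paragraph" weirdness mentioned above is most likely the difference between a greedy `.*` and a minimal (lazy) `.*?`. A small Python illustration, with a made-up sample paragraph:

```python
import re

para = '<p>One <i>first</i> and <i>second</i> here.</p>'

# Greedy: '.*' runs to the LAST </i>, swallowing both blocks in one hit.
greedy = re.findall(r'<i>(.*)</i>', para)

# Lazy (minimal match): '.*?' stops at the nearest </i>, giving two hits.
lazy = re.findall(r'<i>(.*?)</i>', para)

print(greedy)
print(lazy)
```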

DiapDealer · 05-10-2023, 11:22 AM

Quote:

Originally Posted by Vanguard3000

So between the hammer (TagMechanic) and the scalpel (dotall search) I think I should have my bases covered.

As the author of TagMechanic, it is of course my opinion that you have those labels reversed!

But I'm not mad.

Vanguard3000 · 05-10-2023, 12:03 PM

How so? Unless I'm overlooking a feature, the only way to limit what it finds is by selecting fewer files, right? So if I wanted to (for example) change all <div class="foo"></div> to blockquotes, it's all-or-nothing, at least within the files I've selected?

Whereas, with a dotall search I can do a find without replacing to be taken to a hit, replace, and immediately verify the results. I can also add to the regex if I want to do something like altering the line immediately below the div block.

Don't get me wrong - your TagMechanic plugin has been a massive help for me! I've recently embarked on a massive library overhaul and it's expedited this process immensely.

DiapDealer · 05-10-2023, 02:46 PM

TagMechanic parses html in order to edit, delete, and modify tags. It knows which closing tag goes with which opening tag so that it can make correct edits to snarls of nested tags. It will not get confused in nested situations. Nor will it get "greedy" as regex is prone to do. TagMechanic is a subtle, HTML-aware tool that can be used with precision to change, clean up or delete html tags. Regex is a blunt instrument that knows nothing about the markup it is trying to match/replace.

I like regex--use it all the time. But that doesn't change the fact that turning it loose on the kind of xhtml that needs to be properly parsed to be safely edited is like throwing a bag of hammers at a box of nails and a pile of lumber and hoping a chair gets made.

I learned a long time ago to use regex where it makes sense (and there's tons of places it does). But don't try to use it to parse markup. It will eventually let you down.

I'm not saying that regex might not be the better choice for your situation. I'm just saying that a precision tool used to make smart, safe, changes to convoluted/nested html cannot be accurately described as a "hammer."
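
Editor's sketch: to illustrate what "knows which closing tag goes with which opening tag" means in practice, here is a minimal tag-aware matcher built on Python's standard `html.parser`. It is not how TagMechanic is implemented; the class name `DivSpanFinder` and its attributes are hypothetical, and it only tracks `<div>` nesting depth to pair each `<div class="foo">` with its own `</div>`.

```python
from html.parser import HTMLParser


class DivSpanFinder(HTMLParser):
    """Collect (start, end) offsets of each <div class="foo">...</div>, nesting-aware."""

    def __init__(self, source):
        super().__init__()
        self.source = source
        # Offset of the first character of every line, so the (line, column)
        # pair from getpos() can be turned into an absolute string offset.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == '\n':
                self.line_starts.append(i + 1)
        self.open_divs = []  # one entry per open <div>: start offset, or None
        self.spans = []
        self.feed(source)
        self.close()

    def _abs_pos(self):
        line, col = self.getpos()  # line is 1-based, col is 0-based
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            classes = (dict(attrs).get('class') or '').split()
            # Remember where a "foo" div starts; push None for any other div.
            self.open_divs.append(self._abs_pos() if 'foo' in classes else None)

    def handle_endtag(self, tag):
        if tag == 'div' and self.open_divs:
            start = self.open_divs.pop()  # the opener this </div> belongs to
            if start is not None:
                end = self.source.index('>', self._abs_pos()) + 1
                self.spans.append((start, end))


nested = '<div class="foo">\n<div>\n<p>inner</p>\n</div>\n<p>outer</p>\n</div>'
spans = DivSpanFinder(nested).spans
for start, end in spans:
    print(nested[start:end])
```

On the nested sample, the single span covers the whole outer block including `<p>outer</p>`, which is the case where the lazy dot-all regex stops too early.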



