#1
Groupie
Posts: 166
Karma: 474196
Join Date: Jan 2011
Location: Canada
Device: Kobo Libra 2
|
Regex help: Find instances spanning several paragraphs
I've come across this on several occasions and was curious if there's a good regex for it. A non-specific typical example would be to select an entire <div> or <blockquote>, which could contain any number of paragraphs, such as:
Code:
<div class="foo">
<p>aaa</p>
<p>bbb</p>
[...]
<p>ccc</p>
</div>

Code:
<div class="foo">
<p>(.*?)</p>
</div>

Code:
<div class="foo">
<p>(.*?)</p>
<p>(.*?)</p>
</div>

Code:
<div class="foo">(.*?)</div>

So, I feel like one of the regex settings ought to allow searches to skip whitespace or whatever, and this is probably an easy fix, but I'm not sure what it might be.

Also, for the record, I do have the TagMechanic plugin, which helps in most situations like this, but in some cases it would be nice to be able to iterate through all instances with a regular F&R process.
#2
Zealot
Posts: 100
Karma: 10
Join Date: Aug 2019
Device: none
|
Sigil's regex has a 'dot all' option that lets the dot match all characters, including '\n'. With this option you can search across lines.
But using `<div class="foo">(.*?)</div>` is not a good idea, because there are nested situations like `div ... div ... /div ... /div`, where the match ends at the first `/div` instead of the one that closes the outer div.
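To make both points concrete, here is a minimal sketch in Python. Python's `re` module is only a stand-in here (Sigil's Find & Replace uses its own regex engine), and the sample markup is made up for illustration. `re.DOTALL` plays the role of the 'dot all' option.

```python
import re

# Hypothetical sample markup, not taken from a real book.
html = """<div class="foo">
<p>aaa</p>
<p>bbb</p>
</div>"""

pattern = r'<div class="foo">(.*?)</div>'

# Without DOTALL the dot stops at '\n', so the multi-line div is not found.
no_dotall = re.search(pattern, html)

# With DOTALL the dot also matches '\n', so the match can span lines.
with_dotall = re.search(pattern, html, re.DOTALL)

# Nested divs: the lazy match ends at the FIRST </div>,
# which closes the inner div, not the outer "foo" div.
nested = '<div class="foo"><div><p>inner</p></div><p>outer</p></div>'
nested_match = re.search(pattern, nested, re.DOTALL)

print(no_dotall)
print(repr(with_dotall.group(1)))
print(nested_match.group(0))
```

In the nested case the match stops before `<p>outer</p>`, so a replace would leave an orphaned `</div>` behind.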
#3
A Hairy Wizard
Posts: 3,295
Karma: 20171067
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
Yes, the dot all option works nicely in Sigil.
Find: <div class="foo">\s*(.*?)\s*</div>
I add the \s* to consume any whitespace between the div tags and the paragraphs. I don't want it replicated in addition to whatever spacing I add in the Replace: line. As The_book mentioned, be aware of nested divs: this will capture anything inside your foo div, but only up to the first </div> it finds.
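As a rough illustration of what the `\s*` buys you, here is the same Find pattern run through Python's `re` module (a stand-in, not Sigil itself). The `<blockquote>` replacement and the sample markup are assumptions chosen for the example.

```python
import re

# Hypothetical sample markup with indentation inside the div.
html = """<div class="foo">
  <p>aaa</p>
  <p>bbb</p>
</div>"""

find = r'<div class="foo">\s*(.*?)\s*</div>'
# Illustrative replacement: we supply our own newlines around the captured group.
replace = r'<blockquote>\n\1\n</blockquote>'

# The \s* on each side swallows the whitespace next to the div tags,
# so group 1 starts at the first <p> and ends at the last </p>.
result = re.sub(find, replace, html, flags=re.DOTALL)
print(result)
```

Without the two `\s*` parts, the captured group would begin and end with the original line breaks, and the output would contain doubled blank lines.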
#4
Groupie
Posts: 166
Karma: 474196
Join Date: Jan 2011
Location: Canada
Device: Kobo Libra 2
|
Thanks, the dot all option seems to be what I was looking for. I think I had those regex settings largely the way they were by trial and error after getting some weirdness finding two <i></i> blocks in the same paragraph (though that's probably a minimal match issue).
As I mentioned, with regexes encompassing larger potential hits like this, I typically iterate through them one at a time since, as you say, there is a lot of room for error. So between the hammer (TagMechanic) and the scalpel (dotall search), I think I should have my bases covered. Thanks to you both!
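The two-`<i>`-blocks weirdness mentioned above is the classic greedy versus minimal match difference. A small sketch, again using Python's `re` as a stand-in and a made-up paragraph:

```python
import re

# Hypothetical paragraph with two italic runs.
para = "<p>one <i>first</i> two <i>second</i> three</p>"

# Greedy: .* runs to the LAST </i>, swallowing both blocks in one hit.
greedy = re.findall(r"<i>(.*)</i>", para)

# Minimal (lazy): .*? stops at the FIRST </i>, giving one hit per block.
lazy = re.findall(r"<i>(.*?)</i>", para)

print(greedy)  # ['first</i> two <i>second']
print(lazy)    # ['first', 'second']
```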
#5
Grand Sorcerer
Posts: 28,341
Karma: 203719142
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
#6
Groupie
Posts: 166
Karma: 474196
Join Date: Jan 2011
Location: Canada
Device: Kobo Libra 2
|
How so? Unless I'm overlooking a feature, the only way to limit what it finds is by selecting fewer files, right? So if I wanted to (for example) change all <div class="foo"></div> to blockquotes, it's all-or-nothing, at least within the files I've selected?
Whereas, with a dotall search I can do a find without replacing to be taken to a hit, replace, and immediately verify the results. I can also add to the regex if I want to do something like altering the line immediately below the div block.
Don't get me wrong - your TagMechanic plugin has been a massive help for me! I've recently embarked on a massive library overhaul and it's expedited this process immensely.
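The find, check, then replace workflow described here can be sketched outside Sigil too. The following is only an illustration in Python: the helper names `review_hits` and `replace_selected` are hypothetical, and the `<blockquote>` replacement is the example conversion from this post.

```python
import re

FIND = re.compile(r'<div class="foo">\s*(.*?)\s*</div>', re.DOTALL)

def review_hits(text):
    """Yield (line_number, matched_text) for each hit without changing anything."""
    for m in FIND.finditer(text):
        line_number = text.count("\n", 0, m.start()) + 1
        yield line_number, m.group(0)

def replace_selected(text, approved):
    """Replace only the hits whose 0-based index is in `approved`."""
    index = -1

    def repl(m):
        nonlocal index
        index += 1
        if index in approved:
            return "<blockquote>\n" + m.group(1) + "\n</blockquote>"
        return m.group(0)  # leave unapproved hits untouched

    return FIND.sub(repl, text)

# Hypothetical sample with two foo divs.
sample = ('<div class="foo">\n<p>a</p>\n</div>\n'
          '<p>x</p>\n'
          '<div class="foo">\n<p>b</p>\n</div>')

hits = list(review_hits(sample))
for line_number, text in hits:
    print(line_number, repr(text))

# Suppose only the second hit (index 1) passed review.
result = replace_selected(sample, {1})
print(result)
```

The first div is left alone and only the approved one is converted, which is the per-hit control that an all-or-nothing pass over the selected files does not give you.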
#7
Grand Sorcerer
Posts: 28,341
Karma: 203719142
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
TagMechanic parses html in order to edit, delete, and modify tags. It knows which closing tag goes with which opening tag so that it can make correct edits to snarls of nested tags. It will not get confused in nested situations. Nor will it get "greedy" as regex is prone to do. TagMechanic is a subtle, HTML-aware tool that can be used with precision to change, clean up, or delete html tags. Regex is a blunt instrument that knows nothing about the markup it is trying to match/replace.
I like regex--use it all the time. But that doesn't change the fact that turning it loose on the kind of xhtml that needs to be properly parsed to be safely edited is like throwing a bag of hammers at a box of nails and a pile of lumber and hoping a chair gets made.
I learned a long time ago to use regex where it makes sense (and there are tons of places it does). But don't try to use it to parse markup. It will eventually let you down. I'm not saying that regex might not be the better choice for your situation. I'm just saying that a precision tool used to make smart, safe changes to convoluted/nested html cannot be accurately described as a "hammer."
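To show what "knows which closing tag goes with which opening tag" means in practice, here is a minimal parser-based sketch using Python's standard `html.parser`. This is not TagMechanic's code or its actual method; the class name `FooDivToBlockquote` is hypothetical, and the sketch simply drops doctype declarations and processing instructions.

```python
from html.parser import HTMLParser

class FooDivToBlockquote(HTMLParser):
    """Rename <div class="foo"> to <blockquote>, pairing each </div> by nesting depth."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.div_stack = []  # one bool per open div: True if it was a "foo" div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            is_foo = dict(attrs).get("class") == "foo"
            self.div_stack.append(is_foo)
            if is_foo:
                self.out.append("<blockquote>")
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags never open a nesting level; copy them through.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Pop the stack so each </div> is paired with its own opening tag.
        if tag == "div" and self.div_stack and self.div_stack.pop():
            self.out.append("</blockquote>")
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

# Hypothetical nested markup that trips up the lazy regex.
nested = '<div class="foo"><div><p>inner &amp; more</p></div><p>outer</p></div>'
parser = FooDivToBlockquote()
parser.feed(nested)
parser.close()
result = "".join(parser.out)
print(result)
```

The inner plain div keeps its own `</div>`, and only the outer foo div's closing tag becomes `</blockquote>`, which is the case where `<div class="foo">(.*?)</div>` goes wrong.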
Tags: find & replace, regex