12-11-2010, 05:09 AM | #1 |
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Regular expression for matching div tags?
Hi all,
I've spent way too much time on this without success, so hopefully a regex guru can help me. I have an XHTML document in Sigil with a lot of nasty formatting that I want to remove. Specifically, it has a series of <div> tags surrounding sets of paragraphs. I have been trying to do a find/replace, and the problem is getting a "non-greedy" match. The text looks like the following (the div tags are not nested):
Code:
<div class="s4">
  <p class="calibre4">Blah blah</p>
  <p class="calibre4">Blah blah</p>
  <p class="calibre4">Blah blah</p>
</div>
<div class="s6">
  <p class="calibre4">Blah blah</p>
</div>
<div class="s4">
  <p class="calibre4">Blah blah</p>
</div>
What regex should I use? I've looked into negative lookaheads as well as non-greedy matches, but my head hurts from lack of success. At its simplest, I had hoped I could use something like:
Find: <div class="s4">(.*?)</div>
Replace: \1
However, that doesn't work. Could someone please suggest something? Worst case, I will just remove the class from the div tags so they do nothing, but it has now reached the point of insulting my pride if I let it completely beat me. |
12-11-2010, 05:26 AM | #2 |
Rob Wheeler (Kent, UK)
Posts: 13
Karma: 50000
Join Date: Oct 2010
Location: Kent, UK
Device: Sony PRS-650
|
I am new to the forum and only just noticed your post. The pattern you quote works fine, but regex engines vary somewhat. I tried yours in my editor, Editpro, and it worked, and I'm pretty sure it would work under Perl. Somehow you need to flag the pattern as being 'multi-line'. I don't know whether Sigil has that facility. RobW
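To illustrate the point about flags, here is a minimal sketch in Python rather than Sigil, so the flag name is Python's and is an assumption as far as Sigil goes. In Python the behaviour described here as 'multi-line' is controlled by re.DOTALL, which lets "." match newlines (re.MULTILINE only changes ^ and $). The sample string is made up to resemble the markup in the question.

```python
import re

# Sample modelled on the question: one <div> wrapping paragraphs on separate lines.
html = (
    '<div class="s4">\n'
    '<p class="calibre4">Blah blah</p>\n'
    '<p class="calibre4">Blah blah</p>\n'
    '</div>\n'
)

pattern = r'<div class="s4">(.*?)</div>'

# Without DOTALL, "." cannot cross the newline after the opening tag,
# so the pattern never reaches </div> and there is no match.
no_flag = re.search(pattern, html)

# With DOTALL, "." also matches newlines and the lazy group spans the block.
with_flag = re.search(pattern, html, flags=re.DOTALL)

print(no_flag)
print(with_flag.group(1))
```

So the same pattern can fail or succeed depending only on how the engine treats newlines.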
|
12-11-2010, 05:35 AM | #3 |
Junior Member
Posts: 4
Karma: 48
Join Date: Dec 2010
Device: none, yet
|
Why the question mark?
Find: <div class="s4">(.*)</div> Replace: \1 (with minimal matching) |
12-11-2010, 05:50 AM | #4 |
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Ahhh brilliant, thanks to both of you. RobW - yeah I was starting to wonder if it was something about how Sigil was using Regex, which got me looking elsewhere on the dialog to find the "minimal matching" checkbox which my brain had completely ignored until now.
And thanks to TheGreatGig for then confirming what I was about to try. The question mark was to request a non-greedy match, which I believe is "normal" regular expression syntax. I did not realise until just now that Sigil instead offers this as a simple checkbox. Job done, thank you both. |
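For anyone comparing the two behaviours, a small sketch in Python (not Sigil) of greedy versus non-greedy matching on un-nested divs. The `*?` syntax below is Python's; in Sigil the equivalent here was the "minimal matching" checkbox. The sample markup is invented for the illustration.

```python
import re

# Three un-nested divs on one line, so newline handling does not matter here.
html = (
    '<div class="s4"><p class="calibre4">One</p></div>'
    '<div class="s6"><p class="calibre4">Two</p></div>'
    '<div class="s4"><p class="calibre4">Three</p></div>'
)

# Greedy: (.*) runs to the LAST </div>, so only the first opening tag
# and the final closing tag are removed.
greedy = re.sub(r'<div class="s4">(.*)</div>', r'\1', html)

# Non-greedy: (.*?) stops at the FIRST </div>, so each s4 div is unwrapped
# on its own and the s6 div is left alone.
lazy = re.sub(r'<div class="s4">(.*?)</div>', r'\1', html)

print(greedy)
print(lazy)
```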
12-11-2010, 06:25 AM | #5 |
Zealot
Posts: 114
Karma: 5246
Join Date: Jul 2010
Device: none
|
|
12-11-2010, 08:22 AM | #6 |
Well trained by Cats
Posts: 30,362
Karma: 58053698
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Not 100% tested
My experience is that "Tidy" finds the "extra" closing tag and deletes it auto-magically. I have deleted the opening tag... Presto, the closing tag is gone when you force a refresh (Code View <-> Book View). Again, not 100% tested against all cases. This seems to work over one paragraph or the entire document. |
12-11-2010, 11:28 AM | #7 |
Guru
Posts: 657
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
|
I think the best way would be to do as theducks said:
Search for <div class="s4"> and replace it with an empty string; Sigil will then remove the corresponding closing tag. I also think the search from the 3rd post wouldn't cope with cases such as another div nested inside a class="s4" div. |
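A quick sketch of the nesting problem, again in Python rather than Sigil, with made-up markup: a non-greedy match stops at the first </div> it sees, which belongs to the inner div, not to the class="s4" one.

```python
import re

# Hypothetical nested case: an s6 div sits inside the s4 div.
nested = (
    '<div class="s4">'
    '<p>Before</p>'
    '<div class="s6"><p>Inner</p></div>'
    '<p>After</p>'
    '</div>'
)

# The lazy group ends at the inner div's closing tag, so that tag is the one
# removed and the outer closing tag is left behind.
result = re.sub(r'<div class="s4">(.*?)</div>', r'\1', nested, flags=re.DOTALL)

print(result)
```

The tags still balance, but the s6 div now wraps the "After" paragraph as well, which is not what was intended.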
12-11-2010, 12:08 PM | #8 |
frumious Bandersnatch
Posts: 7,531
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
It's sometimes easier to first replace some things with single characters that are not used anywhere else (¬ and | are likely), and then do further regex work with them, because negative patterns are easier with single characters.
For instance, if you first replace every <i> with ¬ and every </i> with |, you can now find nested italics markup with "¬[^|]*¬". |
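A minimal sketch of the placeholder trick in Python, assuming neither ¬ nor | occurs anywhere in the text (the code checks this first). The sample sentence is invented.

```python
import re

text = 'Plain <i>outer <i>inner</i> still outer</i> plain <i>fine</i>'

OPEN, CLOSE = '¬', '|'

# The trick only works if the placeholders are unused in the document.
if OPEN in text or CLOSE in text:
    raise ValueError('placeholder character already present in text')

# Step 1: swap the multi-character tags for single characters.
marked = text.replace('<i>', OPEN).replace('</i>', CLOSE)

# Step 2: an opening marker followed by another opening marker with no
# closing marker in between means nested italics.
nested = re.findall(r'¬[^|]*¬', marked)

# Step 3: put the real tags back once the regex work is done.
restored = marked.replace(OPEN, '<i>').replace(CLOSE, '</i>')

print(marked)
print(nested)
```

Inside the square brackets the | is taken literally, so the pattern works as written.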
12-11-2010, 12:45 PM | #9 |
Guru
Posts: 657
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
|
The | character is used by regex itself as an 'or' operator, so use something different.
|
12-11-2010, 01:15 PM | #10 |
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
I was fortunate that, as stated above, the div tags were not nested, so using that checkbox did what I wanted. Thanks also for the heads-up on the "auto tag cleanup" possibility; I'm very new to Sigil, so I'm just finding out some of its "tricks" (and sometimes unexpected quirks because of them).
How often does Sigil get released/updated? Like any other software, there are a bunch of minor things that either annoy by omission or behave in a way that means a lot of repeated keyboard/mouse swapping that I know could be streamlined. Is it worth diving into the source to hack around, or should I just be patient? |
12-12-2010, 04:25 AM | #11 |
frumious Bandersnatch
Posts: 7,531
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Anyway, it can be referred to in regex with \| if needed (sometimes the backslash is not needed inside the brackets). Many-character expressions are not so easy to exclude, at least in the regex dialects I've seen.
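A short Python sketch of both points, with invented sample strings. The first part shows the pipe unescaped, escaped, and inside brackets. The second part excludes a multi-character sequence with a negative lookahead, which assumes the regex dialect supports lookahead; not every dialect does.

```python
import re

marked = 'a|b¬c'

# Unescaped, | is alternation: "a|b" means "a" or "b".
alternation = re.findall(r'a|b', marked)

# Escaped, \| matches a literal pipe character.
literal = re.findall(r'a\|b', marked)

# Inside brackets the pipe is already literal, no backslash needed.
in_class = re.findall(r'[|¬]', marked)

# Excluding a multi-character sequence: match any character as long as
# "</i>" does not start at that position, repeated, then another "<i>".
text = 'Plain <i>outer <i>inner</i> still outer</i>'
nested = re.findall(r'<i>(?:(?!</i>).)*<i>', text)

print(alternation, literal, in_class, nested)
```

The lookahead version is noticeably harder to read than the single-character one, which is the argument for placeholders.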
|
12-12-2010, 06:11 AM | #12 | |
Zealot
Posts: 114
Karma: 5246
Join Date: Jul 2010
Device: none
|
Quote:
|
|
|