12-11-2010, 05:09 AM   #1
kiwidude
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Regular expression for matching div tags?

Hi all,

Spent way too much time on this without success so hopefully a regex guru can help me.

I have an xhtml document in Sigil that has a lot of nasty formatting that I want to remove. Specifically it has a series of <div> tags surrounding sets of paragraphs.

I have been trying to do a find/replace and the issue I have is trying to do a "non-greedy" match. The text looks like the following (it does not nest div tags):
Code:
 <div class="s4">
    <p class="calibre4">Blah blah</p>
    <p class="calibre4">Blah blah</p>
    <p class="calibre4">Blah blah</p>
 </div>
 <div class="s6">
    <p class="calibre4">Blah blah</p>
 </div>
 <div class="s4">
    <p class="calibre4">Blah blah</p>
 </div>
Now let's say I am only interested in selecting the <div class="s4"> blocks and stripping their outer div tags.

What regex should I use? I've looked into negative lookaheads as well as non-greedy matches, but my head hurts from lack of success. At its simplest, I had hoped I could use something like:
Find: <div class="s4">(.*?)</div>
Replace: \1

However, that doesn't work. Could someone please suggest something? Worst case, I will just remove the class from the div tags so that it does nothing, but it has now reached the point where letting it completely beat me would insult my pride.
12-11-2010, 05:26 AM   #2
RobW
Rob Wheeler (Kent, UK)
Posts: 13
Karma: 50000
Join Date: Oct 2010
Location: Kent, UK
Device: Sony PRS-650
I am new to the forum and only just noticed your post. The pattern you quote works fine, but regex engines vary somewhat. I tried yours out in my editor, Editpro, and it worked, and I'm pretty sure it would work under Perl. Somehow you need to flag the pattern as being 'multi-line', so that . can match across line breaks. I don't know whether Sigil has the facility. RobW
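
As a rough illustration outside Sigil, here is a minimal Python sketch of why the flag matters. Python's `re` module is only a stand-in for whatever engine Sigil or Editpro uses, and in Python the relevant flag is called `re.DOTALL` (its `re.MULTILINE` flag does something different). The sample markup is trimmed from the first post; the names `SAMPLE` and `PATTERN` are made up for this example.

```python
import re

# Sample markup trimmed from the first post (the divs are not nested).
SAMPLE = """<div class="s4">
    <p class="calibre4">Blah blah</p>
    <p class="calibre4">Blah blah</p>
</div>
<div class="s6">
    <p class="calibre4">Blah blah</p>
</div>
<div class="s4">
    <p class="calibre4">Blah blah</p>
</div>
"""

PATTERN = r'<div class="s4">(.*?)</div>'

# Without a flag, "." stops at line breaks, so a block spanning lines never matches.
no_flag = re.findall(PATTERN, SAMPLE)

# With re.DOTALL, "." also matches newlines and both s4 blocks are found.
with_flag = re.findall(PATTERN, SAMPLE, flags=re.DOTALL)

# Strip the outer s4 div tags while keeping their contents.
stripped = re.sub(PATTERN, r"\1", SAMPLE, flags=re.DOTALL)
print(stripped)
```

The s6 block is left alone because the pattern names the class explicitly.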
12-11-2010, 05:35 AM   #3
TheGreatGig
Junior Member
Posts: 4
Karma: 48
Join Date: Dec 2010
Device: none, yet
Why the question mark?

Find: <div class="s4">(.*)</div>
Replace: \1

(with minimal matching)
12-11-2010, 05:50 AM   #4
kiwidude
Calibre Plugins Developer
Ahhh brilliant, thanks to both of you. RobW - yeah I was starting to wonder if it was something about how Sigil was using Regex, which got me looking elsewhere on the dialog to find the "minimal matching" checkbox which my brain had completely ignored until now.

And thanks to TheGreatGig for then confirming what I was about to try. The question mark was to request a non-greedy match, which I believe is "normal" regular expression syntax. I did not realise until just now that Sigil had this encapsulated instead in a simple checkbox.

Job done, thank you both.
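
For reference, the difference between greedy and minimal matching can be sketched in Python. This assumes that Python's `(.*?)` behaves like `(.*)` with Sigil's "minimal matching" checkbox ticked; the one-line `TEXT` sample is invented for the illustration.

```python
import re

# Three sibling divs on one line, so no line-break flag is needed here.
TEXT = '<div class="s4">A</div><div class="s6">B</div><div class="s4">C</div>'

# Greedy: ".*" runs to the LAST </div>, swallowing everything in between.
greedy = re.findall(r'<div class="s4">(.*)</div>', TEXT)

# Minimal (non-greedy): ".*?" stops at the FIRST </div> after each opening tag.
minimal = re.findall(r'<div class="s4">(.*?)</div>', TEXT)

print(greedy)   # ['A</div><div class="s6">B</div><div class="s4">C']
print(minimal)  # ['A', 'C']
```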
12-11-2010, 06:25 AM   #5
Ahmad Samir
Zealot
Posts: 114
Karma: 5246
Join Date: Jul 2010
Device: none
http://web.sigil.googlecode.com/hg/s...xpression-mode
12-11-2010, 08:22 AM   #6
theducks
Well trained by Cats
Posts: 30,362
Karma: 58053698
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Not 100% tested

My experience is that "Tidy" finds the "extra" closing tag and deletes it automagically.
I have deleted the opening tag... Presto, the closing tag is gone when you force a refresh (CV<->BV, i.e. switching between Code View and Book View).

Again, not 100% tested against all cases. This seems to work over one paragraph or the entire document.
12-11-2010, 11:28 AM   #7
Perkin
Guru
Posts: 657
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
I think the best way would be to do as theducks said:
search for <div class="s4"> and replace it with an empty string, and Sigil will remove the corresponding closing tag. I also think the search from the 3rd post wouldn't take into account things like having one div embedded inside a class="s4" div.
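
The nesting problem can be reproduced with a small Python sketch. Python's `re` module is used as a stand-in for Sigil's engine, and the `NESTED` sample with its `inner` class is invented for the illustration.

```python
import re

# An s4 div that contains another div (hypothetical markup, not from the book).
NESTED = (
    '<div class="s4">'
    '<p>before</p>'
    '<div class="inner"><p>inside</p></div>'
    '<p>after</p>'
    '</div>'
)

# The minimal match ends at the FIRST </div>, which belongs to the inner div.
result = re.sub(r'<div class="s4">(.*?)</div>', r"\1", NESTED, flags=re.DOTALL)
print(result)
# <p>before</p><div class="inner"><p>inside</p><p>after</p></div>
```

The outer opening tag and the inner closing tag are the ones removed, so the "after" paragraph ends up inside the inner div. A regex alone cannot pair the tags correctly once nesting is involved.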
12-11-2010, 12:08 PM   #8
Jellby
frumious Bandersnatch
Posts: 7,531
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
It's sometimes easier to first replace some things with single characters that are not used anywhere else (¬ and | are likely), and then do further regex work with them, because negative patterns are easier with single characters.

For instance, if you first replace every <i> with ¬ and every </i> with |, you can now find nested italics markup with "¬[^|]*¬".
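
The placeholder trick above can be sketched in Python as follows. The function name `find_nested_italics` and the guard against placeholders already present in the text are additions for this example, not part of the original suggestion.

```python
import re

OPEN, CLOSE = "¬", "|"  # single-character stand-ins for <i> and </i>


def find_nested_italics(html):
    """Return the spans where an <i> opens before the previous one has closed."""
    # The trick only works if the placeholders do not already occur in the text.
    if OPEN in html or CLOSE in html:
        raise ValueError("placeholder character already present in the text")
    marked = html.replace("<i>", OPEN).replace("</i>", CLOSE)
    # An opening marker, then anything except a closing marker, then another
    # opening marker. Inside [...] the "|" is a literal character.
    return re.findall(r"¬[^|]*¬", marked)


print(find_nested_italics("a <i>one <i>two</i> three</i> b <i>ok</i>"))  # ['¬one ¬']
print(find_nested_italics("<i>a</i> and <i>b</i>"))                      # []
```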
12-11-2010, 12:45 PM   #9
Perkin
Guru
The | character is used by regex itself, as an 'or', so use something different.
12-11-2010, 01:15 PM   #10
kiwidude
Calibre Plugins Developer
I was fortunate that, as stated above, the div tags were not nested, so using that checkbox did what I wanted. Thanks also for the heads up on the "auto tag cleanup" possibility too. I'm very new to Sigil, so I'm just finding out some of its "tricks" (and unexpected quirks sometimes because of them).

How often does Sigil get released/updated? Like any other software, there are a bunch of minor things that either annoy by omission or behave in a way that means a lot of repeated keyboard/mouse swapping actions I know could be streamlined. Is it worth diving into the source to hack around, or should I just be patient?
12-12-2010, 04:25 AM   #11
Jellby
frumious Bandersnatch
Quote:
Originally Posted by Perkin View Post
The | character is used by regex itself, as an 'or', so use something different
Anyway, it can be referred to in regex with \| if needed (sometimes the backslash is not needed inside the brackets). Many-character expressions are not so easy to exclude, at least in the regex dialects I've seen.
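
A short Python sketch of both points. The escaping rules shown are Python's; other dialects may differ. The negative-lookahead pattern is one common way to exclude a many-character string in engines that support lookahead, and it is not assumed here that Sigil's engine does. All variable names are made up for the example.

```python
import re

# Outside brackets "|" means alternation, so a literal bar needs a backslash.
literal_bar = re.findall(r"a\|b", "a|b ab")

# Inside brackets "|" is already literal, no backslash required.
bar_in_class = re.findall(r"[|]", "x|y|z")

# Excluding a many-character string: consume one character at a time, but only
# while the text ahead is not "</i>". This finds nested <i> without placeholders.
NESTED_ITALICS = re.compile(r"<i>(?:(?!</i>).)*<i>", re.DOTALL)

nested = NESTED_ITALICS.findall("a <i>one <i>two</i> three</i>")
flat = NESTED_ITALICS.findall("<i>a</i> and <i>b</i>")

print(literal_bar)   # ['a|b']
print(bar_in_class)  # ['|', '|']
print(nested)        # ['<i>one <i>']
print(flat)          # []
```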
12-12-2010, 06:11 AM   #12
Ahmad Samir
Zealot
Quote:
Originally Posted by kiwidude View Post
I was fortunate that, as stated above, the div tags were not nested, so using that checkbox did what I wanted. Thanks also for the heads up on the "auto tag cleanup" possibility too. I'm very new to Sigil, so I'm just finding out some of its "tricks" (and unexpected quirks sometimes because of them).

How often does Sigil get released/updated? Like any other software, there are a bunch of minor things that either annoy by omission or behave in a way that means a lot of repeated keyboard/mouse swapping actions I know could be streamlined. Is it worth diving into the source to hack around, or should I just be patient?
I'd say often; check the changelog: https://sigil.googlecode.com/hg/ChangeLog.txt