Old 12-11-2010, 06:09 AM   #1
kiwidude
calibre/Sigil Developer
Posts: 4,230
Karma: 1345754
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
Regular expression for matching div tags?

Hi all,

Spent way too much time on this without success so hopefully a regex guru can help me.

I have an xhtml document in Sigil that has a lot of nasty formatting that I want to remove. Specifically it has a series of <div> tags surrounding sets of paragraphs.

I have been trying to do a find/replace and the issue I have is trying to do a "non-greedy" match. The text looks like the following (it does not nest div tags):
Code:
 <div class="s4">
    <p class="calibre4">Blah blah</p>
    <p class="calibre4">Blah blah</p>
    <p class="calibre4">Blah blah</p>
 </div>
 <div class="s6">
    <p class="calibre4">Blah blah</p>
 </div>
 <div class="s4">
    <p class="calibre4">Blah blah</p>
 </div>
Now let's say I am only interested in selecting the <div class="s4"> blocks and stripping their outer div tags.

What regex should I use? I've looked into negative lookaheads as well as non-greedy matches but my head hurts from lack of success. At its simplest I had hoped I could use something like:
Find: <div class="s4">(.*?)</div>
Replace: \1

However, that doesn't work. Could someone please suggest something? Worst case, I will just remove the class from the div tags so it does nothing, but it has now reached the point of insulting my pride if I let it completely beat me.
Old 12-11-2010, 06:26 AM   #2
RobW
Rob Wheeler (Kent, UK)
Posts: 13
Karma: 50000
Join Date: Oct 2010
Location: Kent, UK
Device: Sony PRS-650
I am new to the forum and only just noticed your post. The pattern you quote works fine, but regex engines vary somewhat. I tried yours out in my editor, Editpro, and it worked, and I'm pretty sure it would work under Perl. Somehow you need to flag the pattern as being 'multi-line'. I don't know whether Sigil has the facility. RobW
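As a rough illustration of the point about flags, here is a minimal sketch using Python's `re` module (an assumption for demonstration only; it is not Sigil's engine, and other tools name the flag differently). In Python the pattern from the first post only spans line breaks when `.` is allowed to match newlines, which is what `re.DOTALL` does. `SAMPLE` is a shortened copy of the markup from the first post.

```python
import re

# Shortened copy of the markup from the first post.
SAMPLE = '''<div class="s4">
    <p class="calibre4">Blah blah</p>
    <p class="calibre4">Blah blah</p>
</div>
<div class="s6">
    <p class="calibre4">Blah blah</p>
</div>
<div class="s4">
    <p class="calibre4">Blah blah</p>
</div>
'''

PATTERN = r'<div class="s4">(.*?)</div>'

# Without DOTALL, "." stops at the first newline, so nothing matches.
no_flag = re.findall(PATTERN, SAMPLE)

# With DOTALL, "." also matches newlines and each s4 block is found.
with_flag = re.findall(PATTERN, SAMPLE, flags=re.DOTALL)

# Replace each s4 block with its inner content (group 1).
stripped = re.sub(PATTERN, r'\1', SAMPLE, flags=re.DOTALL)
print(stripped)
```

In this sketch the `s6` block keeps its tags, and only the two `s4` wrappers are removed.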
Old 12-11-2010, 06:35 AM   #3
TheGreatGig
Junior Member
Posts: 4
Karma: 48
Join Date: Dec 2010
Device: none, yet
Why the question mark?

Find: <div class="s4">(.*)</div>
Replace: \1

(with minimal matching)
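To see what minimal matching changes, here is a small sketch in Python's `re` module (an assumption for illustration; in Python a `?` after the quantifier requests minimal matching, and Sigil's checkbox presumably applies the same idea to the whole pattern). The one-line `SAMPLE` is made up so that newlines do not get in the way.

```python
import re

# Made-up single-line sample: two s4 blocks with an s6 block between them.
SAMPLE = (
    '<div class="s4"><p>one</p></div>'
    '<div class="s6"><p>two</p></div>'
    '<div class="s4"><p>three</p></div>'
)

# Greedy: ".*" runs to the LAST </div>, so only the outermost pair is removed.
greedy = re.sub(r'<div class="s4">(.*)</div>', r'\1', SAMPLE)

# Minimal: ".*?" stops at the FIRST </div> after each opening tag.
lazy = re.sub(r'<div class="s4">(.*?)</div>', r'\1', SAMPLE)

print(greedy)
print(lazy)
```

The greedy result still contains the second `<div class="s4">` and is left with unbalanced tags; the minimal result unwraps both `s4` blocks and leaves the `s6` block alone.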
Old 12-11-2010, 06:50 AM   #4
kiwidude
calibre/Sigil Developer
Ahhh brilliant, thanks to both of you. RobW - yeah I was starting to wonder if it was something about how Sigil was using Regex, which got me looking elsewhere on the dialog to find the "minimal matching" checkbox which my brain had completely ignored until now.

And thanks to TheGreatGig for then confirming what I was about to try. The question mark was to request a non-greedy match, which I believe is "normal" regular expression syntax. I did not realise until just now that Sigil had this alternatively encapsulated into a simple checkbox to select.

Job done, thank you both.
Old 12-11-2010, 07:25 AM   #5
Ahmad Samir
Zealot
Posts: 114
Karma: 5246
Join Date: Jul 2010
Device: none
http://web.sigil.googlecode.com/hg/s...xpression-mode
Old 12-11-2010, 09:22 AM   #6
theducks
Grand Sorcerer
Posts: 15,244
Karma: 6020307
Join Date: Aug 2009
Location: (The original) Silicon Valley, USA
Device: Galaxy Tab 2, Astak Pocket Pro, K4NT
Not 100% tested

My experience is that "Tidy" finds the "extra" CLOSING tag and deletes that auto-magically.
I have deleted the opening tag... Presto, the closing tag is gone when you force a refresh (switching Code View <-> Book View).


Again, not 100% tested against all cases. This seems to work over one paragraph or the entire document.
Old 12-11-2010, 12:28 PM   #7
Perkin
Guru
Posts: 655
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
I think the best way would be to do as theducks said: search for <div class="s4"> and replace it with an empty string, and Sigil will remove the corresponding closing tag. I also think the search from the 3rd post wouldn't take into account things like having another div embedded inside a class="s4" div.
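The nesting concern can be sketched in Python (again an assumption; this is not how Sigil or Tidy work internally). The first part shows the minimal-match pattern pairing the outer opening tag with the inner closing tag. The second part is a hypothetical helper, `unwrap_divs`, that tracks nesting with a stack; it assumes well-formed markup and an exact match on the opening tag text.

```python
import re

# Made-up sample: an s4 div that contains another div.
NESTED = '<div class="s4"><div class="inner"><p>x</p></div><p>y</p></div>'

# The minimal match stops at the inner </div>, so the wrong pair is removed.
naive = re.sub(r'<div class="s4">(.*?)</div>', r'\1', NESTED)

TAG = re.compile(r'<div\b[^>]*>|</div>')

def unwrap_divs(html, opening='<div class="s4">'):
    """Hypothetical helper: drop `opening` tags and their matching </div>."""
    out = []
    stack = []  # one entry per open div: True if that div is being unwrapped
    pos = 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])  # text between tags is kept as-is
        pos = m.end()
        tag = m.group()
        if tag == '</div>':
            drop = stack.pop() if stack else False
        else:
            drop = (tag == opening)
            stack.append(drop)
        if not drop:
            out.append(tag)
    out.append(html[pos:])
    return ''.join(out)

fixed = unwrap_divs(NESTED)
print(naive)
print(fixed)
```

The naive result happens to be well-formed, but the inner div now wraps the `y` paragraph it did not originally contain; the stack-based version keeps the inner div intact.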
Old 12-11-2010, 01:08 PM   #8
Jellby
frumious Bandersnatch
Posts: 6,309
Karma: 4898871
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
It's sometimes easier to first replace some things with single characters that are not used anywhere else (¬ and | are likely), and then do further regex work with them, because negative patterns are easier with single characters.

For instance, if you first replace every <i> with ¬ and every </i> with |, you can now find nested italics markup with "¬[^|]*¬".
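The placeholder trick can be sketched in Python (an assumption for illustration; the sample text is made up). It only works if the two placeholder characters do not already occur in the text, so the sketch checks that first.

```python
import re

# Made-up sample: the first italic span contains a nested one.
text = 'a <i>b <i>c</i> d</i> e <i>f</i>'

OPEN, CLOSE = '¬', '|'

# The trick relies on the placeholders being unused in the document.
assert OPEN not in text and CLOSE not in text

# Step 1: swap the multi-character tags for single characters.
marked = text.replace('<i>', OPEN).replace('</i>', CLOSE)

# Step 2: an opening mark followed by another opening mark,
# with no closing mark in between, means nested italics.
nested = re.findall(r'¬[^|]*¬', marked)
print(marked)
print(nested)
```

Inside the square brackets `|` is a literal character, so it needs no escaping there.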
Old 12-11-2010, 01:45 PM   #9
Perkin
Guru
The | character is used by regex itself, as an 'or', so use something different
Old 12-11-2010, 02:15 PM   #10
kiwidude
calibre/Sigil Developer
I was fortunate that, as stated above, the div tags were not nested, so using that checkbox did what I wanted. Thanks also for the heads up on the "auto tag cleanup" possibility too. I'm very new to Sigil so just finding out some of its "tricks" (and unexpected quirks sometimes because of them).

How often does Sigil get released/updated? Like any other software there are a bunch of minor things that either annoy by omission or behave in a way that means a lot of repeated keyboard/mouse swapping actions I know could be streamlined. Is it worth diving into the source to hack around or should I just be patient?
Old 12-12-2010, 05:25 AM   #11
Jellby
frumious Bandersnatch
Quote:
Originally Posted by Perkin View Post
The | character is used by regex itself, as an 'or', so use something different
Anyway, it can be referred to in regex with \| if needed (sometimes the backslash is not needed inside the brackets). Many-character expressions are not so easy to exclude, at least in the regex dialects I've seen.
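Both points can be sketched in Python's `re` module (an assumption; other dialects may differ, and not every engine supports lookahead). The first part shows `\|` and `[|]` behaving the same. The second part shows one common way to exclude a multi-character sequence, a negative lookahead checked before every character, applied here to the nested-italics search without placeholder characters.

```python
import re

# A literal "|" can be written escaped, or unescaped inside brackets.
escaped = re.split(r'\|', 'a|b')
in_class = re.split(r'[|]', 'a|b')

# "Any character, as long as </i> does not start here", repeated,
# then a second <i>: an opening tag reached before any closing tag.
NESTED_I = re.compile(r'<i>(?:(?!</i>).)*?<i>', re.DOTALL)

hit = NESTED_I.search('a <i>b <i>c</i> d</i>')
miss = NESTED_I.search('x <i>a</i> y <i>b</i>')
print(escaped, in_class, hit.group(), miss)
```

The lookahead version is harder to read than the single-character class, which is the trade-off the placeholder trick avoids.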
Old 12-12-2010, 07:11 AM   #12
Ahmad Samir
Zealot
Quote:
Originally Posted by kiwidude View Post
I was fortunate that, as stated above, the div tags were not nested, so using that checkbox did what I wanted. Thanks also for the heads up on the "auto tag cleanup" possibility too. I'm very new to Sigil so just finding out some of its "tricks" (and unexpected quirks sometimes because of them).

How often does Sigil get released/updated? Like any other software there are a bunch of minor things that either annoy by omission or behave in a way that means a lot of repeated keyboard/mouse swapping actions I know could be streamlined. Is it worth diving into the source to hack around or should I just be patient?
I'd say often; check the changelog: https://sigil.googlecode.com/hg/ChangeLog.txt