Reg-ex help...?

ElMiko · 12-05-2011, 02:43 PM

Hi, all. I'm back for some additional reg-ex instruction...

I'm trying to remove <div> tags from a document in bulk, but can't seem to figure out what expression I should be using to find them.

Here's a sample of the code I'm working on:

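Spoiler:

<div class="calibre1">
<p class="calibre2"><span class="none">Some text that I want to keep.</span></p>
</div>
<div class="calibre1">
<p class="calibre5"><span class="none1">Some DIFFERENT text that I want to keep.</span></p>
</div>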

Now, the expression I used (with the intent of replacing it with "\1") was:

Code:

<div class="calibre1">([^<]*)</div>

Naturally, it didn't work ("no results found"). I think I discovered the reason: the "[^<]*" in "([^<]*)" stops at the first "<" it meets, i.e. the "<" in "<p class...", so the pattern never gets as far as "</div>". The bad news is that knowing the problem hasn't helped me find the solution. I've tried a bunch of other iterations that either match too much (the entire document) or nothing at all.

Can someone let me know where exactly my brain is letting me down?

Jabby · 12-05-2011, 03:26 PM

Find: <div class="calibre1">(.*)</div>
Replace: \1

Should do it.

Regard- John

Serpentine · 12-05-2011, 03:28 PM

If you want to get rid of all divs, it's easy enough to just delete all div tags. This is most likely fine for most uses, since paragraphs and such don't need to be contained in a div, but be careful: title/foreword pages can get a bit strange if there's lots of silly CSS.

Anyway, just use:

Code:

</?div\b[^<>]*>

And replace with nothing.
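If you want to try the pattern outside Sigil, here is a minimal sketch in Python. Using Python's `re` module is an assumption for illustration only (Sigil's own regex engine may differ in details), and `strip_divs` is a made-up helper name. The `sample` string is modelled on the markup from the first post.

```python
import re

# Sample markup modelled on the first post in this thread.
sample = (
    '<div class="calibre1">\n'
    '<p class="calibre2"><span class="none">Some text that I want to keep.</span></p>\n'
    '</div>\n'
)

# Matches opening and closing div tags, with or without attributes.
DIV_TAG = re.compile(r'</?div\b[^<>]*>')


def strip_divs(html: str) -> str:
    """Delete every <div ...> and </div> tag, leaving the content in place."""
    return DIV_TAG.sub('', html)


result = strip_divs(sample)
print(result)
```

The paragraph and span tags survive untouched; only the div tags themselves are removed, which is the same effect as replacing with nothing in the Find & Replace dialog.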

ElMiko · 12-05-2011, 03:42 PM

Thanks, guys!

@Jabby - Unfortunately, I tried that one, too. It just selects the entire document! Yikes!

@Serpentine - You say it's easy enough to just delete all the div tags. And I noticed that that reg-ex will do just that, but did you mean there's an easy way to delete all div tags without writing reg-ex? As always, if i could impose on you to explain part of your code, too, I'd be most grateful. Specifically: "\b[^<>]*". Thanks

Serpentine · 12-05-2011, 04:13 PM

Quote:

Originally Posted by ElMiko

did you mean there's an easy way to delete all div tags without writing reg-ex? As always, if i could impose on you to explain part of your code, too, I'd be most grateful. Specifically: "\b[^<>]*". Thanks

Nope, there's no direct XML manipulation like that in Sigil - I just meant that rather than replacing the div tags with their content, just deleting the tags themselves is easier.

\b[^<>]* is just a 'nice' way of dealing with tags and attributes.
\b matches at either end of a word, where a word is generally anything that matches \w+.
The \b stops the pattern from matching only part of a tag name. It's a habit from searching for something like a <p> tag, where you need to be careful to avoid <pre> tags.

Code:

<p([^<>]*)> // will match both <p yup="1"> and <pre something="wat">
<p\b[^<>]*> // will match p's but not pre's.
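A quick way to check the two patterns above is to run them against a few tags. This is a small sketch in Python (an assumption for illustration; the thread itself is about Sigil), and the names `loose` and `strict` are made up here.

```python
import re

loose = re.compile(r'<p([^<>]*)>')   # no word boundary after the "p"
strict = re.compile(r'<p\b[^<>]*>')  # the tag name must end right after "p"

tags = ['<p yup="1">', '<pre something="wat">', '<p>']

for tag in tags:
    # Print whether each pattern matches the whole tag.
    print(tag, bool(loose.fullmatch(tag)), bool(strict.fullmatch(tag)))
```

In `<pre ...>` the "p" is followed by another word character, so there is no word boundary at that position and the `\b` version refuses to match.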

Using [^<>]*> rather than the more common [^>]*> is a measure to avoid destroying badly formatted tags. It's not a huge problem, but if a closing > has been removed by mistake, this stops the pattern from matching the content and the following tag(s).

Code:

Using the sample : <p Some text here</p>
</?p\b[^<>]*> : matches only "</p>"
</?p\b[^>]*> : matches all of "<p Some text here</p>"

Not a very good example, but with nested tags, you can run into some pretty nasty stuff - can always avoid it by validating tho
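To illustrate the nested-tag case, here is a hedged sketch in Python (again an assumption for illustration; Sigil's engine may behave slightly differently). The broken `sample` line is invented for this example: the opening `<p` tag has lost its closing `>` and is followed by a `<b>` tag.

```python
import re

# Invented example: the opening <p ...> tag is missing its closing ">".
sample = '<p class="a" Some <b>bold</b> text</p>'

careful = re.compile(r'</?p\b[^<>]*>')  # refuses to cross a "<"
common = re.compile(r'</?p\b[^>]*>')    # runs on until the next ">"

careful_hits = careful.findall(sample)
common_hits = common.findall(sample)
print(careful_hits)
print(common_hits)

# What is left after "replace with nothing" in each case.
print(careful.sub('', sample))
print(common.sub('', sample))
```

The `[^>]*` version swallows the text up to and including the `<b>` tag, so deleting its matches leaves an orphaned `</b>`. The `[^<>]*` version skips the broken opening tag and only touches the well-formed `</p>`.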

ElMiko · 12-05-2011, 04:28 PM

@Serpentine - Thank you for your patience. Sincerely. I make a real effort not to ask questions whose answers I can extrapolate from previous answers to previous questions that I've asked. The only way to reliably do that is to understand why those previous answers worked the way they did. When the more experienced users (such as yourself) break down the reg-ex logic, it's truly invaluable to me. So, again: sincerely grateful.

st_albert · 12-05-2011, 07:40 PM

Quote:

Originally Posted by ElMiko

Thanks, guys!

@Jabby - Unfortunately, I tried that one, too. It just selects the entire document! Yikes!

Did you have the "minimal matching" option checked?

ElMiko · 12-05-2011, 07:44 PM

@ st_albert - nope. I'll try that on the next file i'm messing with. I remember looking up what "minimal matching" meant in the Sigil tutorial, but I guess I didn't/don't really understand. Could you explain it to me?

st_albert · 12-05-2011, 08:30 PM

Quote:

Originally Posted by ElMiko

@ st_albert - nope. I'll try that on the next file i'm messing with. I remember looking up what "minimal matching" meant in the Sigil tutorial, but I guess I didn't/don't really understand. Could you explain it to me?

I'll try. Basically, with "minimal matching", the selected string will be the shortest one possible that matches the search pattern. Without "minimal matching" the string will be the LONGEST one possible that matches the pattern. Since the pattern Jabby proposed,

Code:

find:  <div>(.*)</div>

contains a "match any number of characters" part [i.e. .*], without minimal matching the match will run from the first <div> all the way to the LAST </div>. With minimal matching enabled, the selection will stop at the FIRST </div> after the <div>.
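The difference can be reproduced in a small Python sketch. Two assumptions here: Python spells minimal matching as `.*?` rather than a checkbox, and `re.DOTALL` is used so that `.` can cross line breaks, which is a guess at how the search behaved in this thread. The `sample` string is a shortened stand-in for the markup in the first post.

```python
import re

# Two div blocks, shortened from the sample in the first post.
sample = (
    '<div class="calibre1">\n<p>First.</p>\n</div>\n'
    '<div class="calibre1">\n<p>Second.</p>\n</div>\n'
)

# re.DOTALL lets "." match newlines too (assumed to mirror the behaviour seen here).
greedy = re.compile(r'<div class="calibre1">(.*)</div>', re.DOTALL)
minimal = re.compile(r'<div class="calibre1">(.*?)</div>', re.DOTALL)

greedy_matches = greedy.findall(sample)    # captured group of each match
minimal_matches = minimal.findall(sample)

print(len(greedy_matches), greedy_matches)
print(len(minimal_matches), minimal_matches)

# Replacing each match with its captured content (the "\1" replacement).
print(minimal.sub(r'\1', sample))
```

The greedy pattern finds a single match that swallows both blocks, including the inner `</div>` and the second opening tag. The minimal version finds one match per div, so replacing with `\1` unwraps each block separately.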

Probably Serpentine could explain this more succinctly.

theducks · 12-05-2011, 08:30 PM

Quote:

Originally Posted by ElMiko

@ st_albert - nope. I'll try that on the next file i'm messing with. I remember looking up what "minimal matching" meant in the Sigil tutorial, but I guess I didn't/don't really understand. Could you explain it to me?

KISS or don't be greedy with the match.
I leave it ticked. Case only gets enabled when I want the exact case to match.

ElMiko · 12-05-2011, 10:29 PM

@st_albert - That was perfectly clear. Thank you. I had always simply assumed it was selecting the entire document, but your explanation makes total sense. I'll be sure to keep that box checked from now on.

---

For the record, I am also not agree with topicstarter. Topicstarter excessive political opinionation on regular expressions. I am disappoint with topicstarter.

Toxaris · 12-06-2011, 12:41 AM

Quote:

Originally Posted by RokkyR

I'm not fully agreed with topicstarter

Very interesting first post. Let me ponder about this...

