MobileRead Forums - View Single Post

eschwartz · 07-28-2014, 09:52 AM

Quote:

Originally Posted by phossler

@eschwartz--

1. Can you explain how the negative look ahead works, including breaking down the pieces of the RE?

2. Many times when I'm cleaning an epub, removing unneeded 'class=" ... " ' in <span class="..."> I'll eventually end up with a lot of <span>.....</span> constructs. It appears that your RE is better than the more simplistic RE I was using to just remove them

Thanks

This website has a very good regex explanation, which is how I learned about it.

http://www.regular-expressions.info/lookaround.html

They provide a thorough explanation, and break down the examples.

For my example:

Code:

<span class="none2">((?:(?!<span).)*?)</span>

Search for

Code:

<span class="none2">inner text</span>

"inner text" itself is a little more complicated, though:

Code:

((?:(?!<span).)*?)

We capture everything as "\1". Inside the capture group is:

Code:

(?:(?!<span).)*?

The main search (finally) is:

Code:

(?:(?!<span).)

a non-capturing group, which is repeated zero or more times -- yes, we can repeat whole groups.
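(plus a confusing "?" that I seem to have copied from the original source. It is not needed to make the group optional, since the star already allows zero repetitions; placed after a quantifier, "?" makes it lazy, so the match stops at the first "</span>" it can reach.)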
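A quick way to see what the trailing "?" does is to run the pattern with and without it. This is only a sketch in Python's re module (an assumption, since the editor's regex engine may differ slightly), with a made-up sample where the "none2" span sits inside another span:

```python
import re

# Hypothetical sample: the "none2" span is nested inside an outer span.
text = '<span class="outer"><span class="none2">one</span> two</span>'

# Same pattern with and without the "?" after the star.
lazy = re.compile(r'<span class="none2">((?:(?!<span).)*?)</span>', re.DOTALL)
greedy = re.compile(r'<span class="none2">((?:(?!<span).)*)</span>', re.DOTALL)

lazy_inner = lazy.search(text).group(1)
greedy_inner = greedy.search(text).group(1)

print(lazy_inner)    # one
print(greedy_inner)  # one</span> two
```

Without the "?", the repeated group runs on past the first "</span>" (which is not a "<span", so the lookahead allows it) and the capture swallows the closing tag.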

This group contains the negative lookahead (a zero-length assertion)

Code:

(?!<span)

which asserts that "<span" does not start at the current position, followed by a dot (with dot-matches-all turned on) that consumes one character.
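To see the lookahead-plus-dot step on its own, here is a small sketch using Python's re module (an assumption on my part; re.DOTALL stands in for the dot-matches-all option). The sample strings are made up:

```python
import re

# One step of the repeated group: check the lookahead, then consume one character.
step = re.compile(r'(?!<span).', re.DOTALL)

ok = step.match('abc')           # "<span" does not start here, so "." takes "a"
blocked = step.match('<span>')   # "<span" starts here, so the lookahead fails
closing = step.match('</span>')  # "</span" is not "<span", so "<" is consumed

# The lookahead by itself matches without consuming anything (zero length).
zero_len = re.match(r'(?!<span)', 'abc').end()

print(ok.group(0))       # a
print(blocked)           # None
print(closing.group(0))  # <
print(zero_len)          # 0
```

Note that the lookahead alone does not stop at "</span>"; stopping there is the job of the literal "</span>" at the end of the full pattern.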

So, putting it all back together: the dot is guarded by the negative lookahead, so it matches any character that does not begin a "<span". That guarded dot is then grouped and repeated zero or more times, and the result is captured as "\1" to produce the "inner text", which is the part to keep from between the span tags.
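As a rough end-to-end check, the whole expression can be tried in Python's re module. This is a sketch under a few assumptions: the sample markup is invented, re.DOTALL stands in for dot-matches-all, and the replacement is taken to be just "\1":

```python
import re

pattern = re.compile(r'<span class="none2">((?:(?!<span).)*?)</span>', re.DOTALL)

# Hypothetical markup: one removable span, one span that should be left alone.
html = '<p><span class="none2">plain <i>text</i></span> and <span class="keep">styled</span></p>'
cleaned = pattern.sub(r'\1', html)
print(cleaned)  # <p>plain <i>text</i> and <span class="keep">styled</span></p>

# A "none2" span that contains another span is not matched, so it is left untouched.
nested = '<span class="none2">a <span class="b">b</span> c</span>'
untouched = pattern.sub(r'\1', nested)
print(untouched == nested)  # True
```

The second case is the point of the lookahead: a span that still has another span inside it is skipped rather than paired with the wrong closing tag.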