Quote:
Originally Posted by phossler
@eschwartz--
1. Can you explain how the negative look ahead works, including breaking down the pieces of the RE?
2. Many times when I'm cleaning an epub, removing unneeded 'class=" ... " ' in <span class="..."> I'll eventually end up with a lot of <span>.....</span> constructs. It appears that your RE is better than the more simplistic RE I was using to just remove them
Thanks
This website has a very good explanation of lookaround, which is where I learned about it:
http://www.regular-expressions.info/lookaround.html
They provide a thorough explanation, and break down the examples.
For my example:
Code:
<span class="none2">((?:(?!<span).)*?)</span>
This searches for text of the form:
Code:
<span class="none2">inner text</span>
"inner text" itself is a little more complicated, though:
We capture everything between the outer parentheses as "\1". Inside those parentheses is the main search.
The main search (finally) is "(?:(?!<span).)*?", which is:
a non-capturing group, which is repeated zero or more times -- yes, we can repeat whole groups.
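(plus a confusing "?" after the star, which I seem to have copied from the original source. It is not what makes the group optional, since the star already allows zero repetitions; it makes the star lazy, so the match ends at the first "</span>" it reaches.)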
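This group contains the negative lookahead "(?!<span)", a zero-length assertion
which succeeds only if "<span" does not begin at the current position, followed by a dot, which matches any single character (including a newline when dot-matches-all is enabled).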
So, putting it all back together: each dot must be preceded by a successful negative lookahead, and this "any character that does not start a <span tag" is grouped and repeated zero or more times, then captured as "\1" to produce the "inner text", which is the part saved from between the span tags.
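Here is a minimal Python sketch of the breakdown above, assuming Python's re module treats this pattern the same way your editor's regex engine does. The names SPAN_RE and unwrap_none2 and the sample string are made up for illustration. It spells the pattern out piece by piece using re.VERBOSE and shows that, on nested spans, only the innermost one is unwrapped in a single pass:

Code:
```python
import re

# The pattern from the post, one piece per line. re.VERBOSE ignores
# unescaped whitespace, so the literal space in the tag is escaped.
# re.DOTALL is an assumption: it stands in for "dot matches all".
SPAN_RE = re.compile(r'''
    <span\ class="none2">   # literal opening tag
    (                       # start of capture group 1
      (?:                   # non-capturing group, repeated by the quantifier below
        (?!<span)           # negative lookahead: "<span" must not begin here
        .                   # then consume one character
      )*?                   # zero or more times, as few as possible
    )                       # end of capture group 1
    </span>                 # literal closing tag
''', re.VERBOSE | re.DOTALL)


def unwrap_none2(html: str) -> str:
    """Replace each matched span with its captured inner text (group 1)."""
    return SPAN_RE.sub(r'\1', html)


# Hypothetical sample with one none2 span nested inside another.
sample = ('<p><span class="none2">outer '
          '<span class="none2">inner text</span> tail</span></p>')

print(unwrap_none2(sample))
# -> <p><span class="none2">outer inner text tail</span></p>
```

The outer span does not match on the first pass because the lookahead refuses to step over the nested "<span". Running the replacement a second time unwraps what is left, which corresponds to repeating Replace All in the editor until nothing more matches.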