Old 07-28-2014, 09:52 AM   #16
eschwartz
Ex-Helpdesk Junkie
Quote:
Originally Posted by phossler View Post
@eschwartz--

1. Can you explain how the negative look ahead works, including breaking down the pieces of the RE?

2. Many times when I'm cleaning an epub, removing unneeded 'class=" ... " ' in <span class="..."> I'll eventually end up with a lot of <span>.....</span> constructs. It appears that your RE is better than the more simplistic RE I was using to just remove them

Thanks
This website has a very good explanation of lookaround, which is where I learned about it.
http://www.regular-expressions.info/lookaround.html

They provide a thorough explanation, and break down the examples.

For my example:

Code:
<span class="none2">((?:(?!<span).)*?)</span>
This searches for:
Code:
<span class="none2">inner text</span>
"inner text" itself is a little more complicated, though:
Code:
((?:(?!<span).)*?)
We capture everything as "\1". Inside the capture group is:
Code:
(?:(?!<span).)*?
The main search (finally) is:
Code:
(?:(?!<span).)
a non-capturing group, which is repeated zero or more times by the star. Yes, we can repeat whole groups.
(The trailing "?" looks confusing, but it is not redundant: it makes the star lazy, so the repetition takes as few characters as possible and the match ends at the first "</span>" instead of the last one.)
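
To see what the trailing "?" does in practice, here is a small Python sketch comparing the pattern with and without it. The sample text and variable names are made up for illustration, and Python's re module stands in for whatever editor you are using, so treat it as a rough demonstration only.

```python
import re

# Hypothetical sample: a "none2" span sitting inside an outer span.
text = '<span>outer <span class="none2">inner</span> tail</span>'

# With the trailing "?" the star is lazy and stops at the first </span>.
lazy = re.search(r'<span class="none2">((?:(?!<span).)*?)</span>', text)

# Without it the star is greedy and runs on to the last </span> it can reach.
greedy = re.search(r'<span class="none2">((?:(?!<span).)*)</span>', text)

print(lazy.group(1))    # inner
print(greedy.group(1))  # inner</span> tail
```

The lookahead only stops the repetition at an opening "<span", so without the "?" nothing prevents it from running past a closing tag.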

This group contains the negative lookahead (a zero-length assertion)
Code:
(?!<span)
which asserts that "<span" does not start at the current position. It is followed by a dot (in dot-matches-all mode), which consumes one character.
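
A quick way to convince yourself that the lookahead is zero-length is to list where the guarded dot is allowed to match. This is a minimal Python sketch; the sample string and the names guarded_char, positions and run are invented for the demonstration.

```python
import re

# (?!<span) succeeds only where "<span" does NOT start; it consumes nothing.
# The dot after it then consumes exactly one character.
guarded_char = re.compile(r'(?!<span).')

text = 'ab<span>cd'

# Offsets at which the guarded dot can match a single character.
positions = [m.start() for m in guarded_char.finditer(text)]
print(positions)  # [0, 1, 3, 4, 5, 6, 7, 8, 9]

# Repeating the group from the start of the text stops right before "<span".
run = re.match(r'(?:(?!<span).)*', text).group(0)
print(run)  # ab
```

Only offset 2, where "<span" begins, is rejected. Every other character is fair game, including the rest of the tag itself.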

So, putting it all back together: each dot-matches-all must be preceded by a successful negative lookahead, and this "any character that does not start a span tag" is then grouped and repeated zero or more times, then captured as "\1". That capture is the "inner text" from between the span tags, which we want to keep.
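
Putting the whole thing to work, a search and replace with "\1" as the replacement might look like this in Python. It is only a sketch: the function name unwrap_none2 and the sample markup are hypothetical, and re.DOTALL is my assumption for how to get dot-matches-all behaviour in Python.

```python
import re

# The full expression from above. re.DOTALL lets "." match newlines too.
pattern = re.compile(
    r'<span class="none2">((?:(?!<span).)*?)</span>',
    re.DOTALL,
)

def unwrap_none2(html: str) -> str:
    """Replace each matching span with its captured inner text (\\1)."""
    return pattern.sub(r'\1', html)

sample = ('<p><span class="none2">Hello</span> '
          '<span class="none2">wide <i>world</i></span></p>')
cleaned = unwrap_none2(sample)
print(cleaned)  # <p>Hello wide <i>world</i></p>

# A span that contains another span is left alone, because the
# lookahead refuses to step over the nested "<span".
nested = '<span class="none2">a <span class="x">b</span> c</span>'
print(unwrap_none2(nested) == nested)  # True
```

Leaving the nested case untouched is the point of the lookahead: a simpler expression would pair the outer opening tag with the inner closing tag and leave the markup unbalanced.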