Quote:
Originally Posted by bucsie
I am not a newbie with regexes, but this one is difficult to follow, and there are a couple of things I don't understand, like what this is for: (?<= or this one: (?=
Those funny non-standard RE constructs are lookaround assertions. They are described in the Python re module documentation:
http://docs.python.org/library/re.html
Code:
(?i) # switch ignorecase on
(?<=<hr>) # lookbehind assertion. It means the following RE can only match if <hr> precedes it. The <hr> string will not be part of the resulting match
( # beginning of the main RE
( # beginning of Group A RE
\s*<a name=\d+></a>
(
(<img.+?>)* # +? is a non-greedy quantifier. It means match as few characters as possible, but at least one character. I personally would have written (<img[^>]+>)* here (see the first post for explanation)
<br>
\s*
)?
\d+
<br>
\s*
.*? # *? is a non-greedy quantifier. It means match as few characters as possible (possibly none)
\s*
) # end of Group A RE
| # Group A OR Group B will be matched
( # beginning of Group B RE
\s*
<a name=\d+></a>
(
(<img.+?>)* # +? is a non-greedy quantifier. It means match as few characters as possible, but at least one character. I personally would have written (<img[^>]+>)* here
<br>
\s*
)?
.*? # *? is a non-greedy quantifier. It means match as few characters as possible (possibly none)
<br>
\s*
\d+
) # end of Group B RE
) # end of the main RE
(?=<br>) # lookahead assertion. It means that the preceding RE can only match if it is followed by a <br>. The <br> string will not be part of the resulting match
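If you want to play with the individual pieces, here is a small Python sketch of the three ideas used above: lookbehind/lookahead, greedy versus non-greedy quantifiers, and the [^>]+ alternative. The sample HTML strings and variable names are made up for illustration and are not taken from your document. One caveat I am assuming matters here: if you compile a commented pattern like the one above with re.VERBOSE, unescaped whitespace in the pattern is ignored, so the space in <a name= has to be written as [ ] or "\ ".

```python
import re

# Hypothetical sample input, only for demonstration.
html = "<HR><a name=12></a><img src=x.png><br>\n42<br>\nsome text<br>"

# Lookbehind (?<=<hr>) and lookahead (?=<br>): the tags must be present,
# but they are not part of the match. (?i) makes <HR> match <hr>.
between = re.search(r"(?i)(?<=<hr>).*?(?=<br>)", html).group(0)
print(between)  # <a name=12></a><img src=x.png>

imgs = "<img src=a><img src=b>"

# Greedy .+ runs to the LAST '>' it can reach.
greedy = re.findall(r"<img.+>", imgs)
# Non-greedy .+? stops at the FIRST '>' that lets the match succeed.
lazy = re.findall(r"<img.+?>", imgs)
# [^>]+ cannot cross a '>' at all, so no backtracking games are needed.
negated = re.findall(r"<img[^>]+>", imgs)
print(greedy)   # ['<img src=a><img src=b>']
print(lazy)     # ['<img src=a>', '<img src=b>']
print(negated)  # ['<img src=a>', '<img src=b>']

# Commented (verbose) pattern: whitespace is ignored under re.VERBOSE,
# so the literal space in "<a name=" is written as [ ].
verbose_pat = re.compile(
    r"""
    (?<=<hr>)               # must be preceded by <hr>
    \s*<a[ ]name=\d+></a>   # the anchor tag itself
    (?=<br>)                # must be followed by <br>
    """,
    re.IGNORECASE | re.VERBOSE,
)
anchor = verbose_pat.search("<hr><a name=12></a><br>").group(0)
print(anchor)  # <a name=12></a>
```

The lazy and [^>]+ versions give the same result on this input. The difference shows up on malformed input: .+? will happily run past a missing '>' into the next tag, while [^>]+ will not.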