11-10-2010, 05:09 AM   #78
kacir
Quote:
Originally Posted by bucsie
I am not a newbie with regex-es, but it's difficult to follow, and there are a couple of things I don't understand, like what is this for: (?<= or this one: (?=
Those funny non-standard RE constructs are lookbehind and lookahead assertions. They are described at
http://docs.python.org/library/re.html
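
A small illustration of the two constructs asked about, as a minimal sketch with Python's re module. The sample string and the variable names are made up for this example and do not come from the book being cleaned.

```python
import re

# Invented sample text, only for illustration.
sample = "<hr>first<br><p>second<br>"

# Without lookarounds, the delimiters become part of the match.
plain = re.search(r"<hr>\w+<br>", sample)

# With lookarounds, <hr> and <br> must still be present,
# but they are not consumed and not included in the match.
guarded = re.search(r"(?<=<hr>)\w+(?=<br>)", sample)

print(plain.group(0))    # <hr>first<br>
print(guarded.group(0))  # first

# A substitution therefore removes only the middle part.
cleaned = re.sub(r"(?<=<hr>)\w+(?=<br>)", "", sample)
print(cleaned)           # <hr><br><p>second<br>
```

Note that "second" is left alone: it is followed by <br>, but it is not preceded by <hr>, so the lookbehind fails there.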
Code:

(?i)   # switch ignorecase on
(?<=<hr>)   # lookbehind assertion. It means the following RE can only match if <hr> precedes it. The <hr> string will not be part of the resulting match
(      # beginning of the main RE
   (   # beginning of Group A Re
      \s*<a name=\d+></a>
      (
         (<img.+?>)*   # +? is a non-greedy quantifier: match as few characters as possible, but at least one. I personally would have written (<img[^>]+>)* here (see the first post for explanation)
         <br>
         \s*
      )?
      \d+
      <br>
      \s*
      .*?   # *? is a non-greedy quantifier: match as few characters as possible
      \s*
   )   # end of Group A RE
   |   # Group A OR Group B will be matched
   (   # beginning of Group B RE
      \s*
      <a name=\d+></a>
      (
         (<img.+?>)*   # +? is a non-greedy quantifier: match as few characters as possible, but at least one. I personally would have written (<img[^>]+>)* here
         <br>
         \s*
      )?
      .*?   # *? is a non-greedy quantifier: match as few characters as possible
      <br>
      \s*
      \d+
   )   # end of Group B RE
)      # end of the main RE
(?=<br>)   # lookahead assertion. It means that the preceding RE can only match if it is followed by a <br>. The <br> string will not be part of the resulting match
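
To try the annotated layout above as a real pattern, Python's re.VERBOSE flag makes the engine ignore the whitespace and the # comments. The sketch below is a compacted copy of the expression under two assumptions that are not in the original post: the flags are passed to re.compile() instead of a leading (?i), and the space in <a name= is escaped, because verbose mode would otherwise drop it. The sample string and the name HEADER_RE are invented for the example.

```python
import re

# Compacted copy of the annotated expression. Assumptions for this sketch:
# flags are passed to re.compile() instead of a leading (?i), and the space
# in "<a name=" is written as "\ " so that verbose mode keeps it.
HEADER_RE = re.compile(r"""
(?<=<hr>)                          # lookbehind: <hr> must precede the match
(
   (                               # Group A: number, then text
      \s*<a\ name=\d+></a>
      ( (<img.+?>)* <br> \s* )?
      \d+ <br> \s* .*? \s*
   )
   |
   (                               # Group B: text, then number
      \s*<a\ name=\d+></a>
      ( (<img.+?>)* <br> \s* )?
      .*? <br> \s* \d+
   )
)
(?=<br>)                           # lookahead: <br> must follow the match
""", re.IGNORECASE | re.VERBOSE)

# Invented sample, not taken from a real book.
sample = "<hr><a name=5></a>12<br>Book Title<br>Body text."

match = HEADER_RE.search(sample)
print(match.group(0))              # <a name=5></a>12<br>Book Title

# Removing the match leaves the surrounding <hr> and <br> in place.
print(HEADER_RE.sub("", sample))   # <hr><br>Body text.

# The pitfall: an unescaped space is silently ignored in verbose mode.
print(re.search(r"<a name=\d+>", sample, re.VERBOSE))    # None
print(re.search(r"<a\ name=\d+>", sample, re.VERBOSE))   # a match object
```

If the expression is pasted as a single line into a tool that does not use verbose mode, the escaped space is still harmless, since "\ " matches a literal space either way.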


Last edited by kacir; 11-10-2010 at 05:14 AM.