MobileRead Forums - View Single Post - Regular expressions, Calibre and you- an introduction (Archived)

kacir · 11-10-2010, 05:09 AM

Quote:

Originally Posted by bucsie

I am not a newbie with regex-es, but it's difficult to follow, and there are a couple of things I don't understand, like what is this for: (?<= or this one: (?=

Those funny non-standard RE constructs are lookaround assertions (lookbehind and lookahead). They are described in the Python re module documentation at
http://docs.python.org/library/re.html
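As a quick illustration of those two constructs, here is a minimal sketch in Python (the sample string is made up for the example, it is not from your book). It shows that the text inside (?<=...) and (?=...) has to be present for the match to succeed, but is not included in the match itself:

```python
import re

# Hypothetical sample input, invented for this illustration.
html = "<hr>Chapter 1<br>"

# Without lookarounds, the delimiters are part of the match.
plain = re.search(r"<hr>.*?<br>", html).group(0)

# With lookbehind/lookahead, <hr> and <br> must be there,
# but they are not consumed by the match.
inner = re.search(r"(?<=<hr>).*?(?=<br>)", html).group(0)

# Replacing the match therefore leaves the delimiters in place.
cleaned = re.sub(r"(?<=<hr>).*?(?=<br>)", "", html)

print(plain)    # <hr>Chapter 1<br>
print(inner)    # Chapter 1
print(cleaned)  # <hr><br>
```

This is why the annotated expression can remove the header text while keeping the surrounding <hr> and <br> tags.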

Code:


(?i)   # switch ignorecase on
(?<=<hr>)   # lookbehind assertion. It means the following RE can only match if <hr> precedes it. The <hr> string will not be part of the resulting match
(      # beginning of the main RE
   (   # beginning of Group A RE
      \s*<a name=\d+></a>
      (
         (<img.+?>)*   # +? is a non-greedy quantifier. It means match as few characters as possible, but at least one character. I personally would have written (<img[^>]+>)* here (see the first post for an explanation)
         <br>
         \s*
      )?
      \d+
      <br>
      \s*
      .*?   # *? is a non-greedy quantifier. It means match as few characters as possible, possibly none
      \s*
   )   # end of Group A RE
   |   # Group A OR Group B will be matched
   (   # beginning of Group B RE
      \s*
      <a name=\d+></a>
      (
         (<img.+?>)*   # +? is a non-greedy quantifier. It means match as few characters as possible, but at least one character. I personally would have written (<img[^>]+>)* here
         <br>
         \s*
      )?
      .*?   # *? is a non-greedy quantifier. It means match as few characters as possible, possibly none
      <br>
      \s*
      \d+
   )   # end of Group B RE
)      # end of the main RE
(?=<br>)   # lookahead assertion. It means that the preceding RE can only match if it is followed by a <br>. The <br> string will not be part of the resulting match
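Two practical notes on the annotated version, sketched in Python with made-up sample strings (assumptions for illustration only). First, a non-greedy .+? can still run across a ">" when that is the only way for the rest of the pattern to match, which is the reason I prefer [^>]+ inside a tag. Second, "#" comments are only legal in verbose mode, and as far as I know verbose mode also ignores unescaped whitespace, so the space in "<a name" would have to be escaped if you wanted to run the expression in that layout:

```python
import re

# Hypothetical sample input, invented for this illustration.
html = '<img src="a.png"><p><img src="b.png"><br>'

# Lazy .+? starts at the first <img and keeps growing until "><br>" fits,
# so it swallows the <p> and the second <img> as well.
lazy = re.search(r"<img.+?><br>", html).group(0)

# [^>]+ can never leave the tag it started in.
negated = re.search(r"<img[^>]+><br>", html).group(0)

print(lazy)     # <img src="a.png"><p><img src="b.png"><br>
print(negated)  # <img src="b.png"><br>

# Verbose mode (re.VERBOSE, or (?x) in the pattern) allows "#" comments,
# but it also drops unescaped whitespace from the pattern.
anchor = "<a name=12></a>"
unescaped = re.search(r"<a name=\d+></a>   # anchor tag", anchor, re.VERBOSE)
escaped = re.search(r"<a\ name=\d+></a>  # anchor tag", anchor, re.VERBOSE)

print(unescaped)         # None
print(escaped.group(0))  # <a name=12></a>
```

If you strip the comments and the line breaks and use the expression as one line, the escaping issue goes away, because the space is then an ordinary literal character.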