MobileRead Forums - View Single Post - Regular expressions, Calibre and you- an introduction (Archived)

kacir · 09-23-2010, 05:01 PM

Great post.
I have a few suggestions.
At the very beginning of the first post you might put something like:
This is the fourth version of the guide; it was amended using various suggestions from subsequent posts.
I do not suggest this to get credit for a suggestion or two, but without this explanation a first-time reader might find some of the following posts ... superfluous.

I also suggest that the text of this post should be included in the official Calibre documentation, or at the very least, the Calibre documentation should point to this post.
Another good place for preserving this thread would be our Wiki.
Regular expressions are very widespread, and yet GOOD documentation that explains them from a beginner's point of view is relatively hard to find. The documentation for a programming language or a text editor is usually written as a reference manual, describing all the options in a rather terse, concentrated manner. As you can see for yourself, even a relatively simple description of a few selected features gets quite lengthy.

My favourite tool for using Regular Expressions is the Vim text editor. It also has some of the very best documentation I have seen. Unfortunately, its syntax is a little different from Python REs, but the principle remains the same.

----------------

Now, let's see how we can improve the introduction.

First of all, now that you have introduced the pipe '|' for providing different branches, you have to explain the rules of precedence a little bit ;-)
The pipe '|' has the lowest precedence. So if you write the RE 'abcd|efgh', it will match the whole string 'abcd' OR the whole string 'efgh', and not 'abc' followed by either 'd' or 'e' and then followed by 'fgh'. If we wanted to do that, we would have to write 'abc(d|e)fgh'.
I know, it should be obvious from your example, but there are a few interesting twists here.
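
The precedence rule can be checked quickly with Python's re module. This is only a small sketch; the variable names are mine, and fullmatch is used so that the whole string has to match, not just a part of it.

```python
import re

# '|' has the lowest precedence: the alternatives are all of 'abcd' and all of 'efgh'.
whole = re.compile(r'abcd|efgh')
# Parentheses restrict the alternation to one position in the pattern.
grouped = re.compile(r'abc(d|e)fgh')

for text in ['abcd', 'efgh', 'abcdfgh', 'abcefgh']:
    print(text,
          bool(whole.fullmatch(text)),
          bool(grouped.fullmatch(text)))
# abcd True False
# efgh True False
# abcdfgh False True
# abcefgh False True
```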

Now, I can hear you asking: so now, instead of '[1234]' I can write '(1|2|3|4)'? Well, yes, you can, and it matches the same strings. Both '[1234]+' and '(1|2|3|4)+' will match strings like '1212' or '444' or '34' - literally any member of the set followed by any other members of the set. The plus quantifier repeats the pattern in front of it, not the text that the pattern happened to match the first time. BUT! There are differences: the parenthesized version also remembers what it matched (only the last repetition, so '4' for '34'), and it is longer to type and usually slower. If you really want the same digit repeated, as in '111', '22' or '44444' but not '12' or '34', you have to refer back to the group with the \1 notation: '(1|2|3|4)\1+'.
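
In Python's re module, a quantifier after a parenthesized group repeats the pattern inside the parentheses, not the text that was matched the first time round. The sketch below (names are mine) compares the character class, the alternation group, and a variant with a backreference that really does insist on the same digit.

```python
import re

char_class = re.compile(r'[1234]+')
alternation = re.compile(r'(1|2|3|4)+')
# \1 refers back to whatever the first group actually matched.
same_digit = re.compile(r'(1|2|3|4)\1+')

for text in ['1212', '444', '34', '22']:
    print(text,
          bool(char_class.fullmatch(text)),
          bool(alternation.fullmatch(text)),
          bool(same_digit.fullmatch(text)))
# 1212 True True False
# 444 True True True
# 34 True True False
# 22 True True True

# A repeated group remembers only its last repetition.
last_digit = alternation.fullmatch('34').group(1)
print(last_digit)  # 4
```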

Let's get back to the precedence rules.
Quantifiers apply only to the preceding atom.
An atom (and that should have been explained at the very beginning, but we did not want to scare the reader away) is:
- a literal character, such as 'a', 'q', '2' or ';', that simply matches itself.
- a dot '.', which stands for any character (except a newline, by default).
- a special escape sequence, such as '\t' - a tab character, or '\D' - any non-digit character.
- a character class (set), such as [a-zA-Z] or [^>].
- several atoms that you want to make into one atom, enclosed in a pair of parentheses, such as (<[^>]+>).
So, if I write the RE 'ab+', it will match 'ab' or 'abbbbbb', but not the whole of 'abab', because the plus quantifier applies only to the preceding atom, the 'b'. If we wanted to match 'abab' or 'ababab', we would need to write the Regular Expression like this: '(ab)+'
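
The same rule, checked in Python as a minimal sketch (variable names are mine; fullmatch requires the whole string to match):

```python
import re

# The '+' applies only to the atom right before it: the single 'b'.
single = re.compile(r'ab+')
# Parentheses turn 'ab' into one atom, so '+' repeats the pair.
repeated = re.compile(r'(ab)+')

for text in ['ab', 'abbbbbb', 'abab', 'ababab']:
    print(text,
          bool(single.fullmatch(text)),
          bool(repeated.fullmatch(text)))
# ab True True
# abbbbbb True False
# abab False True
# ababab False True
```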

I will continue later. I am off to sleep now, but there are a few things that still need to be explained, such as:
- referring back to parenthesized groups using the \1, \3 notation
- anchors
- interesting extensions (? ... )
- more quantifiers {m,n} (not that I consider them particularly useful in the Regular Expressions typically used in Calibre.)

We should also develop a few very typical examples, useful for an ordinary user, such as processing a filename that *might* contain series information (here we will use the pipe '|' to process several branches, with and without series info).
So, please, if you want to solve your typical problem, post it here, so that we can develop some examples using real-life situations.
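
To get the ball rolling, here is a rough sketch of the "with and without series" idea in Python. Everything about it is an assumption made for illustration: the two filename layouts, the .epub extension, the group names (which are not necessarily the ones Calibre expects) and the helper parse_filename are all made up. It also uses the (?P<name>...) and (?:...) extensions and the anchors '^' and '$' that still need to be explained.

```python
import re

# Hypothetical filename layouts (assumptions, not Calibre defaults):
#   with series:    "Series Name [3] - Title - Author.epub"
#   without series: "Title - Author.epub"
FILENAME = re.compile(
    r'^(?:'
    r'(?P<series>.+?) \[(?P<index>\d+)\] - (?P<title_s>.+?) - (?P<author_s>.+?)'
    r'|'  # second branch: no series information
    r'(?P<title>.+?) - (?P<author>.+?)'
    r')\.epub$'
)

def parse_filename(filename):
    """Return a dict of the parts found, or None if neither branch matches."""
    m = FILENAME.match(filename)
    if m is None:
        return None
    g = m.groupdict()
    # Python does not allow the same group name in both branches,
    # so each branch has its own names and they are merged here.
    return {
        'series': g['series'],
        'index': int(g['index']) if g['index'] else None,
        'title': g['title_s'] or g['title'],
        'author': g['author_s'] or g['author'],
    }

print(parse_filename('Discworld [3] - Equal Rites - Terry Pratchett.epub'))
print(parse_filename('Equal Rites - Terry Pratchett.epub'))
```

The branch with the series information comes first, so the plain "Title - Author" branch is only tried when the first one fails.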

Disclaimer: Please feel free to use any portion of my text to improve the "introduction".
