MobileRead Forums - View Single Post - Regular expressions, Calibre and you- an introduction (Archived)

chaley · 09-21-2010, 03:23 AM

Thank you for writing this!

Comments below. Some are pedantic, but I can't help it.

Others are personal preference. All can be ignored.

Quote:

Originally Posted by Manichean

What on earth is a regular expression?
A regular expression is a way to describe a particular string of characters (string for short). Technical note: I'm using string here in the sense it is used in programming languages: a string of two or more characters, characters including actual characters, numbers, punctuation and so-called whitespaces (linebreaks, tabulators etc.). It gets complicated because regular expressions allow for variations in the strings it matches, so one expression can match multiple strings. More on that in a bit.

In regular-expression land (and in many programming languages), a single character is a string.

Quote:

Care to explain?
Well, that's why we're here. First, this is the most important concept in regular expressions: A string in itself is a regular expression that matches itself. That is to say, if I wanted to match the string "Hello, World!" using a regular expression, the regular expression to use would be

Code:

Hello, World!

Do you want to mention that 'H' and 'h' are different characters? Perhaps you do further down.
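To show what I mean, here is a minimal sketch in Python's re module (as far as I know, calibre's regexps follow the same syntax). The sample strings are made up for illustration.

```python
import re

pattern = "Hello, World!"

# Same case as the pattern: this matches.
exact = re.search(pattern, "Say: Hello, World!")

# 'h' and 'H' are different characters, so this does not match.
wrong_case = re.search(pattern, "Say: hello, world!")

# With the IGNORECASE flag the same pattern accepts either case.
ignore_case = re.search(pattern, "Say: hello, world!", re.IGNORECASE)
```

Here exact and ignore_case are match objects, while wrong_case is None.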

Quote:

And yes, it really is that simple. A word of warning: There are a handful of special characters that have some special function in regular expressions.
...

I would suggest that this complexity be pushed down in the document. I, the poor reader, don't know yet why I care. Introduce the need for escaping when the problem arises. Instead, give examples here. You might introduce ignoring case at this point in the examples.

Quote:

My head is spinning...
Already? We're only just getting to the good stuff. Okay, take a breath and relax... feeling better? I promise, I'll try to take it slow and keep it simple. Remember where I said that regular expressions can match multiple strings? This is where it gets a little more complicated.

You might want to start with a simpler example. One might be recognizing a particular author, say Pierre-Yves Trudeau. This author might appear as Pierre Yves Trudeau, P Y Trudeau, Pierre Trudeau, or P. Y. Trudeau. You decide that anything starting with P and ending with ' Trudeau' should match. This introduces '.' and quantifiers.
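A rough sketch of that example in Python, just to show the idea. The two non-matching names at the end of the list are my own additions for contrast.

```python
import re

# Anything that starts with P and ends with ' Trudeau'.
# '.' matches any one character; '*' repeats it zero or more times.
pattern = r"P.* Trudeau"

names = [
    "Pierre-Yves Trudeau",
    "Pierre Yves Trudeau",
    "P Y Trudeau",
    "Pierre Trudeau",
    "P. Y. Trudeau",
    "Margaret Trudeau",  # does not start with P
    "Pierre Berton",     # does not end with ' Trudeau'
]

# fullmatch requires the whole name to fit the pattern.
matched = [name for name in names if re.fullmatch(pattern, name)]
```

The first five names end up in matched; the last two do not.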

Quote:

Say, as a somewhat more practical exercise, the ebook you wanted to convert had a nasty footer counting the pages, like "Page 5 of 423". Obviously the page number would rise from 1 to 423, thus you'd have to match 423 different strings, right? Wrong, actually: regular expressions allow you to define groups of characters that are matched: To define a group, you put all the characters you want to be in the group into square brackets. So, for example, the group

Code:

[abc]

would match either the character "a", "b" or "c". Groups will always only match one of the characters in the group.

Unless the group is [^abc], in which case it will always match characters that are *not* in the group.
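For instance, a quick Python sketch (the word 'cabbage' is only a sample):

```python
import re

word = "cabbage"

# [abc] matches one character that IS a, b or c.
inside = re.findall(r"[abc]", word)

# [^abc] matches one character that is NOT a, b or c.
outside = re.findall(r"[^abc]", word)
```

inside collects the a, b and c characters of the word, and outside collects everything else.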

Quote:

Groups "understand" character ranges, that is, if you wanted to match all the lower case characters, you'd use the group

Code:

[a-z]

, for lower- and uppercase characters you'd use

Code:

[a-zA-Z]

and so on.

Consider whether or not you want to introduce the shorthand character classes. Perhaps not here, but maybe somewhere? I mean: '\d' == [0-9], '\D' == [^0-9], '\s' == (set of whitespace) (this one is important), '\w' == [a-zA-Z0-9_]. (Note: I see that you did this further down.)
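A small Python sketch of those shorthands, using the footer text from the tutorial plus a made-up filename. One caveat: in Python 3 these classes also match non-ASCII digits and letters unless the re.ASCII flag is given, so the equivalences above are the ASCII view.

```python
import re

footer = "Page 5 of 423"

# \d+ : runs of digits
digits = re.findall(r"\d+", footer)

# \D+ : runs of anything that is not a digit
non_digits = re.findall(r"\D+", footer)

# \s+ : runs of whitespace, used here as the separator
words = re.split(r"\s+", footer)

# \w+ : runs of letters, digits and underscore (hypothetical filename)
word_chars = re.findall(r"\w+", "my_file-01.txt")
```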

Quote:

Got the idea? ...
It works like this: Some of the special characters, "+", "?" and "*", repeat the character or group preceding them. These characters are called wildcards or quantifiers.

You might want to be precise with your wording. Using 'group' will get you into trouble eventually, when it gets confused with grouping for alternation (or) and for backreferences. I suggest that you use 'set' or 'class' for [] expressions, and reserve the word 'group' for parenthesized expressions.

You might also want to introduce the word 'element', which means a character or class or (eventually) group. Quantification applies to the previous element. (The computer scientist in me wants to get into recursion, but that would be a disaster.)
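To make 'element' concrete, here is a Python sketch where the same quantifier follows a character, a class and a parenthesized group. The test strings are invented.

```python
import re

# '+' after a single character: one or more 'a'
char_elem = re.fullmatch(r"a+", "aaa")

# '+' after a class: one or more characters, each either 'a' or 'b'
class_elem = re.fullmatch(r"[ab]+", "abba")

# '+' after a group: one or more repetitions of the sequence 'ab'
group_elem = re.fullmatch(r"(ab)+", "ababab")

# Without parentheses the '+' applies only to the 'b' before it,
# so 'ab+' means an 'a' followed by one or more 'b'.
only_b_repeats = re.fullmatch(r"ab+", "abbb")
not_the_pair = re.fullmatch(r"ab+", "ababab")
```

Everything here matches except not_the_pair, which is None.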

Quote:

To be more precise, "?" matches 0 or 1 of the preceding character/group,

This is where you could use 'element'.

Quote:

...
I know what you're thinking, and you're right: If you use that in the above case of matching page numbers, wouldn't that be the single one expression to match all the page numbers? Yes, the expression

Code:

Page [0-9]+ of 423

would match every page number in the book! And then some, but that's not the concern here.

I wouldn't include the last sentence. Even I don't understand what you are trying to say.
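My best guess at what 'and then some' means is that the expression also accepts page numbers that cannot occur in a 423-page book. A Python sketch of that reading (the sample footers are mine):

```python
import re

pattern = r"Page [0-9]+ of 423"

# The intended cases: real page numbers.
first_page = re.fullmatch(pattern, "Page 1 of 423")
last_page = re.fullmatch(pattern, "Page 423 of 423")

# 'And then some': [0-9]+ accepts any run of digits, even impossible ones.
impossible_page = re.fullmatch(pattern, "Page 99999 of 423")

# '+' needs at least one digit, so a missing number does not match.
missing_number = re.fullmatch(pattern, "Page  of 423")
```

Only missing_number is None; the other three match.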

Quote:

A note on these quantifiers: They generally try to match as much text as possible, so be careful when using them. This behaviour is called "greedy quantifiers"-

The behaviour is 'greedy'. The quantifier specifies whether or not the behavior is greedy.

Quote:

I'm sure you get why. This gets problematic when you, say, try to match a tag. Consider, for example, the string "<p class="calibre2">Title here</p>" and

Be careful with your choice of delimiters. Delimiting by " when there are embedded " can lead to confusion. I suggest that you use the CODE tags here, as you have most other places, and not delimit the string at all.

Quote:

let's say you'd want to match the opening tag (the part between the first pair of angle brackets, a little more on tags later). You'd think that the expression

Code:

<p.*>

would match that tag, but actually, it matches the whole string! (The character ".", as noted before,

I don't see where it was noted before.

Quote:

is a special character. It matches anything except linebreaks,

I know you don't want to introduce DOTALL here, but I want to make sure you know about it. If DOTALL is in effect, the dot will match line endings.
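In Python's re module the difference looks like this (a sketch with a made-up two-line string):

```python
import re

text = "first line\nsecond line"

# By default '.' stops at the line break.
default_match = re.match(r".*", text).group(0)

# With the DOTALL flag '.' also matches the line break.
dotall_match = re.match(r".*", text, re.DOTALL).group(0)
```

default_match is only the first line, while dotall_match is the whole string.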

Quote:

so, basically, the expression

Code:

.*

would match any single line you can think of. That's less useful than it may seem.)

Why is it less useful? I suggest you don't confuse things here. Show why it can be less useful when the problem arises.

Quote:

Instead, try using

Code:

<p.*?>

which makes the quantifier "*" non-greedy.

You might want to explain again the difference between greedy and non-greedy, because the concept is so important. To paraphrase LOLcats and icanhascheezburger: Greedy: I eatz all the cheezburgers, savin none fur yu. Non-greedy: I eatz one cheezburger, savin the restz fur yu.
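Or, for readers who don't speak LOLcat, a Python sketch using the same string as the tutorial:

```python
import re

text = '<p class="calibre2">Title here</p>'

# Greedy: '.*' grabs as much as it can, so the match runs to the LAST '>'.
greedy = re.match(r"<p.*>", text).group(0)

# Non-greedy: '.*?' grabs as little as it can, so the match stops at the FIRST '>'.
non_greedy = re.match(r"<p.*?>", text).group(0)
```

greedy is the whole string; non_greedy is just the opening tag.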

Quote:

...
...
The book you're converting has "Title" written on every odd page and "Author" written on every even page. Looks great in print, right? But in ebooks, it's annoying. You can group whole expressions in normal parentheses, and the character "|" will let you match either the expression to its right or the one to its left.

See, now we have ambiguity in the term 'group'

Quote:

...Now we make things simpler by using the pipe ("|" is called the pipe character): If you use the expression

Code:

(Title|Author)

you'll either get a match for "Title" (on the odd pages) or you'd match "Author" (on the even pages). Well, wasn't that easy?

It is called the pipe on *nix systems. Nowhere else.

You might consider calling it a 'vertical bar', or 'bar' for short. I also might put the 'or' above in caps, or bold, or something to draw attention to what the bar is doing.
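A Python sketch of the bar doing its OR, and of why the parentheses matter. The page lines and the 'Header: ' prefix are invented for the example.

```python
import re

page_lines = ["Title", "It was a dark and stormy night.", "Author"]

# Either 'Title' OR 'Author' counts as a running header.
header = r"(Title|Author)"
body = [line for line in page_lines if not re.fullmatch(header, line)]

# The bar splits everything inside its group, so the parentheses set its reach.
with_group = re.fullmatch(r"Header: (Title|Author)", "Header: Author")

# Without parentheses this reads as 'Header: Title' OR 'Author'.
without_group = re.fullmatch(r"Header: Title|Author", "Header: Author")
```

body keeps only the line of real text. with_group matches, but without_group is None.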

Quote:

...
and while you're at it, rememper to escape special characters,

s/rememper/remember/

Quote:

...
Be careful if your conversion source has tags like this example:

Code:

"Maybe, but the cops feel like you do, Anita. What's one more dead vampire? New laws don't change that." </p><p class="calibre4">
<b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4">
It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was

(Shamelessly ripped out of this thread.) You'd have to remove some of the tags as well.

At this point you are starting a tutorial on HTML. Do you really want to do that? Perhaps you do...

Quote:

...
Also note that Calibre tries to repair damaged code after doing the header/footer removal.
Secondly, ....

The 'Firstly' was a long time back. You might want to make the transition more explicit, such as 'Now let's look at another use of regexps in calibre' or some such.

Quote:

you can use regular expressions to extract metadata from filenames. You can find this feature in the "Adding books" part of the settings.

Assuming you aren't bored with writing, a few examples would be good here. Then the next paragraph would be better situated.
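Something along these lines, perhaps. This is plain Python, not calibre's own import code. The filename is hypothetical and has its extension already removed, and the 'author' field name is my assumption; only (?P<title>) appears in your text.

```python
import re

# Named groups: (?P<name>...) labels the part of the string it matches.
# 'title' is the field from the tutorial; 'author' is an assumed field name.
pattern = r"(?P<title>.+) - (?P<author>.+)"

filename = "Dune - Frank Herbert"  # hypothetical, extension already removed

match = re.fullmatch(pattern, filename)
fields = match.groupdict()
```

fields then maps each group name to the text it captured.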

Quote:

There's a special feature here: You can use field names for metadata fields, for example (?P<title>) would indicate that calibre uses this part of the string as book title.
...
Edit: added greedy quantifiers, some useful escape sequences, string groups, warning at the end. Still to come: some more practical examples.

My opinion: what you are doing here is very good. You are situating a complicated topic within the environment it is used, explaining the parts that cover the vast majority of the cases, and are doing so in a colloquial style. Good stuff.