MobileRead Forums - View Single Post - Regular expressions, Calibre and you- an introduction (Archived)

chaley · 09-21-2010, 04:39 AM

Quote:

Originally Posted by Manichean

I know about flags, but as far as I know, Calibre doesn't allow for them to be used, am I right?

You can use flags, but you must use embedded syntax. From the python docs:

Code:

(?iLmsux)

    (One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.

Two things to note:
1) ignore case is turned on by default, and therefore cannot be turned off.
2) (in python) the flags affect the entire expression, even if they occur later in the expression.
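
To make the embedded syntax concrete, here is a minimal sketch in plain Python (not calibre itself, so the ignore-case default from note 1 is not modeled). It compares an embedded (?s) with passing re.S to re.compile(). The flag is kept at the very start of the pattern because recent Python versions (3.11 and later, as far as I know) raise an error for global flags placed later in the expression. The variable names are only for illustration.

```python
import re

# Embedded form: the flag is part of the pattern text itself.
embedded = re.compile(r"(?s)a.b")
# Equivalent form: the flag is passed as an argument to re.compile().
with_argument = re.compile(r"a.b", re.S)
# No flag at all: '.' will not match a newline.
no_flag = re.compile(r"a.b")

text = "a\nb"

print(bool(embedded.search(text)))       # the dot crosses the newline
print(bool(with_argument.search(text)))  # same behaviour as the embedded flag
print(bool(no_flag.search(text)))        # no match without DOTALL

# The compiled pattern reports the embedded flag like any other flag.
print(bool(embedded.flags & re.S))
```

This matters in a tool that only gives you a text box for the expression: the embedded form is the only place left to put the flag.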

So, to use DOTALL to match tags split across lines (<a> tags are famous for this), I would do something like '(?s)<.*?>'. re.M is also incredibly useful, because it allows you to use anchored expressions that match in the middle of the document. For example, '(?ims)^<a.*?\/a>$' would match hyperlink tags that start at the beginning of a line and end at the end of a line, but perhaps contain line endings. The s is needed there because without it the dot stops at a line ending.
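
A small sketch of both flags in plain Python, using made-up HTML snippets (the strings and variable names are assumptions for illustration only). It shows a tag split across two lines being found only with (?s), an anchored expression matching in the middle of the text only with (?m), and a link that itself contains a line ending, which needs the s flag as well because the dot does not match a newline otherwise.

```python
import re

# Hypothetical input: the first <a ...> tag is split across two lines.
html = '<p>See <a\nhref="x.html">this</a>.</p>\n<a href="y.html">that</a>\n'

# Without DOTALL the split tag is skipped, because '.' stops at the newline.
tags_plain = re.findall(r"<.*?>", html)
# With (?s) the split tag is matched as a whole.
tags_dotall = re.findall(r"(?s)<.*?>", html)

# Without re.M, '^' only matches at the very start of the text.
anchored_plain = re.findall(r"(?i)^<a.*?/a>$", html)
# With (?m), '^' and '$' also match at the start and end of every line.
# (The '\/' in the post is the same as a plain '/' in Python.)
anchored_multiline = re.findall(r"(?im)^<a.*?/a>$", html)

# A link whose text contains a line ending needs both m and s.
split_link = 'intro\n<a href="z.html">split\nlink</a>\noutro'
without_s = re.search(r"(?im)^<a.*?/a>$", split_link)
with_s = re.search(r"(?ims)^<a.*?/a>$", split_link)

print(tags_plain)
print(tags_dotall)
print(anchored_multiline)
print(without_s, with_s.group(0) if with_s else None)
```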

Quote:

The thought here is that, judging from the posts we saw concerning the use of regexpes, at least some of the people wanting to use them have never seen HTML or anything similar. I wanted to explain what can be removed without going into any detail. I haven't decided whether to remove or rewrite this part, seeing how Calibre tries to correct broken syntax.

I think your reasoning to keep it is correct. This is indeed what people ask about. And in any event, it is better not to break the syntax than to hope calibre fixes it up correctly.

Quote:

By the way, concerning your comment on palindromes a while back: I think I see what you mean. I believe I've figured out how to match any palindrome of a given length not containing whitespaces (as in I couldn't match "madam im adam"), but that's about as far as I got.

Yeah, known-length palindromes are easy, because you can use group backreferences. Dealing with the spaces is a pain, yes, but it can be done by consuming all spaces outside the grouping parentheses.
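
Here is one way the known-length case could be sketched in Python. The helper name palindrome_pattern is made up for this example. It captures each letter of the first half in its own group, mirrors the groups with backreferences, and puts \s* between every piece so that spaces are consumed outside the grouping parentheses. Note that \w also accepts digits and underscores, and that the pattern has to be rebuilt for every length you care about.

```python
import re

def palindrome_pattern(length):
    """Build a regex for palindromes with exactly `length` word characters,
    ignoring any whitespace between them. (Hypothetical helper.)"""
    half = length // 2
    parts = [r"(\w)"] * half                 # one capturing group per letter of the first half
    if length % 2:
        parts.append(r"\w")                  # middle letter of an odd-length palindrome
    for group in range(half, 0, -1):         # mirror the groups in reverse order
        parts.append("\\" + str(group))      # backreference such as \2, \1
    # Whitespace is matched between the pieces, never inside a group.
    return re.compile(r"(?i)^\s*" + r"\s*".join(parts) + r"\s*$")

eleven = palindrome_pattern(11)              # "madam im adam" has 11 letters
print(eleven.pattern)
print(bool(eleven.match("madam im adam")))
print(bool(eleven.match("madam im adax")))
```

The pattern grows by one group and one backreference for every two letters, which is a hint at why the general case is out of reach.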

<professorial_mode>
The general case cannot be solved with regular expressions because REs don't have the notion of 'stack'. Said another way, and getting a bit formal, all REs by definition can be translated into a deterministic finite state machine. The important part here is that the number of states is known from the RE, and is fixed for all utterances (text to be matched). Parsing utterances in a palindromic language requires a state for each letter up to the center point so the machine can match the right letter after the center point. Such a machine requires len(utterance)/2 states. Thus the number of states is unbounded, meaning that the grammar for the language cannot be described using an RE.
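
For contrast, a minimal sketch of what the general case needs once you step outside REs: a stack that remembers every letter up to the center point. The function name is_palindrome is an assumption for this example, and it ignores case, spaces and punctuation as a convenience. The thing to notice is that the stack grows with the input, which is exactly the unbounded memory a fixed state machine does not have.

```python
def is_palindrome(text):
    """Check a palindrome of any length by remembering the first half on a stack."""
    letters = [ch.lower() for ch in text if ch.isalnum()]
    half = len(letters) // 2
    stack = letters[:half]                       # one remembered entry per letter up to the center
    # Walk the second half; the middle letter of an odd-length input is skipped.
    for ch in letters[len(letters) - half:]:
        if ch != stack.pop():                    # must mirror the most recently remembered letter
            return False
    return True

print(is_palindrome("madam im adam"))
print(is_palindrome("A man, a plan, a canal: Panama"))
print(is_palindrome("calibre"))
```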

Because of the above problem, compilers usually use multiple grammars. One describes the input alphabet (identifiers etc.) and symbols, and can often be an RE. Another describes the order of symbols, and is almost always a non-regular context-free grammar. Sometimes there are more grammars for certain constructs or for the optimizer.
</professorial_mode>
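
To illustrate the split between the two grammars, here is a toy sketch, not how any particular compiler does it. An RE describes the symbols (numbers, '+', parentheses), and a small recursive-descent parser handles their order, including arbitrarily deep nesting of parentheses, which an RE alone cannot check. All names (TOKEN, parse_expr, parse_term, evaluate) are invented for the example, and characters the RE does not recognise are silently skipped to keep it short.

```python
import re

# Regular part: the symbols of the language, described by an RE.
TOKEN = re.compile(r"\d+|[+()]")

def parse_expr(tokens, pos=0):
    """expr := term ('+' term)*"""
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        value += right
    return value, pos

def parse_term(tokens, pos):
    """term := NUMBER | '(' expr ')'   (the recursion here is the non-regular part)"""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        value, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("expected ')'")
        return value, pos + 1
    if tok.isdigit():
        return int(tok), pos + 1
    raise ValueError(f"unexpected token {tok!r}")

def evaluate(source):
    """Tokenize with the RE, then parse and evaluate with the context-free grammar."""
    tokens = TOKEN.findall(source)
    value, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return value

print(evaluate("1 + (2 + (3 + 4))"))
```

The call stack of the recursive functions plays the role of the 'stack' that REs lack: each open parenthesis is remembered until its partner shows up.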