Register Guidelines E-Books Search Today's Posts Mark Forums Read

 MobileRead Forums Regular expressions, Calibre and you- an introduction (Archived)

09-21-2010, 05:02 AM   #16
Manichean
Wizard

Posts: 3,130
Karma: 80584
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
 Originally Posted by chaley You can use flags, but you must use embedded syntax.
Oh, that works? I feel stupid now- somehow, that never occured to me.

Quote:
 Originally Posted by chaley [...]
That's stretching my knowledge of theoretical computer sciences... I've heard lectures on program verification, algorithms and data structures, but that's about it. I'll have to think about what you wrote there.

09-21-2010, 06:35 AM   #17
ldolse
Wizard

Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
 Originally Posted by Manichean Oh, that works? I feel stupid now- somehow, that never occured to me.
You shouldn't feel stupid about not knowing about embedded syntax for flags/qualifiers - the amount of google searching I had to do to find the docs on this was ridiculous. They can be found, but only if you've already figured out the exact descriptive qualifiers used in the python documentation. It's not in the commonly linked python regex doc pages, which talk about using re.compile to pass flags, and re.compile isn't available to a normal Calibre end user.

That, and all the flags are specific to python, and for the most part your tutorial has been about generic regex behavior. Probably best to stay generic, though that may be debatable...

 09-21-2010, 10:50 PM #18 Scott Nielsen Groupie   Posts: 154 Karma: 112134 Join Date: May 2009 Location: Kuala Lumpur Device: iPad, K3, K4, T1 This thread has been very enlightening. My thanks to OP and everybody who contributed further; I've learned a lot.
 09-21-2010, 10:57 PM #19 TonytheBookworm Addict     Posts: 264 Karma: 62 Join Date: May 2010 Device: kindle 2, kindle 3, Kindle fire Reg expressions have always been a little complex for me to understand. Great tutorial by the way. One thing that I have found that works really well and I use it with a lot of the recipes i write for calibre is this site here. http://www.txt2re.com/index.php3 you can type the string that you want to search for. then hit the language which in our case would be python and it spits out the code to find the regex.
 09-22-2010, 04:08 AM #20 Manichean Wizard     Posts: 3,130 Karma: 80584 Join Date: Feb 2008 Location: Germany Device: Cybook Gen3 Thanks to both of you. As for that regexp generator, as far as I see, you'd have to manually paste the expression together from the single elements it generates in code, yes? But still, that might be a handy tool.
 09-23-2010, 08:31 AM #21 Manichean Wizard     Posts: 3,130 Karma: 80584 Join Date: Feb 2008 Location: Germany Device: Cybook Gen3 Edited again: for style, I tried to clarify distinction between use of parentheses and square brackets (groups vs. sets), added notes on strings in general, added some examples, included some flags. Still not sure what to do about that part about special characters and escape sequences. Chaley, I generally agree that it's out of place where it is now, but I couldn't figure out where else to put it. I think I'm gonna go ahead and rewrite it into the respective portions, but that's for another pending edit. Also, still to come: even more examples. (Things I learned while writing this edit: Calibre likes file extensions in the test field for importing file expressions, and Notepad++ has an entirely different understanding for what \s means than Python does. Next time, I'm going to use Python to test examples right away.)
09-23-2010, 08:44 AM   #22
DoctorOhh
US Navy, Retired

Posts: 9,038
Karma: 12768811
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
 Originally Posted by Manichean Notepad++ has an entirely different understanding for what \s means than Python does. Next time, I'm going to use Python to test examples right away.)
I'm curious, did you set the language in Notepad++ to Python?

09-23-2010, 09:13 AM   #23
Manichean
Wizard

Posts: 3,130
Karma: 80584
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
 Originally Posted by dwanthny I'm curious, did you set the language in Notepad++ to Python?
I'm talking about using the search feature with regular expressions, which the programs help notes to use the Scintilla regexp engine and being fixed to a per-line match. Does changing the language change this behaviour?

09-23-2010, 09:24 AM   #24
Manichean
Wizard

Posts: 3,130
Karma: 80584
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
 Originally Posted by chaley The general case cannot be solved with regular expressions because REs don't have the notion of 'stack'. Said another way, and getting a bit formal, all REs by definition can be translated into a deterministic finite state machine. The important part here is that the number of states is known from the RE, and is fixed for all utterances (text to be matched). Parsing utterances in a palindromic language requires a state for each letter up to the center point so the machine can match the right letter after the center point. Such a machine requires len(utterance)/2 states. Thus the number of states is unbounded, meaning that the grammar for the language cannot be described using an RE.
So, if I understand you correctly, the regular expression would be something like the starting state, each matching string would be it's own (end?) state, and the matching operation would provide transitions between the end states, yes? And because the number of states a regular expression can describe is finite, and, in the general case, len(utterance) is unknown, presumed to be in the range [0,\infty], a palindrome could not be described using a regular expression? If that's correct, I don't think I understand how the * and + quantifiers work, because they, as far as I understand, describe an utterance with len(utterance) \in [0,\infty] and len(utterance) \in [1,\infty], respectively.
(Sorry, I lapsed into some LaTeX there, but, seeing your background, I assume you understand.)

09-23-2010, 10:42 AM   #25
megachirops
Enthusiast

Posts: 31
Karma: 12
Join Date: Mar 2010
Device: Kindle 2, Kindle 3
Quote:
 Originally Posted by Manichean Notepad++ has an entirely different understanding for what \s means than Python does. Next time, I'm going to use Python to test examples right away.)
An online regex tester that I've had good luck with:
http://www.pythonregex.com/

09-23-2010, 11:10 AM   #26
chaley
"chaley", not "charley"

Posts: 7,569
Karma: 1971460
Join Date: Jan 2010
Location: France
Device: Many android devices
Quote:
 Originally Posted by Manichean So, if I understand you correctly, the regular expression would be something like the starting state, each matching string would be it's own (end?) state, and the matching operation would provide transitions between the end states, yes?
I don't think this is quite right.

First, some words on finite state machines (FSMs). An FSM consists of a set of states, connected by arcs that stand for events. Lets say that these events 'label' the arc. In the kind of machine we are talking about (deterministic), an event can appear on at most one arc out of a given state. Traversing an arc consumes the event. Events that are labels for an arc leaving a particular state cannot be accepted by that state, and their occurrence is an error. The machine is constructed in advance from the pattern, and cannot change depending upon input.

Thus the regular expression 'abc' would look something like

start __a__ state1 __b__ state2 __c__ accept

The regular expression for 'ab*c' would look something like

Code:

start __a__ state1 __b__ state2 __c__ accept
|_b_|

Quote:
 And because the number of states a regular expression can describe is finite,
The number of states an FSM can *contain* is finite. The number of states an FSM can *describe* is infinite.
Quote:
 and, in the general case, len(utterance) is unknown, presumed to be in the range [0,\infty], a palindrome could not be described using a regular expression?
Yes. But as we will see below, it is not the length that is the problem, it is the matching of the characters.
Quote:
 If that's correct, I don't think I understand how the * and + quantifiers work, because they, as far as I understand, describe an utterance with len(utterance) \in [0,\infty] and len(utterance) \in [1,\infty], respectively.
Yes, they do. But they have no memory. There is no way to match in a general fashion (using one machine) the strings abba, abcba, and abbbba, all of which are palindromes. The problem is that if we don't know what to do until we get to the end and know what 'len' is. Once we know 'len', we can construct the right machine for that utterance. Alternatively, we can a stack, but that would most definitely not be an FSM, because the label on the arc would only be known during execution. In the first case, we can accept only 'a's after the second 'b', but in the second case we can accept a 'c' and in the third case another 'b'.

Of course, you can write a regexp if you know *all* the utterances that you must match. For example, the regexp 'abba|abcba|abbbba' works fine.

You might be interested in the paper at this link. It gets mathematical at times (lots of times), but it discusses how one translates REs used in searches to FSMs to really do the work.

Last edited by chaley; 09-23-2010 at 11:29 AM.

09-23-2010, 11:12 AM   #27
Manichean
Wizard

Posts: 3,130
Karma: 80584
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
 Originally Posted by megachirops An online regex tester that I've had good luck with: http://www.pythonregex.com/
I do have Python installed, it's just that, since I was editing the post in Notepad++, I was too lazy to open another windows. I learned from that.

09-23-2010, 11:27 AM   #28
Manichean
Wizard

Posts: 3,130
Karma: 80584
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
 Originally Posted by chaley Of course, you can write a regexp if you know *all* the utterances that you must match. For example, the regexp 'abba|abcba|abbbba' works fine.
I see what you did there
After reading your post, I believe my difficulties lie in understanding FSMs. Reading the introduction to the relevant Wikipedia entry obviously doesn't give an understanding of the subject... I'm going to read up on this, as this is relevant to my (private) interests, but it's going to have to be on the backburner for some while due to some Real Life interference.

Quote:
 Originally Posted by chaley You might be interested in the paper at this linkthis link.It gets mathematical at times (lots of times), but it discusses how one translates REs used in searches to FSMs to really do the work.
Thanks for the link. I think, once I refresh my memory on some of the things I learned in computer sciences, that I should be able to understand that.

09-23-2010, 11:54 AM   #29
Starson17
Wizard

Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
 Originally Posted by Manichean Thanks for the link. I think, once I refresh my memory on some of the things I learned in computer sciences, that I should be able to understand that.
I've never had a CS course (or even a programming course), but I'd like to say thanks for the link, too. It was an interesting read. As far as I'm concerned, Charles can set <professorial mode = "ON"> at any time. I always learn something new.

As to the problem, I'm always up for a challenge, particularly one that tells me I "will fail." I'll quote the problem statement from the professor:
Quote:
 For fun, try to write a regular expression that matches any palindrome. (http://en.wikipedia.org/wiki/Palindrome. Examples: abcdedcba or 'madam im adam' with spaces ignored.) You will fail.
Code:
.*
I believe it succeeds at matching "any palindrome."

09-23-2010, 11:55 AM   #30
Manichean
Wizard

Posts: 3,130
Karma: 80584
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
 Originally Posted by Starson17 The first rule of taking a test, even one for fun: read the problem statement very carefully. How about this regular expression for an answer: Code: .* I believe it succeeds at matching "any palindrome."
Yes, you're right. Hey, that means that all those theories chaley posted about are wrong! Quick, let's publish a paper!

 Tags regexp calibre tutorial