Thread: What a regex is
Old 05-05-2010, 01:25 AM   #1
Worldwalker
Curmudgeon
What a regex is

We talk about regexes a lot, and I realized a while ago that some people are completely in the dark as to what we're talking about. In my response to our most recent troll, I had written a brief layman's explanation, but it was all for naught when Kovid rightfully closed the thread -- with an epic smackdown to the troll, to boot! While neither the thread nor the troll is any loss, I figured I ought to salvage my regex explanation. Full disclosure, by the way: I suck at writing regexes. Big ones scare me. But some basic familiarity with the concept is a good idea.

Quote:
Successful software does not require users to learn a programming language, which is basically what RegEx is.
So said the troll, and as with everything else he said, he was dead wrong.

There is no programming language called "RegEx". The term "regex" (in various forms of capitalization) is an abbreviation for the phrase "regular expression", which is a formal way of defining a pattern to be matched by whatever program is processing it.

Here's a human example: Imagine you have to look through a page full of data and find all of the dates that are mixed in with, I dunno, locations, sample numbers, whatever. You are told that the dates are always listed as dd/mm/yyyy. So you read through the great wall o'text, and every time you find something that fits the pattern, you mark it. In our little example, you would be the computer, and dd/mm/yyyy would be the regex.

Regexes don't really look like that, of course, but that's really all they are: patterns that a computer program matches against whatever is being examined. Here's a simple one (don't worry, it's not as scary as it looks): \d{5}(-\d{4})? That matches US ZIP codes in either 5-digit or 9-digit format. It looks like gibberish, but what it says is: five digits, then a hyphen and four more digits, with that last part optional. \d means "any digit from 0 to 9". {5} means "5 of whatever that last bit was" -- in this case, digits. Putting something in parentheses groups whatever is in the parentheses, just like in math. So if I tell you that - is just a literal hyphen, you can probably figure out what (-\d{4}) means.
Spoiler:
"a hyphen, followed by 4 digits."
And the final ? means that whatever precedes it (the expression in parentheses) is optional.
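If you'd like to see a program actually do the matching, here is a minimal sketch in Python using its standard re module. The sample text, the find_all helper, and the variable names are all made up for illustration, and the date pattern is just my guess at how the dd/mm/yyyy example from earlier could be written with the same pieces. Other tools may treat some of these symbols slightly differently.

```python
import re

# The pattern from the post: 5 digits, then an optional hyphen plus 4 digits.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")

# An assumed way to write the dd/mm/yyyy example using the same building blocks:
# 2 digits, a slash, 2 digits, a slash, 4 digits.
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")


def find_all(pattern, text):
    """Return every piece of text that fits the pattern, in order of appearance."""
    return [match.group(0) for match in pattern.finditer(text)]


# Hypothetical "wall o'text" with dates and ZIP codes mixed in.
sample = (
    "Sample 17 taken 05/05/2010 in Beverly Hills, CA 90210; "
    "shipped 12/06/2010 to Schenectady, NY 12345-6789."
)

zips = find_all(ZIP_PATTERN, sample)
dates = find_all(DATE_PATTERN, sample)

print(zips)   # ['90210', '12345-6789']
print(dates)  # ['05/05/2010', '12/06/2010']
```

One caveat: as written, the ZIP pattern only checks for the shape. It will happily match the first five digits of a longer number such as 1234567, and the date pattern does not check that the day or month is a real one. That is usually fine for finding candidates, but it is worth knowing.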

Mind you, regexes can get far more complicated than that. But no matter how convoluted the pattern gets, it's still a pattern, not a programming language. Just a pattern that a program tries to match to data. Writing one from scratch can be tricky, but thankfully the average person (even the average programmer) rarely has to. There are places like RegExLib to help out, including their nifty tester. When it comes to Calibre, the forums are full of masters of Regex-Fu. No, I'm not one of them, but maybe one of them will drop in and expand on my very brief explanation, especially on how regexes are used in Calibre.