05-05-2010, 01:25 AM | #1 | |
Curmudgeon
Posts: 3,085
Karma: 722357
Join Date: Feb 2010
Device: PRS-505
|
What a regex is
We talk about regexes a lot, and I realized a while ago that some people are completely in the dark as to what we're talking about. In my response to our most recent troll, I had written a brief layman's explanation, but it was all for naught when Kovid rightfully closed the thread -- with an epic smackdown to the troll, to boot! While neither the thread nor the troll is any loss, I figured I ought to salvage my regex explanation. Full disclosure, by the way: I suck at writing regexes. Big ones scare me. But some basic familiarity with the concept is a good idea.
Quote:
There is no programming language called "RegEx". The term "regex" (in various forms of capitalization) is an abbreviation for the phrase "regular expression", which is a formal way of defining a pattern to be matched by whatever programming language is processing it.
Here's a human example: Imagine you have to look through a page full of data and find all of the dates that are mixed in with, I dunno, locations, sample numbers, whatever. You are told that the dates are always listed as dd/mm/yyyy. So you read through the great wall o'text, and every time you find something that fits the pattern, you mark it. In our little example, you would be the computer, and dd/mm/yyyy would be the regex.
Regexes don't really look like that, of course, but that's really all they are: patterns that a computer program matches against whatever is being examined. Here's a simple one (don't worry, it's not as scary as it looks):
\d{5}(-\d{4})?
That matches US postal codes in either 5-digit or 9-digit format. It looks like gibberish, but what it says is 5 of any digit, then a hyphen and 4 of any digit, with the last part optional.
\d means "any digit from 0 to 9".
{5} means "5 of whatever that last bit was" -- in this case, digits.
Putting something in parentheses groups whatever is in the parentheses, just like in math. So if I tell you that - is just a literal hyphen, you can probably figure out what (-\d{4}) means.
Spoiler: a hyphen followed by 4 digits.
And the final ? means that whatever precedes it (the expression in parentheses) is optional.
Mind you, regexes can get far more complicated than that. But no matter how convoluted the pattern gets, it's still a pattern, not a programming language. Just a pattern that a program tries to match to data. Writing one from scratch can be tricky, but thankfully the average person (even the average programmer) rarely has to. There are places like RegExLib to help out, including their nifty tester. When it comes to Calibre, the forums are full of masters of Regex-Fu. No, I'm not one of them, but maybe one of them will drop in and expand on my very brief explanation, especially as they relate to Calibre. |
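For anyone who wants to poke at the postal-code pattern from the quote above, here is a minimal sketch in Python (the language calibre itself is written in). The function names `is_zip_code` and `find_zip_codes` are made up for this illustration; only the pattern comes from the post.

```python
import re

# The pattern from the post: 5 digits, optionally followed by a hyphen and 4 digits.
ZIP_CODE = re.compile(r"\d{5}(-\d{4})?")


def is_zip_code(text: str) -> bool:
    """Return True if the whole string is a 5-digit or 9-digit code."""
    return ZIP_CODE.fullmatch(text) is not None


def find_zip_codes(text: str) -> list[str]:
    """Return every substring of text that fits the pattern."""
    return [m.group(0) for m in ZIP_CODE.finditer(text)]


print(is_zip_code("90210"))
print(is_zip_code("12345-6789"))
print(is_zip_code("1234"))
print(find_zip_codes("Send it to 90210 or 12345-6789."))
```

Note that `find_zip_codes` will also pick five digits out of a longer run of digits; a real search would probably want word boundaries around the pattern.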
|
05-05-2010, 04:32 AM | #2 |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
Actually, the troll wasn't as off as you think: Regular expressions are a programming language, although not a "traditional" programming language (but they are reasonably close to, say, SQL). Internally, they are usually compiled (like traditional programming languages) into more or less native code which performs the matching.
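The compile step mentioned here is visible in Python's standard `re` module, which turns a pattern string into a pattern object before matching. The sketch below is only an illustration of that step: `compile_user_pattern` is a hypothetical helper, and as far as I know Python's engine compiles to an internal form rather than to native machine code.

```python
import re


def compile_user_pattern(text: str):
    """Compile a pattern typed in at runtime; return None if it is not a valid regex."""
    try:
        return re.compile(text)
    except re.error:
        return None


# The string could just as well come from an edit box or a config file.
typed_by_user = r"\d{5}(-\d{4})?"
compiled = compile_user_pattern(typed_by_user)

print(compiled.pattern)                    # the original text is kept on the object
print(bool(compiled.fullmatch("90210")))   # the compiled object does the matching
print(compile_user_pattern("(unclosed"))   # invalid input is rejected at compile time
```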
|
05-05-2010, 07:57 AM | #3 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
I am glad he closed it ... and I kind of regret my post, even though I did try really hard to be polite. He can always start another thread if he has more to say.
As to regexes, at some point I think we need either a sticky of good regexes, or maybe, once the regex history function is added, we can put a good basic selection of regexes directly into the source, so everyone starts with a good set of flexible regexes they can try to use. |
05-05-2010, 09:30 AM | #4 |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
I think you are missing the point. There are reasons why we don't see huge lists of useful regexps, but those reasons are different from what you seem to think (pardon me if I am wrong). It's not difficult to write a good regexp, and it's not difficult to put together a huge list. The problem is, each regexp is suited for one particular situation, and with e-books, the typical case is that this situation will not repeat very often, and even if it does, it requires a fair knowledge of regexps to be able to pick the right one for each situation. I mean, people can write you some excellent regexps (in fact, they did; there used to be a thread here on MobileRead, called "Tyrannosaurus regexp", I believe, which contained quite a few fine examples), but unless you are good enough with regexps, you won't be able to recognize which regexp is useful for your particular situation. And if you are good enough, it is usually easier to simply write a new regexp than to sift through ten or twenty or a hundred pre-made regexps...
|
05-05-2010, 09:41 AM | #5 |
Wizard
Posts: 4,553
Karma: 950151
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
|
I am familiar with regex, but not an expert. As a result I find tools like the JGSoft RegExBuddy and RegexMagic to be of great use and have been using them for many years. I do not know if there are free equivalents to these tools around. For me, the money spent on them has more than paid for itself.
I must admit a Calibre facility like the Save Search one recently added where one could save regex expressions and give them a name would be very useful. However with regex used in so many places throughout Calibre enabling them all to use saved and named regex expressions is likely to be a non-trivial task. |
05-05-2010, 10:02 AM | #6 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Agreed. I have not done nearly as much converting as I have importing, so the other uses for regexes have not made as much of an impression on me.
|
05-05-2010, 10:24 AM | #7 |
creator of calibre
Posts: 44,337
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Actually, adding history to the regex box is on my TODO list as I have indicated on the tickets. The way I see it working is like this:
Have (name, regex) pairs, where name is something descriptive like Title - author - series - series_index.
Have a few predefined regexes with useful names like that.
Allow the user to define new (name, regex) pairs, either by modifying the builtin ones or by writing them from scratch. |
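To make the (name, regex) idea concrete, here is a rough sketch of how such a store could behave. It is not calibre code: `RegexStore`, `BUILTIN_REGEXES`, and the two sample patterns are all invented for this illustration.

```python
import re

# Hypothetical predefined entries; names and patterns are illustrative only.
BUILTIN_REGEXES = {
    "Title - author": r"(?P<title>.+) - (?P<author>[^-]+)",
    "Author - title": r"(?P<author>[^-]+) - (?P<title>.+)",
}


class RegexStore:
    """Named regexes: builtin defaults plus user-defined or user-modified entries."""

    def __init__(self):
        self._user = {}

    def define(self, name, pattern):
        """Add a new pair, or shadow a builtin one with a modified pattern."""
        re.compile(pattern)  # raises re.error if the pattern is not valid
        self._user[name] = pattern

    def get(self, name):
        """User entries take precedence over builtin ones; None if unknown."""
        return self._user.get(name, BUILTIN_REGEXES.get(name))

    def names(self):
        return sorted(set(BUILTIN_REGEXES) | set(self._user))


store = RegexStore()
store.define("Author - title", r"(?P<author>.+?) - (?P<title>.+)")  # modified builtin
match = re.fullmatch(store.get("Title - author"), "Dune - Frank Herbert")
print(store.names())
print(match.group("title"), "/", match.group("author"))
```

Keeping the builtin entries separate from the user's means a modified entry can always be reset to its default.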
05-05-2010, 11:38 AM | #8 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Another consideration might be: to make the stored combination a triplet: (name, regex, sample filename to test) so that it stores a sample filename, too. The regex "name" could be the sample filename, except that the string you are testing has to have a number in the series_index and an extension (.txt, etc.) for it to work. I find myself having to constantly copy and paste in a sample filename, even though my regex is very flexible and probably handles it. I need to test even though 90% of the time I don't actually have to change the regex and don't need the history function. (At this point, I almost need a history of tested filenames more than a history of regexes.) Just remembering the last few tested filename strings would make testing against the regex faster and easier. Note: on thinking about the idea of triplets and stored filename strings, it might be confusing if a filename string was stored against a regex that didn't properly decode it. Still, I'd prefer that, and make sure I didn't store "wrong filenames" against a regex, but I'd settle for a simple history of sample filenames to test against that I could call up and quickly edit to test and verify the regex. |
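The triplet idea, including the worry about a sample filename stored against a regex that cannot decode it, might look something like the sketch below. `SavedRegex`, `store_if_consistent`, and the sample pattern are hypothetical names made up here, not anything that exists in calibre.

```python
import re
from typing import NamedTuple, Optional


class SavedRegex(NamedTuple):
    name: str
    pattern: str
    sample: str  # a filename the pattern is expected to decode

    def decode_sample(self) -> Optional[dict]:
        """Run the pattern against its own sample; None means it does not match."""
        m = re.fullmatch(self.pattern, self.sample)
        return m.groupdict() if m else None


def store_if_consistent(history: list, entry: SavedRegex) -> bool:
    """Only keep triplets whose sample filename is actually decoded by the regex."""
    if entry.decode_sample() is None:
        return False
    history.append(entry)
    return True


PATTERN = (r"(?P<title>.+) - (?P<author>.+) - (?P<series>.+)"
           r" - (?P<series_index>\d+)\.(?P<ext>\w+)")
history = []
good = SavedRegex("Title - author - series - series_index", PATTERN,
                  "Dune - Frank Herbert - Dune Chronicles - 1.txt")
bad = SavedRegex("Title - author - series - series_index", PATTERN, "Dune.txt")

print(store_if_consistent(history, good), good.decode_sample())
print(store_if_consistent(history, bad))
```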
|
05-05-2010, 11:55 AM | #9 | |
Well trained by Cats
Posts: 30,372
Karma: 58053698
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
|
|
05-05-2010, 12:05 PM | #10 |
creator of calibre
Posts: 44,337
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
05-05-2010, 12:23 PM | #11 |
Curmudgeon
Posts: 3,085
Karma: 722357
Join Date: Feb 2010
Device: PRS-505
|
I think you have to stretch the definition of "programming language" a lot before you can fit regexes into it. And while admittedly I don't know what goes on under the hood in the average compiler, how can something entered at runtime be compiled? They sure look like data to me. We could probably argue about it all day and not come to an agreement. However, I think we can agree that they aren't what the ordinary person thinks of as a programming language -- they're just a form of representing data. So the troll was highly misleading, at best, and no doubt intentionally so.
But anyway, all annoying trolls aside: I have regexphobia. I can deal with them -- I don't have much choice, being a website developer and all -- but I'll admit, they scare me, and if humanly possible, I find an existing one and modify it as needed. I haven't had to write any for calibre yet. You'll know when I have to deal with a tricky one in calibre by the pathetic, pleading, whimpering plea for help on this forum, probably titled "Don't let the regex get me!"
So even though there can't really be universally useful regexes, I'm very much in favor of some way of providing a good selection of models that can be modified to suit individual requirements. That would make life a lot easier for a lot of people. The suggestions that have come up in this thread seem like a Very Good Thing to me.
P.S. Starson, so am I. That's why I said "rightfully". I don't regret my post, though. Some things need to be said. |
05-06-2010, 02:04 AM | #12 | ||||
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
Quote:
Quote:
Besides, when you think about it, everything that is ever compiled is "entered at runtime". A compiler takes some input (source code) and produces output (compiled code), and it does not really matter whether the compilation was started on the user's request (besides, why wouldn't "entering regexp into a field and pressing enter" be considered a "user's request"), whether the user typed the code in advance or "on-the-fly" (after all, what's the difference between "typing that regexp into that editbox a symbol at a time" and "pasting the whole regexp from clipboard") and whether the "user" is a real person or another program.
Quote:
Quote:
I understand the value of examples if they are trying to illustrate simple points (as in, "Operator ? means either one or zero of the preceding symbol. So 'https?://' will match both 'http://' and 'https://', but not 'htp://' or 'httpss://' or 'http:www'."), but you can find those in any regular expression tutorial. For actual use, you would need something a lot more complicated, and unfortunately that also means "hard to understand", and thus "hard to adapt".
In fact, some time ago I had started writing a tool which would take a collection of regexps and apply it to a file or a number of files. I even managed to get the tool to an almost-usable state, but then I started to actually use it and found out that it doesn't really help at all - I would have to adapt the regexps (and I was careful to enhance them in such a way as to make adaptation easy!) for every single file separately and in effect would do more work than simply writing the regexps from scratch every time. So I stopped the task and got a strong feeling of futility about preparing some universal regexp framework.
Of course, I use regexps a lot so it isn't difficult for me to write a new one. Less experienced people might find it beneficial as a starting point. But I still doubt it. With every regexp, you will quickly find a situation where it doesn't work as it is, and without really understanding what's going on, you won't be able to get them to work.
A rather extreme example: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html |
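The 'https?://' example from this post is easy to check with Python's `re` module. This is just a small sketch; the helper name `starts_with_scheme` is made up for the illustration.

```python
import re

# "s?" means zero or one "s" before the "://".
SCHEME = re.compile(r"https?://")


def starts_with_scheme(text: str) -> bool:
    """Return True if text begins with http:// or https://."""
    return SCHEME.match(text) is not None


for candidate in ["http://", "https://", "htp://", "httpss://", "http:www"]:
    print(candidate, starts_with_scheme(candidate))
```

Only the first two candidates should come out as matches, exactly as described in the post.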
||||
05-06-2010, 03:01 AM | #13 |
Fairly happy old fart
Posts: 10
Karma: 10
Join Date: May 2010
Location: Mexico, China, Ecuador Philippines
Device: Palm T3, iPAQ 211
|
Hello all,
Don't mean to butt in, but I have been reading for a while and finally felt ready to post. Kovid, some suggestions for you at the end. I do NOT assume my ideas will be useful to you but I created, modified and maintained databases for 13 years. Maybe there are some things you can use there. If you ever want to bounce ideas around I'd be happy to do so.
I used to work with large databases (>30 million records) years ago and we would have considered RegEx an input filter, programming extension or a language so take your pick, and though I am far from an expert in RegEx I have programmed in many languages, including SQL, like Pepak, in the past 37 years. Anyone old enough to remember DOS batch files? They were considered a form of non-compiled programming language. Seems everyone is right. Maybe even the troll. LOL Sorry, couldn't resist.
I think Starson17 hit the nail on the head with his suggestion about regexes. His was also one of the few responses in that thread that was written in a polite, reasoned and dignified manner. You got nothing to apologize for Starson17. Most of the rest were little or no better than the supposed troll, no matter how justified the people thought they were. Have you had a lot of problems with trolls here? Just curious since what I saw was someone frustrated with a program which almost did what they wanted, but who expressed themselves badly. But, I have been told I tend to think the best of people.
On soapbox. What has happened to courtesy and respect for others? I must be getting too old or lived outside the US for too long. If you disagree with someone's statements, fine, but please respond in a courteous manner. Otherwise the responder looks as ignorant, rude, stupid or crazy as the original. As my parents always said "2 wrongs do not make a right". Steps off soapbox.
I respectfully disagree with Pepak, but only partially. Sorry Pepak. I think a small list of the most commonly used expressions would be easily usable by the average user.
Agreed, a large list would be too confusing. I'm looking to import more than 8000 ebooks, and the collection I have, which has been collected from many sources, seems to have the file names in the following 7 formats.
Author - Series Series # - Title.ext
Author - Title - Series Series#.ext
Author - Title.ext
Last name, first name - Series Series # - Title.ext
Last name, first name - Title - Series Series #.ext
Last name, first name - Title.ext
These last 3 could be a b***h, but the comma delimiter should make it easy with RegEx. Some have only the title, but those can easily be group edited once in Calibre to add the author and other info, then the other metadata downloaded off the net. 7 expressions will import 99% of my collection. The others can usually, and easily, be bulk edited with free tools I found on nonags.com to conform to one of these basic file patterns, then added to Calibre. Search for multi or bulk rename files. Lots of good ones there.
A small # of simple expressions might be really, really useful to most people and save Kovid, and others, some work. Maybe a list of the ones above and a few others posted in a sticky would satisfy 90+% of the needs of the average user? It sure would save the people here, like yourselves, a lot of time.
Nice to meet you all, and I hope to make use of Calibre in the future. At present the database/library functions don't do what I need them to, mainly on the output structure when saved, and the inability to create the metadata.db without reading and saving the input files, but there is certainly hope for the future and it's better than anything else I've seen.
Kovid, it looks like you are saving all the input files in the structure you decided on but referencing them with the incrementing (xxx) as part of the {title} folder name? Wouldn't that limit the database to 1000 folders or are you allowing for more in another fashion? It also seems to continue to increment from where it left off even when records are deleted.
Do you plan on reusing the (xxx) pointers? On the bright side since you are using the (xxx) as the location pointer it should be easy to implement different output folder structures and add without saving so we can all have our cake and eat it too.
Would it be better to simply index existing folders and add the (xxx) to the folder name so Calibre could find the files? Pros and cons below. This would allow building the metadata.db without saving the files to a new location. Fast library updates. More flexibility and the ability to quickly and easily update the metadata from the file data if the metadata, as is the case in many of my files, is incorrect. You could always give us a choice to structure and save when adding or just add to the database.
The present way has the advantage of correcting metadata within the ebook files that support it and for small collections makes a lot of sense. If just adding, a user could later do a save to bring the errant files into the same folder/file structure as everything else and add/correct metadata within the ebook files and/or out to an OPF file. With a large collection this might be the better way since the "download metadata and covers" function could be run first to update the library.
Is the Author's name only one field? Bet it is. How difficult would it be to make it 2 fields so the output folders could be changed? Currently {Author}, etc gives First Last. 2 fields allows Last, First which is how most databases are organized. That would make data exchange with other programs much easier.
Please feel free to ignore or not implement any of these ideas. I can easily do them in Access or even in Excel, but have no idea how to do them in Calibre so all I can do is make suggestions.
Sorry for the long post. I just had to get it all out.
Last edited by Disfrutalavida; 05-06-2010 at 03:55 AM. Reason: Spelling and grammar. |
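The "Author - ..." and "Last name, first name - ..." filename layouts listed in this post can be handled with a short, ordered list of patterns. The sketch below is one possible way to do it in Python, not how calibre does it: the group names (`author`, `title`, `series`, `series_index`) simply mirror the naming used earlier in the thread, and `parse_filename` and `normalize_author` are hypothetical helpers. Title-only files are deliberately left unmatched, since the post suggests fixing those by hand.

```python
import re
from typing import Optional

INDEX = r"(?P<series_index>\d+(?:\.\d+)?)"
EXT = r"\.(?P<ext>\w+)"

# Tried in order: the two series layouts first, the plain layout last.
PATTERNS = [
    # Author - Series Series# - Title.ext
    re.compile(rf"(?P<author>.+?) - (?P<series>.+?) {INDEX} - (?P<title>.+){EXT}"),
    # Author - Title - Series Series#.ext
    re.compile(rf"(?P<author>.+?) - (?P<title>.+?) - (?P<series>.+?) {INDEX}{EXT}"),
    # Author - Title.ext
    re.compile(rf"(?P<author>.+?) - (?P<title>.+){EXT}"),
]


def normalize_author(author: str) -> str:
    """Turn 'Last, First' into 'First Last'; leave other forms alone."""
    if "," in author:
        last, first = (part.strip() for part in author.split(",", 1))
        return f"{first} {last}"
    return author.strip()


def parse_filename(filename: str) -> Optional[dict]:
    """Return the decoded fields, or None if no layout fits."""
    for pattern in PATTERNS:
        m = pattern.fullmatch(filename)
        if m:
            fields = m.groupdict()
            fields["author"] = normalize_author(fields["author"])
            return fields
    return None


for name in ["Asimov, Isaac - Foundation 2 - Foundation and Empire.epub",
             "Frank Herbert - Dune Messiah - Dune 2.mobi",
             "Herbert, Frank - Dune.txt",
             "Dune.txt"]:
    print(name, "->", parse_filename(name))
```

Handling the comma in `normalize_author` rather than in the patterns keeps the number of regexes at three instead of six. Titles that themselves contain " - " or end in a number will still confuse this, which is the kind of case-by-case adaptation discussed earlier in the thread.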
05-06-2010, 07:50 AM | #14 | |||||||
Grand Sorcerer
Posts: 11,939
Karma: 7219261
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Yep. And also 1620 arithmetic tables.
Quote:
Quote:
Quote:
Calibre assumes that the library is a black box, never to be looked at. Books are stored as pseudo-BLOBs, but instead of using a real BLOB the book is stored in the filesystem using a computed directory path. This is done (I think) for safety and performance, not to make the files available for processing by external tools. If one thinks of the books as BLOBs, then copying the data in and out becomes the natural thing to do. The path becomes a form of table name, and the book formats (and other things) are columns within that table. Quote:
Quote:
Quote:
It is worth noting that although author is a single field (as you guessed), the collection of authors is normalized and not stored directly in the book record. Author_sort is a denormalized form of author, stored in the book record, so that corrections can be made in sort order. Without going into my normal rant about this, consider the difference in storing and sorting Chinese names vs western-style names (or Japanese, for that matter). The complexities come close to forcing the denormalization. Quote:
Last edited by chaley; 05-06-2010 at 08:07 AM. |
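The normalized authors plus denormalized author_sort arrangement described in this post can be pictured with a small SQLite sketch. The table and column names below are made up for the illustration and should not be read as calibre's actual schema; `add_book` is likewise a hypothetical helper.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_sort TEXT   -- denormalized copy, correctable per book
    );
    CREATE TABLE books_authors (
        book INTEGER REFERENCES books(id),
        author INTEGER REFERENCES authors(id)
    );
""")


def add_book(title, author, author_sort):
    """Store the author once (normalized) and the sort string on the book itself."""
    con.execute("INSERT OR IGNORE INTO authors (name) VALUES (?)", (author,))
    author_id = con.execute(
        "SELECT id FROM authors WHERE name = ?", (author,)).fetchone()[0]
    book_id = con.execute(
        "INSERT INTO books (title, author_sort) VALUES (?, ?)",
        (title, author_sort)).lastrowid
    con.execute("INSERT INTO books_authors (book, author) VALUES (?, ?)",
                (book_id, author_id))
    return book_id


# Family name already comes first here, so a naive "Last, First" flip would be wrong.
add_book("The Three-Body Problem", "Liu Cixin", "Liu Cixin")
add_book("Foundation", "Isaac Asimov", "Asimov, Isaac")
add_book("I, Robot", "Isaac Asimov", "Asimov, Isaac")

titles = [row[0] for row in
          con.execute("SELECT title FROM books ORDER BY author_sort, title")]
author_count = con.execute("SELECT COUNT(*) FROM authors").fetchone()[0]
print(titles, author_count)
```

Because the sort string lives on the book record, it can be corrected by hand for names that do not follow the western first-name, last-name order, without touching the shared author entry.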
|||||||
05-06-2010, 08:54 AM | #15 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
From reading very old threads, I believe the earliest versions of calibre stored ebooks as real BLOBs within the calibre database. From what I read, the change to pseudo-BLOBs stored in a computed directory path was made to improve performance.
|
|