Old 09-19-2010, 04:42 PM   #1
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Regular expressions, Calibre and you- an introduction (Archived)

Revision 8 (27.09. 2010)

The intent of this introduction is not so much to explain all the finer points of regular expression usage, but rather to explain enough to handle some common tasks in Calibre and to get new users knowledgeable enough that they can further educate themselves using the (rather technical) explanation given in the Python documentation, linked through the Calibre manual. So, let's get started.

First, a word of warning and a word of courage: This is, inevitably, going to be somewhat technical- after all, regular expressions are a technical tool for doing technical stuff. I'm going to have to use some jargon and concepts that may seem complicated or convoluted. I'm going to try to explain those concepts as clearly as I can, but really can't do without using them at all. That being said, don't be discouraged by any jargon, as I've tried to explain everything new. And while regular expressions themselves may seem like an arcane, black magic (or, to be more prosaic, a random string of mumbo-jumbo letters and signs), I promise that they are not all that complicated. Even those who understand regular expressions really well have trouble reading the more complex ones, but writing them isn't as difficult- you construct the expression step by step. So, take a step and follow me into the rabbit's hole.

Where in Calibre can you use regular expressions?
There are a few places where Calibre uses regular expressions. There's the header/footer removal in the conversion options, metadata detection from filenames in the import settings and, since the last version, the option to use regular expressions to search and replace in the metadata of multiple books.

What on earth is a regular expression?
A regular expression is a way to describe a particular string of characters (string for short). (Technical note: I'm using string here in the sense it is used in programming languages: a sequence of characters, characters including actual letters, numbers, punctuation and so-called whitespace (linebreaks, tabulators etc.). Please note that generally, uppercase and lowercase characters are not considered the same, so "a" is a different character from "A" and so forth. In Calibre, regular expressions are case insensitive in the search bar, but not in the conversion options. There's a way to make every regular expression case insensitive, but we'll discuss that later.) It gets complicated because a regular expression allows for variations in the strings it matches, so one expression can match multiple strings, which is why people bother using them at all. More on that in a bit.

Care to explain?
Well, that's why we're here. First, this is the most important concept in regular expressions: A string in itself is a regular expression that matches itself. That is to say, if I wanted to match the string "Hello, World!" using a regular expression, the regular expression to use would be
Code:
Hello, World!
And yes, it really is that simple. You'll notice, though, that this only matches the exact string "Hello, World!", not e.g. "Hello, wOrld!" or "hello, world!" or any other such variation.
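To try this outside Calibre, here is a minimal sketch using Python's re module (the engine the Python documentation mentioned above describes). The sample sentences are made up for illustration.

```python
import re

pattern = "Hello, World!"

# A plain string used as a pattern matches exactly that string.
exact = re.search(pattern, "She said: Hello, World!")

# Case matters by default, so this variation is not matched.
variant = re.search(pattern, "hello, world!")

print(exact.group(0))  # Hello, World!
print(variant)         # None
```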

That doesn't sound too bad. What's next?
Next is the beginning of the really good stuff. Remember where I said that regular expressions can match multiple strings? This is where it gets a little more complicated. Say, as a somewhat more practical exercise, the ebook you wanted to convert had a nasty footer counting the pages, like "Page 5 of 423". Obviously the page number would rise from 1 to 423, thus you'd have to match 423 different strings, right? Wrong, actually: regular expressions allow you to define sets of characters that are matched. To define a set, you put all the characters you want to be in the set into square brackets. So, for example, the set
Code:
[abc]
would match either the character "a", "b" or "c". Sets will always only match one of the characters in the set. They "understand" character ranges, that is, if you wanted to match all the lower case characters, you'd use the set
Code:
[a-z]
for lower- and uppercase characters you'd use
Code:
[a-zA-Z]
and so on. Got the idea? So, obviously, using the expression
Code:
Page [0-9] of 423
you'd be able to match the first 9 pages, thus reducing the expressions needed to three: The second expression
Code:
Page [0-9][0-9] of 423
would match all two-digit page numbers, and I'm sure you can guess what the third expression would look like. Yes, go ahead. Write it down.
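A small sketch of how sets behave, again in plain Python rather than in Calibre itself. re.fullmatch is used here so that each test string has to be matched completely; the third expression is left to you, as promised.

```python
import re

# A set matches exactly one character out of those listed.
single = re.fullmatch(r"[abc]", "b")
double = re.fullmatch(r"[abc]", "ab")   # two characters, but only one set

one_digit = r"Page [0-9] of 423"
two_digits = r"Page [0-9][0-9] of 423"

page_5 = re.fullmatch(one_digit, "Page 5 of 423")
page_42_short = re.fullmatch(one_digit, "Page 42 of 423")
page_42 = re.fullmatch(two_digits, "Page 42 of 423")

print(bool(single), bool(double))                        # True False
print(bool(page_5), bool(page_42_short), bool(page_42))  # True False True
```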

Hey, neat! This is starting to make sense!
I was hoping you'd say that. But brace yourself, now it gets even better! We just saw that using sets, we could match one of several characters at once. But you can even repeat a character or set, reducing the number of expressions needed to handle the above page number example to one. Yes, ONE! Excited? You should be! It works like this: Some so-called special characters, "+", "?" and "*", repeat the single element preceding them. (Element means either a single character, a character set, an escape sequence or a group (we'll learn about those last two later)- in short, any single entity in a regular expression.) These characters are called wildcards or quantifiers. To be more precise, "?" matches 0 or 1 of the preceding element, "*" matches 0 or more of the preceding element and "+" matches 1 or more of the preceding element. A few examples: The expression "a?" would match either "" (which is the empty string, not strictly useful in this case) or "a", the expression "a*" would match "", "a", "aa" or any number of a's in a row, and, finally, the expression "a+" would match "a", "aa" or any number of a's in a row (Note: it wouldn't match the empty string!). Same deal for sets: The expression
Code:
[0-9]+
would match any non-negative whole number, no matter how many digits it has! I know what you're thinking, and you're right: If you use that in the above case of matching page numbers, wouldn't that be the single one expression to match all the page numbers? Yes, the expression
Code:
Page [0-9]+ of 423
would match every page number in that book!
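The three quantifiers can be checked with a short Python sketch. The sample footers are invented for the test; only the expressions come from the text above.

```python
import re

footer = re.compile(r"Page [0-9]+ of 423")
pages = ["Page 1 of 423", "Page 87 of 423", "Page 423 of 423"]
all_pages_match = all(footer.fullmatch(p) is not None for p in pages)

# "+" insists on at least one digit, so a footer without a number fails.
no_number = footer.fullmatch("Page  of 423")

# "?" and "*" accept the empty string, "+" does not.
empty_ok = [bool(re.fullmatch(q, "")) for q in (r"a?", r"a*", r"a+")]

print(all_pages_match)  # True
print(no_number)        # None
print(empty_ok)         # [True, True, False]
```

Keep in mind that this expression knows nothing about the book: it would just as happily match "Page 999 of 423".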
A note on these quantifiers: They generally try to match as much text as possible, so be careful when using them. This is called "greedy behaviour"- I'm sure you get why. It gets problematic when you, say, try to match a tag. Consider, for example, the string "<p class="calibre2">Title here</p>" and let's say you'd want to match the opening tag (the part between the first pair of angle brackets, a little more on tags later). You'd think that the expression
Code:
<p.*>
would match that tag, but actually, it matches the whole string! (The character "." is another special character. It matches anything except linebreaks, so, basically, the expression
Code:
.*
would match any single line you can think of.) Instead, try using
Code:
<p.*?>
which makes the quantifier "*" non-greedy. That expression would only match the first opening tag, as intended.
There's actually another way to accomplish this: The expression
Code:
<p[^>]*>
will match that same opening tag- you'll see why after the next section. Just note that there quite frequently is more than one way to write a regular expression.
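Here is the greedy and non-greedy behaviour side by side, as a sketch in Python on the example string from above.

```python
import re

text = '<p class="calibre2">Title here</p>'

# Greedy: ".*" runs to the end and backs up only to the LAST ">".
greedy = re.search(r"<p.*>", text).group(0)

# Non-greedy: ".*?" stops at the FIRST ">" that lets the match succeed.
lazy = re.search(r"<p.*?>", text).group(0)

# Negated set: "anything but >" cannot run past the end of the tag.
negated = re.search(r"<p[^>]*>", text).group(0)

print(greedy)   # <p class="calibre2">Title here</p>
print(lazy)     # <p class="calibre2">
print(negated)  # <p class="calibre2">
```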

Well, these special characters are very neat and all, but what if I wanted to match a dot or a question mark?
You can of course do that: Just put a backslash in front of any special character and it is interpreted as the literal character, without any special meaning. This pair of a backslash followed by a single character is called an escape sequence, and the act of putting a backslash in front of a special character is called escaping that character. An escape sequence is interpreted as a single element. There are of course escape sequences that do more than just escaping special characters, for example "\t" means a tabulator. We'll get to some of the escape sequences later. Oh, and by the way, concerning those special characters: Treat every character that this introduction gives a special function as special, and escape it whenever you want the literal character.
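A quick Python sketch of escaping. The sample strings are made up, and re.escape is a helper from Python's re module for experimenting on your own; it is not something Calibre's dialogs offer as far as this introduction covers.

```python
import re

# Unescaped, the dot stands for any character, so "3x14" matches too.
loose = re.search(r"3.14", "3x14")

# Escaped, the dot is a literal dot.
strict_miss = re.search(r"3\.14", "3x14")
strict_hit = re.search(r"3\.14", "pi is 3.14")

# re.escape adds the backslashes for you.
escaped = re.escape("What?")

print(bool(loose), strict_miss, strict_hit.group(0))  # True None 3.14
print(escaped)                                        # What\?
```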

So, what are the most useful sets?
Knew you'd ask. Some useful sets are
Code:
[0-9]
matching a single number,
Code:
[a-z]
matching a single lowercase letter,
Code:
[A-Z]
matching a single uppercase letter,
Code:
[a-zA-Z]
matching a single letter and
Code:
[a-zA-Z0-9]
matching a single letter or number. You can also use an escape sequence as shorthand:
Code:
\d is equivalent to [0-9]
\w is equivalent to [a-zA-Z0-9_]
\s is equivalent to any whitespace
("Whitespace" is a term for anything that won't be printed. These characters include space, tabulator, line feed, form feed and carriage return.) As a last note on sets, you can also define a set as any character but those in the set. You do that by including the character "^" as the very first character in the set. Thus,
Code:
[^a]
would match any character excluding "a". That's called complementing the set. Those escape sequence shorthands we saw earlier can also be complemented: "\D" means any non-number character, thus being equivalent to
Code:
[^0-9]
The other shorthands can be complemented by, you guessed it, using the respective uppercase letter instead of the lowercase one. So, going back to the example
Code:
<p[^>]*>
from the previous section, now you can see that the character set it's using tries to match any character except for a closing angle bracket.
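The shorthands and their complements, sketched in Python on invented sample strings. One caveat: on ordinary text in Python 3, \d and \w also accept digits and letters from other scripts, so the equivalences listed above hold exactly only for plain ASCII input like the samples here.

```python
import re

digits = re.findall(r"\d+", "Page 12 of 423")      # runs of digits
words = re.findall(r"\w+", "odd_page 7!")          # letters, digits, underscore
pieces = re.split(r"\s+", "a \t b\nc")             # split on any whitespace

# Complements: everything the lowercase shorthand or the set would NOT match.
non_digits = re.findall(r"\D+", "Page 12 of 423")
not_a = re.findall(r"[^a]", "banana")

print(digits)      # ['12', '423']
print(words)       # ['odd_page', '7']
print(pieces)      # ['a', 'b', 'c']
print(non_digits)  # ['Page ', ' of ']
print(not_a)       # ['b', 'n', 'n']
```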

But if I had a few varying strings I wanted to match, things get complicated?
Fear not, life still is good and easy. Consider this example: The book you're converting has "Title" written on every odd page and "Author" written on every even page. Looks great in print, right? But in ebooks, it's annoying. You can group whole expressions in normal parentheses, and the character "|" will let you match either the expression to its right or the one to its left. Combine those and you're done. Too fast for you? Okay, first off, we group the expressions for odd and even pages, thus getting
Code:
(Title)
(Author)
as our two needed expressions. Now we make things simpler by using the vertical bar ("|" is called the vertical bar character): If you use the expression
Code:
(Title|Author)
you'll either get a match for "Title" (on the odd pages) or you'd match "Author" (on the even pages). Well, wasn't that easy?
You can of course use the vertical bar without using grouping parentheses, as well. Remember when I said that quantifiers repeat the element preceding them? Well, the vertical bar works a little differently: The expression "Title|Author" will also match either the string "Title" or the string "Author", just as the above example using grouping. The vertical bar selects between the entire expression preceding and following it. So, if you wanted to match the strings "Calibre" and "calibre" and wanted to select only between the upper- and lowercase "c", you'd have to use the expression "(c|C)alibre", where the grouping ensures that only the "c" will be selected. If you were to use "c|Calibre", you'd get a match on the string "c" or on the string "Calibre", which isn't what we wanted. In short: If in doubt, use grouping together with the vertical bar.
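To see the difference grouping makes, here is a sketch in Python. re.fullmatch is used so that the whole test string has to be covered by the expression.

```python
import re

author = re.fullmatch(r"(Title|Author)", "Author")

# With grouping, only the first letter is chosen between.
grouped_lower = re.fullmatch(r"(c|C)alibre", "calibre")
grouped_upper = re.fullmatch(r"(c|C)alibre", "Calibre")

# Without grouping, the choice is between "c" and "Calibre" as a whole.
bare_lower = re.fullmatch(r"c|Calibre", "calibre")
bare_c = re.fullmatch(r"c|Calibre", "c")

print(bool(author))                             # True
print(bool(grouped_lower), bool(grouped_upper)) # True True
print(bare_lower, bool(bare_c))                 # None True
```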

You missed...
... wait just a minute, there's one last, really neat thing you can do with groups. If you have a group that you previously matched, you can use references to that group later in the expression: Groups are numbered starting with 1, and you reference them by escaping the number of the group you want to reference, thus, the fifth group would be referenced as "\5". So, if you searched for "([^ ]+) \1" in the string "Test Test", you'd match the whole string!
That's really incredibly useless...
Oh, you'll see.
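A sketch of group references in Python. The first two checks use the expression from above; the doubled-word cleanup at the end is an illustration added here, not something taken from Calibre.

```python
import re

repeat = r"([^ ]+) \1"

same = re.fullmatch(repeat, "Test Test")
different = re.fullmatch(repeat, "Test Text")   # second word is not a repeat

# Hypothetical use: collapse an accidentally doubled word.
# \b marks a word boundary, so "the then" would be left alone.
fixed = re.sub(r"\b(\w+) \1\b", r"\1", "the the book")

print(same.group(0), "/", same.group(1))  # Test Test / Test
print(different)                          # None
print(fixed)                              # the book
```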

You missed something. In the beginning, you said there was a way to make a regular expression case insensitive?
Yes, I did, thanks for paying attention and reminding me. You can tell Calibre how you want certain things handled by using something called flags. You include flags in your expression by using the special construct
Code:
(?flags go here)
where, obviously, you'd replace "flags go here" with the specific flags you want. For ignoring case, the flag is "i", thus you include "(?i)" in your expression. Thus,
Code:
(?i)test
would match "Test", "tEst", "TEst" and any case variation you could think of.
Another useful flag lets the dot match any character at all, including the newline, the flag "s". If you want to use multiple flags in an expression, just put them in the same statement: "(?is)" would ignore case and make the dot match all. It doesn't matter which flag you state first, "(?si)" would be equivalent to the above. By the way, put flags at the very beginning of your expression. That way, they don't get mixed up with anything else, and recent Python versions (3.11 and later) refuse an expression that has such flags anywhere else.
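The two flags in action, as a Python sketch with made-up test strings. The flags sit at the very start of each expression here, because recent Python versions (3.11 and later) reject a flag of this kind placed further along.

```python
import re

any_case = re.fullmatch(r"(?i)test", "TEst")

# Without "s" the dot stops at a linebreak; with it, the dot matches anything.
plain_dot = re.search(r"a.b", "a\nb")
dot_all = re.search(r"(?s)a.b", "a\nb")

# Both flags at once; the order of the letters makes no difference.
both = re.search(r"(?is)A.B", "a\nb")
both_swapped = re.search(r"(?si)A.B", "a\nb")

print(bool(any_case))                  # True
print(plain_dot, bool(dot_all))        # None True
print(bool(both), bool(both_swapped))  # True True
```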

I think I'm beginning to understand these regular expressions now... how do I use them in Calibre?
Let's begin with the conversion settings, which is really neat. In the structure detection part, you can input a regexp (short for regular expression) that describes the header or footer string that will be removed during the conversion. The neat part is the wizard (Go ahead, give Kovid some karma already, the man deserves it!): Click on the wizard staff and you get a preview of what Calibre "sees" during the conversion process. Scroll down to the header or footer you want to remove, select and copy it, and paste it into the regexp field on top of the window. If there are variable parts, like page numbers, use sets and quantifiers to cover those, and while you're at it, remember to escape special characters, if there are any. Hit the button labeled "Test" and Calibre highlights the parts it would remove were you to use the regexp. Once you're satisfied, hit OK and convert. Be careful if your conversion source has tags like this example:
Code:
"Maybe, but the cops feel like you do, Anita. What's one more dead vampire? New laws don't change that." </p><p class="calibre4">
<b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4">
It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was
(Shamelessly ripped out of this thread.) You'd have to remove some of the tags as well. In this example, I'd recommend beginning with the tag
Code:
<b class="calibre2">
, now you have to end with the corresponding closing tag (opening tags are <tag>, closing tags are </tag>), which is simply the next
Code:
</b>
in this case. (Refer to a good HTML manual or ask in the forum if you are unclear on this point.) The opening tag can be described using "<b.*?>", the closing tag using "</b>", thus we could remove everything between those tags using
Code:
<b.*?>.*?</b>
But using this expression would be a bad idea, because it removes everything enclosed by <b>- tags (which, by the way, render the enclosed text in bold print), and it's a fair bet that we'll remove portions of the book in this way. Instead, include the beginning of the enclosed string as well, making the regular expression
Code:
<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?</b>
The \s with quantifiers are included here instead of explicitly using the spaces as seen in the string, to catch any variations of the string that might occur. Whenever you test a new expression, remember to check what Calibre will remove to make sure you don't remove any portions you want to keep. If you only check one occurrence, you might miss a mismatch somewhere else in the text. Also note that should you accidentally remove more or fewer tags than you actually wanted to, Calibre tries to repair the damaged code after doing the header/footer removal.
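The difference between the broad and the narrowed expression can be sketched in Python. This assumes the removal behaves like re.sub with an empty replacement, which is a simplification of whatever Calibre does internally. The sample is shortened from the one above, and the bold "only" in the book text is a hypothetical addition to show what the broad expression would destroy.

```python
import re

html = (
    '"New laws don\'t change that." </p><p class="calibre4">\n'
    '<b class="calibre2">Generated by ABC Amber LIT Conv'
    '<a href="http://www.processtext.com/abclit.html" class="calibre3">'
    'erter, http://www.processtext.com/abclit.html</a></b></p>'
    '<p class="calibre4">\n'
    'It had <b>only</b> been two years since Addison v. Clark.'
)

# Broad: removes every bold span, including bold text in the book itself.
broad = re.sub(r"<b.*?>.*?</b>", "", html)

# Narrowed: only bold spans that start with the converter's footer text.
narrow = re.sub(
    r"<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?</b>", "", html
)

print("only" in broad)            # False: book text was lost
print("<b>only</b>" in narrow)    # True: book text survives
print("Generated" in narrow)      # False: footer is gone
```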
Another thing you can use regular expressions for is to extract metadata from filenames. You can find this feature in the "Adding books" part of the settings. There's a special feature here: You can use field names for metadata fields, for example (?P<title>) would indicate that Calibre uses this part of the string as book title. The allowed field names are listed in the window, together with another nice test field (Remember that karma you wanted to give Kovid?). An example: Say you want to import a whole bunch of files named like "Classical Texts: The Divine Comedy by Dante Alighieri.mobi" (Obviously, this is already in your library, since we all love classical Italian poetry) or "Science Fiction epics: The Foundation Trilogy by Isaac Asimov.epub". This is obviously a naming scheme that Calibre won't extract any meaningful data out of- its standard expression for extracting metadata is
Code:
(?P<title>.+) - (?P<author>[^_]+)
A regular expression that works here would be
Code:
[a-zA-Z]+: (?P<title>.+) by (?P<author>.+)
Please note that, inside the group for the metadata field, you need to use expressions to describe what the field actually matches. And also note that, when using the test field Calibre provides, you need to add the file extension to your testing filename, otherwise you won't get any matches at all, despite using a working expression.
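Named groups work the same way in plain Python, so the expression can be tried out there. This sketch makes two assumptions that are not taken from Calibre: the extension is stripped by hand with os.path.splitext, and re.search is used so the expression may start matching in the middle of the name (with a match anchored at the start, "[a-zA-Z]+:" would fail on a category made of several words).

```python
import os
import re

pattern = re.compile(r"[a-zA-Z]+: (?P<title>.+) by (?P<author>.+)")

filenames = [
    "Classical Texts: The Divine Comedy by Dante Alighieri.mobi",
    "Science Fiction epics: The Foundation Trilogy by Isaac Asimov.epub",
]

results = []
for filename in filenames:
    stem, _extension = os.path.splitext(filename)  # drop ".mobi" / ".epub"
    match = pattern.search(stem)
    results.append((match.group("title"), match.group("author")))

for title, author in results:
    print(title, "/", author)
# The Divine Comedy / Dante Alighieri
# The Foundation Trilogy / Isaac Asimov
```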
The last part is regular expression search and replace in metadata fields. You can access this by selecting multiple books in the library and using bulk metadata edit. Be very careful when using this last feature, as it can do Very Bad Things to your library! Double-check that your expressions do what you want them to using the test fields, and only mark the books you really want to change! In the regular expression search mode, you can search in one field, replace the text with something and even write the result into another field. A practical example: Say your library contained the books of Frank Herbert's Dune series, named after the fashion "Dune 1 - Dune", "Dune 2 - Dune Messiah" and so on. Now you want to get "Dune" into the series field. You can do that by searching for "(.*?) \d+ - .*" in the title field and replacing it with "\1" in the series field. See what I did there? That's a reference to the first group you're replacing the series field with. Now that you have the series all set, you only need to do another search for ".*? - " in the title field and replace it with "" (an empty string), again in the title field, and your metadata is all neat and tidy. Isn't that great? By the way, instead of replacing the entire field, you can also append or prepend to the field, so, if you wanted the book title to be prepended with series info, you could do that as well. As you by now have undoubtedly noticed, there's a checkbox labeled "Case sensitive", so you won't have to use flags to select behaviour here.
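The two Dune steps can be rehearsed in Python before touching a real library. The sketch assumes the bulk edit behaves like re.sub on the title text, which is a simplification; the variable names are made up.

```python
import re

titles = ["Dune 1 - Dune", "Dune 2 - Dune Messiah"]

cleaned = []
for old_title in titles:
    # Step 1: the whole title matches, and \1 keeps only the first group.
    series = re.sub(r"(.*?) \d+ - .*", r"\1", old_title)
    # Step 2: strip everything up to and including the first " - ".
    new_title = re.sub(r".*? - ", "", old_title)
    cleaned.append((series, new_title))

print(cleaned)
# [('Dune', 'Dune'), ('Dune', 'Dune Messiah')]
```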

Well, that just about concludes the very short introduction to regular expressions. Hopefully I'll have shown you enough to at least get you started and to enable you to continue learning by yourself- a good starting point would be the Python documentation for regexpes.
One last word of warning, though: Regexpes are powerful, but also really easy to get wrong. Calibre provides really great testing possibilities to see if your expressions behave as you expect them to. Use them. Try not to shoot yourself in the foot. (God, I love that expression...) But should you, despite the warning, injure your foot (or any other body parts), try to learn from it.

Credits:
Thanks for helping with tips, corrections and such:
  • ldolse
  • kovidgoyal
  • chaley
  • dwanthny
  • kacir
  • Starson17

Edit history:
  • added greedy quantifiers, some useful escape sequences, string groups, warning at the end. Still to come: some more practical examples.
  • The editing field is pathetically small for larger posts... Edited for style, tried to clarify distinction between use of parentheses and square brackets (groups vs. sets), notes on strings in general, added some examples, included some flags.
  • Further explained vertical bar usage, rewrote footer removal example, rewrote some parts to make better didactical sense, cleaned up formatting to hopefully make it more coherent.
  • added re-referencing groups, search & replace metadata edit.
  • corrected and clarified character case usage in Calibre
  • corrected error in using groups with quantifiers
  • changed "pipe" to "vertical bar" to avoid confusion

Last edited by Manichean; 01-26-2011 at 05:37 PM. Reason: edit, see history
Old 09-19-2010, 08:35 PM   #2
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
You need to be careful about deleting everything between <p> and </p> tags. In that particular example book if you did that you would delete actual book text in addition to the headers.

While it's generally a good idea to always try to remove both the opening and closing tags, the only format I think that's critical for is epub. Calibre will force the files into xhtml spec if it discovers they're out of spec. (I think for epub it assumes they're in spec, so you could really screw up epub)

Generally, .*? is better than .*, and will usually do what users actually want. I'd use ? instead of *? to make something optional.

You can think of brackets [] as single character groupings, but for string groupings use parentheses and |
(one|two|three|four)

A few other useful expressions:
Matching p tags with any styles/ids:
<p[^>]*>

Never specify actual spaces in your regular expression. Use \s, which tells regex to look for a space. Better yet use \s+ or \s*, which match one or more spaces or zero or more spaces respectively. I make liberal use of \s* in my expressions because you never know when a stray space will hurt you. \s* also has the benefit of passing through any whitespace including tabs and carriage returns. So when you really do need to match everything between <p></p>, except your opening and closing tags are across lines, you can use \s* to get you there.
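A short Python sketch of why \s is safer than typed spaces. The header snippet is invented, with a linebreak, indentation and a tab thrown in.

```python
import re

html = "<p>\n  Header text\t\n</p>"

# Literal spaces only match literal spaces, so the linebreak breaks this one.
literal = re.search(r"<p> Header text </p>", html)

# \s* and \s+ step over spaces, tabs and linebreaks alike.
flexible = re.search(r"<p>\s*Header\s+text\s*</p>", html)

print(literal)         # None
print(bool(flexible))  # True
```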

Last edited by ldolse; 09-23-2010 at 03:12 PM.
Old 09-19-2010, 09:03 PM   #3
kovidgoyal
creator of calibre
Posts: 25,923
Karma: 5035037
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
calibre will try to fix broken html for EPUB as well.
Old 09-20-2010, 04:02 AM   #4
chaley
"chaley", not "charley"
Posts: 5,430
Karma: 831552
Join Date: Jan 2010
Location: France
Device: Many android devices
Quote:
Originally Posted by ldolse View Post
A few other useful expressions:
Matching p tags with any styles/ids:
<p[^]*>
Don't you mean '<p[^>]*>'? Also, this won't work if any of the internal attributes contain a > character. Consider <p foo=">">.
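The early stop is easy to reproduce in Python. The second expression is a hypothetical workaround sketched for this one case: it only knows about double-quoted attribute values and is not a general answer to the problem.

```python
import re

tag = '<p foo=">">'

# The negated set stops at the first ">", which sits inside the attribute.
simple = re.match(r"<p[^>]*>", tag).group(0)

# Hypothetical workaround: treat "..." as one unit so a ">" inside is skipped.
quote_aware = re.match(r'<p(?:[^>"]|"[^"]*")*>', tag).group(0)

print(simple)       # <p foo=">
print(quote_aware)  # <p foo=">">
```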

This exposes one of the problems with regular expressions. Using regexps, it is difficult to do delimited matching in the constrained case, and impossible in the general case. Doing it right usually requires a recursive state machine, which by definition cannot be described by a regular expression. For fun, try to write a regular expression that matches any palindrome. (http://en.wikipedia.org/wiki/Palindrome. Examples: abcdedcba or 'madam im adam' with spaces ignored.) You will fail.

Edit: the paragraph above deals with computational theory and does not belong in a tutorial. However, it might be useful for Manichean, which is why I added it.

Last edited by chaley; 09-20-2010 at 04:06 AM.
Old 09-20-2010, 04:27 AM   #5
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Thanks for the suggestions, all of you. I'll edit those in sometime the next few days.
I do have one question, though...

Quote:
Originally Posted by ldolse View Post
You need to be careful about deleting everything between <p> and </p> tags. In that particular example book if you did that you would delete actual book text in addition to the headers.
I don't see that in this example. I haven't actually tried it, but wouldn't matching, in this case, start at "<p class="calibre4"><b class="calibre2">Generated by..." and finish with the closing "</p>"? In which case, no book text would be removed?
Old 09-20-2010, 05:09 AM   #6
DoctorOhh
US Navy, Retired
Posts: 8,838
Karma: 12535517
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by Manichean View Post
I don't see that in this example. I haven't actually tried it, but wouldn't matching, in this case, start at "<p class="calibre4"><b class="calibre2">Generated by..." and finish with the closing "</p>"? In which case, no book text would be removed?
I know nothing about regex and I am glad you are attempting this primer for those of us without any idea what a regex is.

As I see it, a brief review of your example above makes it look like you are saying to start your regex with <p class="calibre4"> and end with </p>. Without being much more specific, this will include many other paragraphs of text as well.

Looking forward to the finished primer.
Old 09-20-2010, 05:29 AM   #7
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
Originally Posted by dwanthny View Post
I know nothing about regex and I am glad you are attempting this primer for those of us without any idea what a regex is.

As I see it in your example above, a brief review makes it look like you are saying to start your regex with <p class="calibre4"> and end with </p> without being much more specific this will include many other paragraphs of text as well.
Well, that depends on what the regexp really is like- if you read carefully, you'll notice I've only mentioned that you should remove corresponding tags, I haven't really given a regexp. Given the example text I've used, I don't see how cutting out the part between the opening and closing <p>-tags removes any book text.
Old 09-20-2010, 05:55 AM   #8
DoctorOhh
US Navy, Retired
Posts: 8,838
Karma: 12535517
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by Manichean View Post
Well, that depends on what the regexp really is like- if you read carefully, you'll notice I've only mentioned that you should remove corresponding tags, I haven't really given a regexp. Given the example text I've used, I don't see how cutting out the part between the opening and closing <p>-tags removes any book text.
Again, I have no expertise in this area, but I do have experience being an ignorant noob willing to attempt anything. So in that regard I am trying to lend a fresh set of eyes.

I'm not arguing that you were trying to remove everything between <p> </p> tags. I was just pointing out the obvious way ldolse saw the connection of deleting anything between <p> </p> tags. A new user reading a primer would take away the opening and closing info from your code tags and then try to start applying expressions to everything in between.

To stave off this type of confusion, your top code box should be more specific and show the opening tag by including "<p class="calibre4"><b class="calibre2">Generated by".

You stop the initial part of the primer in a spot that could get an energetic user in trouble.

One more thing, I could be far off base, but since folks will be seeing the below code in their book viewer, html viewer or Sigil, and the below will be word wrapped in those viewers (or not I'm not sure), wouldn't it be better to put it in quotes so users can see the entire picture?

Quote:
"Maybe, but the cops feel like you do, Anita. What's one more dead vampire? New laws don't change that." </p><p class="calibre4"> <b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4"> It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was
Great work so far!

Update: In my experience if the above is in a <p class="calibre4"> tag most every paragraph in the book will be using that tag too. That's why any primer needs to emphasize ways of limiting your expression to avoid accidentally removing your entire text.
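
To make the point about limiting the expression concrete, here is a minimal sketch in plain Python (which is what Calibre's regexps are based on). The sample text is modelled on the snippet quoted in this thread; the closing </p> on the last paragraph and both patterns are assumptions made up for illustration, not something taken from the tutorial.

```python
import re

# Sample modelled on the quoted snippet; the final </p> is added here
# so that the example is a complete piece of markup.
html = (
    '<p class="calibre4">New laws don\'t change that." </p>'
    '<p class="calibre4"> <b class="calibre2">Generated by ABC Amber LIT Conv'
    '<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, '
    'http://www.processtext.com/abclit.html</a></b></p>'
    '<p class="calibre4"> It had only been two years since Addison v. Clark.</p>'
)

# Too broad: starts at the first calibre4 paragraph and runs to the last </p>.
too_broad = r'<p class="calibre4">.*</p>'

# Narrower: the opening tags must be followed by the footer text itself,
# and the non-greedy .*? stops at the first closing </b></p>.
specific = (r'<p class="calibre4">\s*<b class="calibre2">'
            r'Generated by ABC Amber LIT Conv.*?</b></p>')

broad_result = re.sub(too_broad, '', html)
specific_result = re.sub(specific, '', html)

print(repr(broad_result))   # everything is gone, book text included
print(specific_result)      # only the footer paragraph is gone
```

The broad pattern removes the whole sample, while the narrower one leaves both paragraphs of book text alone.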

Last edited by DoctorOhh; 09-20-2010 at 06:05 AM.
Old 09-20-2010, 06:45 AM   #9
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Ah, now I see what you mean. Thanks for the input, especially the part about getting energetic users in trouble, I'll have to think about that.

Quote:
Originally Posted by dwanthny
One more thing, I could be far off base, but since folks will be seeing the below code in their book viewer, html viewer or Sigil, and the below will be word wrapped in those viewers (or not I'm not sure), wouldn't it be better to put it in quotes so users can see the entire picture?
I'm deliberately using code-tags, since they should preserve linebreaks as originally written. The Calibre regexp wizard, as far as I know, does use wordwrapping, but should interpret the code with the original linebreaks. I chose this way to avoid confusion, since everyone will be seeing the same stuff in the code-tags.
Old 09-20-2010, 07:20 AM   #10
kacir
Wizard
Posts: 2,784
Karma: 3098803
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
Great work.
I am a big fan of Regular Expressions, and I have recently started to use Calibre for things other than just a conversion now and then.

I will keep a close eye on this thread.
Regular Expressions are very powerful stuff and deserve to be popularized a little bit more.

Quote:
Originally Posted by Manichean
I don't see that in this example.
?, + and * are called quantifiers, because they quantify whatever lies before them.
The very first thing a beginner needs to know about these standard quantifiers, which you can see in any RE implementation, is that they are GREEDY.
Yes, there are also non-greedy quantifiers, as one of previous posters pointed out. In Python syntax those are *?, +?, ??.
Yes, there are *many* different syntaxes for Regular Expressions. I won't go further, I do not want to scare our dear readers away ;-)

A '*' quantifier will eat as much of the string as it can.
Let's have an example. You have the string
'AuthorFirstName AuthorLastName - series - title.epub'
and you want to match 'AuthorFirstName AuthorLastName - '. So, you write an expression like:
'.* - ' to match Author. But! '.' matches any character and '*' quantifier takes as much as possible, so instead of matching 'AuthorFirstName AuthorLastName - ' as you have intended, you will match 'AuthorFirstName AuthorLastName - series - '

You need to search for
'[^-]* - '
'[^-]' means match ANY character BUT '-'

If the first character in a group is '^', the rest of the group is effectively a list of characters that are NOT supposed to be matched.
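
For anyone who wants to try this outside of Calibre, here is a small sketch using Python's re module. It runs the greedy pattern, the negated-set pattern and the non-greedy variant against the filename from the example; the variable names are only illustrative.

```python
import re

filename = 'AuthorFirstName AuthorLastName - series - title.epub'

# Greedy: '.*' grabs as much as it can, so the match runs to the LAST ' - '.
greedy = re.match(r'.* - ', filename)

# Negated set: '[^-]*' cannot cross a '-', so the match stops at the first one.
negated = re.match(r'[^-]* - ', filename)

# Non-greedy quantifier: '.*?' takes as little as possible.
lazy = re.match(r'.*? - ', filename)

print(repr(greedy.group(0)))
print(repr(negated.group(0)))
print(repr(lazy.group(0)))
```

The first pattern swallows the series as well; the other two stop after the author.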


I very, *very* strongly recommend THE best^H^H^H^Hmost exhaustive (pun intended) book ever written about Regular Expressions - Mastering Regular Expressions - Book on regular expressions by Jeffrey Friedl, published by O’Reilly.
Please see http://docs.python.org/library/re.html for a recommendation about which version of the book to use.
The book is difficult, but worth its weight in gold if you want to understand Regular Expressions.
Old 09-20-2010, 05:56 PM   #11
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by chaley
Don't you mean '<p[^>]*>'? Also, this won't work if any of the internal attributes contain a > character. Consider <p foo=">">.
I did mean '<p[^>]*>', thanks. I was aware that internal attributes are a problem for that expression, but I figured that was getting into a rat-hole for this level of tutorial, and in practice it's pretty rare.
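
A quick sketch in plain Python of both the normal case and the corner case chaley mentions; the two sample strings are made up for illustration.

```python
import re

tag_pattern = re.compile(r'<p[^>]*>')

normal = '<p class="calibre4">Some text</p>'
tricky = '<p foo=">">Some text</p>'   # attribute value contains a '>'

normal_match = tag_pattern.match(normal).group(0)
tricky_match = tag_pattern.match(tricky).group(0)

print(normal_match)   # the complete opening tag
print(tricky_match)   # cut short at the '>' inside the attribute
```

In the second case the match ends at the first '>', which sits inside the attribute value, so only part of the tag is matched.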
Old 09-20-2010, 06:57 PM   #12
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Edited the tutorial to include suggestions as of now. It'd be great if the resident gurus could give it another read. Thank you, guys- you know who you are
Old 09-21-2010, 03:23 AM   #13
chaley
"chaley", not "charley"
Posts: 5,430
Karma: 831552
Join Date: Jan 2010
Location: France
Device: Many android devices
Thank you for writing this!

Comments below. Some are pedantic, but I can't help it. Others are personal preference. All can be ignored.

Quote:
Originally Posted by Manichean
What on earth is a regular expression?
A regular expression is a way to describe a particular string of characters (string for short). Technical note: I'm using string here in the sense it is used in programming languages: a string of two or more characters, characters including actual characters, numbers, punctuation and so-called whitespaces (linebreaks, tabulators etc.). It gets complicated because regular expressions allow for variations in the strings it matches, so one expression can match multiple strings. More on that in a bit.
In regular-expression land (and in many programming languages), a single character is a string.
Quote:
Care to explain?
Well, that's why we're here. First, this is the most important concept in regular expressions: A string in itself is a regular expression that matches itself. That is to say, if I wanted to match the string "Hello, World!" using a regular expression, the regular expression to use would be
Code:
Hello, World!
Do you want to mention that 'H' and 'h' are different characters? Perhaps you do further down.
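
As a small illustration of the case question, here is how it looks in plain Python, where matching is case-sensitive unless a flag says otherwise. Calibre may configure this differently, so take this as a sketch of the regexp engine's behaviour and not of Calibre's; the sample text is made up.

```python
import re

text = 'She typed: Hello, World!'

# A literal string is a regular expression that matches itself.
exact = re.search(r'Hello, World!', text)

# 'h' and 'H' are different characters, so this finds nothing.
wrong_case = re.search(r'hello, world!', text)

# With the ignore-case flag the same pattern matches again.
ignoring_case = re.search(r'hello, world!', text, re.IGNORECASE)

print(exact is not None, wrong_case is not None, ignoring_case is not None)
```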
Quote:
And yes, it really is that simple. A word of warning: There are a handful of special characters that have some special function in regular expressions.
...
I would suggest that this complexity be pushed down in the document. I, the poor reader, don't know yet why I care. Introduce the need for escaping when the problem arises. Instead, give examples here. You might introduce ignoring case at this point in the examples.
Quote:
My head is spinning...
Already? We're only just getting to the good stuff. Okay, take a breath and relax... feeling better? I promise, I'll try to take it slow and keep it simple. Remember where I said that regular expressions can match multiple strings? This is where it gets a little more complicated.
You might want to start with a simpler example. One might be recognizing a particular author, say Pierre-Yves Trudeau. This author might appear as Pierre Yves Trudeau, P Y Trudeau, Pierre Trudeau, or P. Y. Trudeau. You decide that anything starting with P and ending with ' Trudeau' should match. This introduces '.' and quantifiers.
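
A sketch of that suggestion in plain Python. The pattern 'P.* Trudeau' is one possible way to write "starts with P and ends with ' Trudeau'"; the last name in the list is an extra entry added here only to show a non-match.

```python
import re

# 'P', then any run of characters ('.' with the '*' quantifier), then ' Trudeau'.
pattern = re.compile(r'P.* Trudeau')

names = [
    'Pierre-Yves Trudeau',
    'Pierre Yves Trudeau',
    'P Y Trudeau',
    'Pierre Trudeau',
    'P. Y. Trudeau',
    'Justin Trudeau',   # does not start with 'P'
]

# fullmatch requires the whole name to fit the pattern.
matches = [name for name in names if pattern.fullmatch(name)]
print(matches)
```

All five spellings of the author match with the one expression, and the unrelated name is left out.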
Quote:
Say, as a somewhat more practical exercise, the ebook you wanted to convert had a nasty footer counting the pages, like "Page 5 of 423". Obviously the page number would rise from 1 to 423, thus you'd have to match 423 different strings, right? Wrong, actually: regular expressions allow you to define groups of characters that are matched. To define a group, you put all the characters you want to be in the group into square brackets. So, for example, the group
Code:
[abc]
would match either the character "a", "b" or "c". Groups will always only match one of the characters in the group.
Unless the group is [^abc], in which case it will always match characters that are *not* in the group.
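
Both forms side by side, as a minimal sketch in plain Python (the word 'cabbage' is just a convenient test string):

```python
import re

word = 'cabbage'

# [abc] matches exactly one character that is 'a', 'b' or 'c'.
in_set = re.findall(r'[abc]', word)

# [^abc] matches exactly one character that is NOT 'a', 'b' or 'c'.
not_in_set = re.findall(r'[^abc]', word)

print(in_set)
print(not_in_set)
```

Either way, each match is a single character; only the question of which characters qualify is reversed.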
Quote:
Groups "understand" character ranges, that is, if you wanted to match all the lower case characters, you'd use the group
Code:
[a-z]
, for lower- and uppercase characters you'd use
Code:
[a-zA-Z]
and so on.
Consider whether or not you want to introduce the shorthand character classes. Perhaps not here, but maybe somewhere? I mean: '\d' == [0-9], '\D' == [^0-9], '\s' == (set of whitespace) (this one is important), '\w' == [a-zA-Z0-9_]. (Note: I see that you did this further down.)
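
A short sketch of these shorthand classes in plain Python. One caveat worth stating: in current Python versions '\d' and '\w' also match non-ASCII digits and letters by default, so the re.ASCII flag is passed here to make the equivalences listed above exact. The sample string is made up.

```python
import re

sample = 'Page 12\tof 423\n'

digits = re.findall(r'\d+', sample, flags=re.ASCII)      # same as [0-9]+
non_digits = re.findall(r'\D+', sample, flags=re.ASCII)  # same as [^0-9]+
whitespace = re.findall(r'\s', sample, flags=re.ASCII)   # space, tab, newline, ...
words = re.findall(r'\w+', sample, flags=re.ASCII)       # same as [a-zA-Z0-9_]+

print(digits)
print(non_digits)
print(whitespace)
print(words)
```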
Quote:
Got the idea? ...
It works like this: Some of the special characters, "+", "?" and "*", repeat the character or group preceding them. These characters are called wildcards or quantifiers.
You might want to be precise with your wording. Using 'group' will get you into trouble eventually, when it gets confused with grouping for alternation (or) and for backreferences. I suggest that you use 'set' or 'class' for [] expressions, and reserve the word 'group' for parenthesized expressions.

You might also want to introduce the word 'element', which means a character or class or (eventually) group. Quantification applies to the previous element. (The computer scientist in me wants to get into recursion, but that would be a disaster. )
Quote:
To be more precise, "?" matches 0 or 1 of the preceding character/group,
This is where you could use 'element'.
Quote:
...
I know what you're thinking, and you're right: If you use that in the above case of matching page numbers, wouldn't that be the single one expression to match all the page numbers? Yes, the expression
Code:
Page [0-9]+ of 423
would match every page number in the book! And then some, but that's not the concern here.
I wouldn't include the last sentence. Even I don't understand what you are trying to say.
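
Here is the page-number expression at work, as a sketch in plain Python on a made-up piece of text. The last check shows one possible reading of "and then some": the expression also accepts page numbers that cannot occur in a 423-page book.

```python
import re

text = ('End of chapter. Page 5 of 423 Next chapter. '
        'Page 123 of 423 The end.')

# [0-9]+ means: one or more digits, so one expression covers every page.
footer = re.compile(r'Page [0-9]+ of 423')

found = footer.findall(text)
cleaned = footer.sub('', text)

# The expression is broader than strictly necessary.
beyond = footer.fullmatch('Page 9999 of 423') is not None

print(found)
print(cleaned)
print(beyond)
```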
Quote:
A note on these quantifiers: They generally try to match as much text as possible, so be careful when using them. This behaviour is called "greedy quantifiers"-
The behaviour is 'greedy'. The quantifier specifies whether or not the behavior is greedy.
Quote:
I'm sure you get why. This gets problematic when you, say, try to match a tag. Consider, for example, the string "<p class="calibre2">Title here</p>" and
Be careful with your choice of delimiters. Delimiting by " when there are embedded " can lead to confusion. I suggest that you use the CODE tags here, as you have most other places, and not delimit the string at all.
Quote:
let's say you'd want to match the opening tag (the part between the first pair of angle brackets, a little more on tags later). You'd think that the expression
Code:
<p.*>
would match that tag, but actually, it matches the whole string! (The character ".", as noted before,
I don't see where it was noted before.
Quote:
is a special character. It matches anything except linebreaks,
I know you don't want to introduce DOTALL here, but I want to make sure you know about it. If DOTALL is in effect, the dot will match line endings.
Quote:
so, basically, the expression
Code:
.*
would match any single line you can think of. That's less useful than it may seem.)
Why is it less useful? I suggest you don't confuse things here. Show why it can be less useful when the problem arises.
Quote:
Instead, try using
Code:
<p.*?>
which makes the quantifier "*" non-greedy.
You might want to explain again the difference between greedy and non-greedy, because the concept is so important. To paraphrase LOLcats and icanhascheezburger: Greedy: I eatz all the cheezburgers, savin none fur yu. Non-greedy: I eatz one cheezburger, savin the restz fur yu.
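
The difference is easy to see when both versions run on the string from the tutorial. A minimal sketch in plain Python:

```python
import re

html = '<p class="calibre2">Title here</p>'

# Greedy: '.*' eats everything up to the LAST '>' in the string.
greedy = re.search(r'<p.*>', html).group(0)

# Non-greedy: '.*?' stops at the FIRST '>' it can.
lazy = re.search(r'<p.*?>', html).group(0)

print(greedy)
print(lazy)
```

The greedy version returns the whole string, the non-greedy version only the opening tag.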
Quote:
...
...
The book you're converting has "Title" written on every odd page and "Author" written on every even page. Looks great in print, right? But in ebooks, it's annoying. You can group whole expressions in normal parentheses, and the character "|" will let you match either the expression to its right or the one to its left.
See, now we have ambiguity in the term 'group'
Quote:
...Now we make things simpler by using the pipe ("|" is called the pipe character): If you use the expression
Code:
(Title|Author)
you'll either get a match for "Title" (on the odd pages) or you'd match "Author" (on the even pages). Well, wasn't that easy?
It is called the pipe on *nix systems, and nowhere else. You might consider calling it a 'vertical bar', or 'bar' for short. I also might put the 'or' above in caps, or bold, or something to draw attention to what the bar is doing.
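
A small sketch of the bar acting as OR, in plain Python. The list of lines is a made-up stand-in for the text of a converted book, with the running headers on lines of their own.

```python
import re

# Hypothetical book text, one line per entry.
lines = [
    'Title',
    'Some text from page one.',
    'Author',
    'Some text from page two.',
]

# Matches 'Title' OR 'Author'.
header = re.compile(r'(Title|Author)')

# Drop every line that consists of nothing but one of the two headers.
kept = [line for line in lines if not header.fullmatch(line)]
print(kept)
```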
Quote:
...
and while you're at it, rememper to escape special characters,
s/rememper/remember/
Quote:
...
Be careful if your conversion source has tags like this example:
Code:
"Maybe, but the cops feel like you do, Anita. What's one more dead vampire? New laws don't change that." </p><p class="calibre4">
<b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4">
It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was
(Shamelessly ripped out of this thread.) You'd have to remove some of the tags as well.
At this point you are starting a tutorial on HTML. Do you really want to do that? Perhaps you do...
Quote:
...
Also note that Calibre tries to repair damaged code after doing the header/footer removal.
Secondly, ....
The 'Firstly' was a long time back. You might want to make the transition more explicit, such as 'Now let's look at another use of regexps in calibre' or some such.
Quote:
you can use regular expressions to extract metadata from filenames. You can find this feature in the "Adding books" part of the settings.
Assuming you aren't bored with writing, a few examples would be good here. Then the next paragraph would be better situated.
Quote:
There's a special feature here: You can use field names for metadata fields, for example (?P<title>) would indicate that calibre uses this part of the string as book title.
...
Edit: added greedy quantifiers, some useful escape sequences, string groups, warning at the end. Still to come: some more practical examples.
My opinion: what you are doing here is very good. You are situating a complicated topic within the environment it is used, explaining the parts that cover the vast majority of the cases, and are doing so in a colloquial style. Good stuff.
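
Since examples for the filename feature were asked for above, here is a hedged sketch of how the (?P<title>) syntax behaves in plain Python. Only the 'title' field is mentioned in the tutorial; the 'author' and 'series' names, the filename layout and the pattern itself are assumptions made for illustration, and whether Calibre accepts exactly these field names should be checked against its manual.

```python
import re

filename = 'AuthorFirstName AuthorLastName - series - title.epub'

# Each (?P<name>...) is a named group; the names are assumed for this example.
pattern = re.compile(
    r'(?P<author>[^-]+) - (?P<series>[^-]+) - (?P<title>.+)\.epub'
)

match = pattern.match(filename)
fields = match.groupdict()   # maps each group name to the text it matched
print(fields)
```

Note the escaped dot in '\.epub', which matches a literal '.' and not just any character.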

Last edited by chaley; 09-21-2010 at 08:00 AM.
Old 09-21-2010, 03:57 AM   #14
Manichean
Wizard
Posts: 3,130
Karma: 80446
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
First of all, thank you again for your comments. This is actually where it starts to become a learning experience for me as well.
I'll definitely go back and edit the text again, but right after reading this, a few comments:
Quote:
Originally Posted by chaley
I regular-expression land (and in many programming languages), a single character is a string.
Well, the first programming language I learned was C. And if you go and define a string as char[] ... (Yes, that would contain the single character case as well, but somehow, I always think about strings as being at least two characters in length.)
Quote:
Originally Posted by chaley
I know you don't want to introduce DOTALL here, but I want to make sure you know about it. If DOTALL is in effect, the dot will match line endings.
I know about flags, but as far as I know, Calibre doesn't allow for them to be used, am I right?
Quote:
Originally Posted by chaley
At this point you are starting a tutorial on HTML. Do you really want to do that? Perhaps you do...
The thought here is that, judging from the posts we saw concerning the use of regexpes, at least some of the people wanting to use them have never seen HTML or anything similar. I wanted to explain what can be removed without going into any detail. I haven't decided whether to remove or rewrite this part, seeing how Calibre tries to correct broken syntax.
Quote:
Originally Posted by chaley
Assuming you aren't bored with writing, a few examples would be good here. Then the next paragraph would be better situated.
In my opinion, there are way too few examples in that thing. I'll get around to that, but I'll have to try out the more non-trivial cases first, so that whatever expressions I write work. Would be pretty bad form to have broken expressions in a tutorial

By the way, concerning your comment on palindromes a while back: I think I see what you mean. I believe I've figured out how to match any palindrome of a given length not containing whitespaces (as in I couldn't match "madam im adam"), but that's about as far as I got.
Old 09-21-2010, 04:39 AM   #15
chaley
"chaley", not "charley"
Posts: 5,430
Karma: 831552
Join Date: Jan 2010
Location: France
Device: Many android devices
Quote:
Originally Posted by Manichean
I know about flags, but as far as I know, Calibre doesn't allow for them to be used, am I right?
You can use flags, but you must use embedded syntax. From the python docs:
Code:
(?iLmsux)

    (One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.
Two things to note:
1) ignore case is turned on by default, and therefore cannot be turned off.
2) (in python) the flags affect the entire expression, even if they occur later in the expression.

So, to use DOTALL to match tags split across lines (<a> tags are famous for this), I would do something like '(?s)<.*?>'. Re.M is also incredibly useful, because it allows you to use anchored expressions that match in the middle of the document. For example, '(?ims)^<a.*?\/a>$' would match hyperlink tags that start at the beginning of a line, end at the end of a line, but perhaps contain line endings (the 's' is needed so that the dot can cross those line endings).
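
A sketch of the embedded flags in plain Python, using made-up sample text. Two things to keep in mind: the anchored pattern here includes the 's' flag, since without it the dot cannot cross a line ending, and recent Python versions want the embedded flags at the very start of the expression.

```python
import re

html = 'Some text <a\nhref="http://example.com">link</a> more text'

# Without (?s) the dot stops at the line break, so the split <a ...> tag is missed.
without_flag = re.findall(r'<.*?>', html)

# With (?s) the dot also matches line endings, so both tags are found.
with_dotall = re.findall(r'(?s)<.*?>', html)

doc = 'intro\n<a href="x">\nlink</a>\noutro'

# (?m) lets ^ and $ match at every line start and end, not just at the
# start and end of the whole document.
anchored = re.findall(r'(?ims)^<a.*?/a>$', doc)

print(without_flag)
print(with_dotall)
print(anchored)
```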
Quote:
The thought here is that, judging from the posts we saw concerning the use of regexpes, at least some of the people wanting to use them have never seen HTML or anything similar. I wanted to explain what can be removed without going into any detail. I haven't decided whether to remove or rewrite this part, seeing how Calibre tries to correct broken syntax.
I think your reasoning to keep it is correct. This is indeed what people ask about. And in any event, it is better not to break the syntax than to hope calibre fixes it up correctly.
Quote:
By the way, concerning your comment on palindromes a while back: I think I see what you mean. I believe I've figured out how to match any palindrome of a given length not containing whitespaces (as in I couldn't match "madam im adam"), but that's about as far as I got.
Yea, known length palindromes are easy, because you can use group backreferences. Dealing with the spaces is a pain, yes, but done by consuming all spaces outside the grouping parentheses.
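
As an illustration of the backreference idea, here is a sketch for palindromes of exactly five letters in plain Python. The patterns are one possible way to write it and not necessarily what either poster had in mind; the second one allows whitespace between the letters by consuming it outside the parentheses.

```python
import re

# Two captured letters, a middle letter, then the captures in reverse order.
five = re.compile(r'(\w)(\w)\w\2\1')

is_level = five.fullmatch('level') is not None
is_hello = five.fullmatch('hello') is not None

# Same idea, with optional whitespace consumed outside the groups.
five_spaced = re.compile(r'\s*(\w)\s*(\w)\s*\w\s*\2\s*\1\s*')

is_spaced = five_spaced.fullmatch('ma d am') is not None

print(is_level, is_hello, is_spaced)
```

Every other length needs its own expression, which is the practical face of the limitation explained below.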

<professorial_mode>
The general case cannot be solved with regular expressions because REs don't have the notion of 'stack'. Said another way, and getting a bit formal, all REs by definition can be translated into a deterministic finite state machine. The important part here is that the number of states is known from the RE, and is fixed for all utterances (text to be matched). Parsing utterances in a palindromic language requires a state for each letter up to the center point so the machine can match the right letter after the center point. Such a machine requires len(utterance)/2 states. Thus the number of states is unbounded, meaning that the grammar for the language cannot be described using an RE.

Because of the above problem, compilers usually use multiple grammars. One describes the input alphabet (identifiers etc) and symbols, and can often be an RE. Another describes the order of symbols, and is almost always a non-regular context-free grammar. Sometimes there are more grammars for certain constructs or for the optimizer.
</professorial_mode>