MobileRead Forums - View Single Post - Need Help With "Regular Expression Syntax"

kacir · 10-04-2010, 05:04 AM

Hi. Welcome to MobileRead.
You might find this thread very informative:
https://www.mobileread.com/forums/showthread.php?t=99258

Quote:

Originally Posted by wladdy

(?P<author>[^-]+)
matches any string of characters other than the character - and makes that string the 'author' field.

Exactly.

Quote:

Originally Posted by wladdy

I don't get [^-]. Wouldn't that eliminate a possible hyphen from the name? Why not simply use (?P<author>.+), like you do for the title?

Because the + quantifier is greedy: it would eat all the characters it could, including the first hyphen, which is supposed to end the author name.
The same problem shows up elsewhere. If you have the string
"<b> this is bold text </b>"
and you want to get rid of the HTML tags, you might be tempted to write the search string as "<.*>", but that would eat up everything from the first "<" to the last ">", the text in between included.
I could have used the "non-greedy" version of the + quantifier, so the RE in question might look like (?P<author>.+?).
But I am old-fashioned: I learned to write regular expressions using tools that do not support non-greedy quantifiers, and this way the RE is more portable across the various RE engine implementations.
This RE was created under the assumption that there are no hyphens in the name of the author or the name of the series. This assumption holds 99.5% of the time; the remaining books (0.5%) will have to be tweaked by hand.
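
To make the greedy / non-greedy difference concrete, here is a small sketch in Python (the (?P<name>...) syntax is Python's). The file name and the two-field pattern are made up for illustration; they are not the full RE discussed in this thread.

```python
import re

text = "<b> this is bold text </b>"

# Greedy: .* runs to the LAST ">", so the whole string is a single match.
greedy = re.sub(r"<.*>", "", text)

# Non-greedy: .*? stops at the first ">" that lets the match succeed.
lazy = re.sub(r"<.*?>", "", text)

# Negated character class: same result, works in engines without .*?
negated = re.sub(r"<[^>]*>", "", text)

print(repr(greedy))   # ''
print(repr(lazy))     # ' this is bold text '
print(repr(negated))  # ' this is bold text '

# The same idea applied to a hypothetical file name with two " - " separators.
name = "Author Name - Series 01 - Title"
patterns = {
    "greedy": r"(?P<author>.+) - (?P<title>.+)",
    "lazy": r"(?P<author>.+?) - (?P<title>.+)",
    "negated": r"(?P<author>[^-]+) - (?P<title>.+)",
}
authors = {label: re.match(p, name).group("author") for label, p in patterns.items()}

print(authors["greedy"])   # Author Name - Series 01
print(authors["lazy"])     # Author Name
print(authors["negated"])  # Author Name
```

The greedy version swallows the series into the author field; the other two stop at the first " - ".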

Quote:

Originally Posted by wladdy

( - \[?(?P<series>[^-]+)(\[| )+(?P<series_index>[0-9]+)\]?)?
The whole expression between the first and the last parenthesis is followed by a question mark. Does this question mark mean that the whole expression can either not appear or appear once, thus letting us process two types of books (those with a series and those without)?

Exactly. The first and last parentheses, together with the question mark, make all the series info optional.
I was tired of changing the RE back and forth whenever I needed to import a book with or without a series.
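
A quick way to see the optional group at work, sketched in Python. Only fragments of the RE are quoted in this post, so the full pattern below is an assumption: it simply glues the quoted author, series and title pieces together. The two file names are invented.

```python
import re

# ASSUMPTION: full pattern assembled from the fragments quoted in this post.
PATTERN = re.compile(
    r"(?P<author>[^-]+)"
    r"( - \[?(?P<series>[^-]+)(\[| )+(?P<series_index>[0-9]+)\]?)?"  # optional series block
    r" - (?P<title>.+)"
)

def parse(name):
    """Return the named groups as a dict, or None if the name does not match."""
    m = PATTERN.match(name)
    return m.groupdict() if m else None

with_series = parse("Author - Series 01 - Title")
without_series = parse("Author - Title")

print(with_series)
# {'author': 'Author', 'series': 'Series', 'series_index': '01', 'title': 'Title'}
print(without_series)
# {'author': 'Author', 'series': None, 'series_index': None, 'title': 'Title'}
```

When the series block is skipped, its named groups come back as None instead of making the whole match fail.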

Quote:

Originally Posted by wladdy

Again, I don't get the [^-].

Explained above

Quote:

Originally Posted by wladdy

I'm also not sure about the + in (\[| )+. Is it to process the possibility of an erroneous duplication of either a left bracket or a white space before the series index?

The series number can be separated from the series name by a square bracket, a space, or a combination of those. So perhaps we should replace + with * to accommodate the case where the series number is not separated by anything.
You see, I have built this regular expression gradually, through use. I wrote a first simple one, then I came across a book where it didn't work, so I tweaked it to work on that book as well ...
So far I have been processing books where
- author, (optional series info)?, and title are separated by " - "
- the whole series or just the series number might be enclosed in square brackets, so the RE is made to match also:
author - series - title
author - series 01 - title
author - [series] - title
author - [series 01] - title
author - series [01] - title
author - series [ 01] - title
and as a side effect might also match nonsense like
author - [series - title
author - [series [01 - title
author - series 01] - title
but in its current form wouldn't match
author - series01 - title
(notice there is no space or square bracket separating the series number from the series name)
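
The file name shapes above can be tried out with a short Python sketch. As before, the full pattern is an assumption glued together from the fragments quoted in this post; build() is a hypothetical helper that lets the + after (\[| ) be swapped for *. One thing the experiment suggests: with a greedy [^-]+ in front, the * variant lets the series name keep every digit but the last, so it may not be a safe replacement.

```python
import re

def build(sep_quantifier):
    """Assemble the ASSUMED full pattern with the given quantifier after (\\[| )."""
    return re.compile(
        r"(?P<author>[^-]+)"
        r"( - \[?(?P<series>[^-]+)(\[| )" + sep_quantifier +
        r"(?P<series_index>[0-9]+)\]?)?"
        r" - (?P<title>.+)"
    )

plus = build("+")   # the form quoted in this thread
star = build("*")   # the suggested variant

shapes = [
    "author - series 01 - title",
    "author - [series 01] - title",
    "author - series01 - title",
]

for name in shapes:
    for label, rx in (("+", plus), ("*", star)):
        m = rx.match(name)
        print(label, repr(name), m.groupdict() if m else None)

plain = plus.match("author - series 01 - title").groupdict()
bracketed = plus.match("author - [series 01] - title").groupdict()
glued_plus = plus.match("author - series01 - title").groupdict()
glued_star = star.match("author - series01 - title").groupdict()
plain_star = star.match("author - series 01 - title").groupdict()
```

With +, the glued "series01" form falls through to the title (series stays None). With *, the glued form does get a series, but it comes out as "series0" with index "1", and the ordinary "series 01" form is split as "series 0" and "1". It is also worth checking the captured series for leftover spaces or brackets, since the greedy [^-]+ can hold on to them.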

Quote:

Originally Posted by wladdy

- (?P<title>.+)
This part seems clear enough: after the last whitespace hyphen whitespace sequence, all characters are the title. However, since [^-] was used for the author and the series, why not use it here as well?

Because I want to eat up all the remaining characters, so when there IS a hyphen in the title, I will get it. I was working under the assumption that hyphens in the title are more probable than hyphens in the author name.
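
A small Python sketch of that trade-off, using a simplified two-field pattern (author and title only, which is an assumption made to keep the example short) and invented file names:

```python
import re

# Simplified for illustration: no series block, just author and title.
SIMPLE = re.compile(r"(?P<author>[^-]+) - (?P<title>.+)")

# Hyphen in the title: .+ takes everything after the first " - ".
hyphen_in_title = SIMPLE.match("Some Author - Catch-22")
print(hyphen_in_title.groupdict())
# {'author': 'Some Author', 'title': 'Catch-22'}

# Hyphen in the author name: [^-]+ cannot cross it, so there is no match at all.
hyphen_in_author = SIMPLE.match("Jean-Paul Sartre - Nausea")
print(hyphen_in_author)
# None
```

The second case is one of the books that would have to be fixed by hand.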

Quote:

Originally Posted by wladdy

Thanks!

You are welcome