MobileRead Forums - View Single Post

rogue_ronin · 01-18-2010, 10:56 PM

My first take is that the problem comes down to the (?P<author>) named group having to include the comma, because it has to span from the beginning to the end of the name. If there were separate groups for First and Last, you could exclude the comma.

Even with the comma placed in its own group via parentheses, there's no obvious way to replace it with nothing, or to filter it out of the match.
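For what it's worth, here is a minimal sketch of the separate-groups idea in plain Python rather than in Calibre. The group names "last" and "first" and the sample name are made up for illustration, and I'm not assuming Calibre would recognize those names. It only shows that when the comma sits between two named groups, neither group contains it:

```python
import re

# Hypothetical group names "last" and "first" (assumptions, not Calibre fields).
# The comma is matched between the two groups, so it belongs to neither.
pattern = re.compile(r"(?P<last>[^,]+),\s*(?P<first>[^,]+)")

m = pattern.match("Asimov, Isaac")  # made-up sample input

# Reassemble the name in "First Last" order without the comma.
author = f"{m.group('first')} {m.group('last')}"
```

The rejoining step happens outside the regex, which is exactly what a single author group cannot do on its own.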

Now, I'm no expert. Perhaps there's a tricky way to exclude part of a match from what gets returned.

I just took a look at the Calibre regex help -- I thought maybe this would do it:

Quote:

(?:...)
A non-grouping version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

I tried your more complicated regex in Calibre's test window, modifying it to "not retrieve" the comma:

Code:

^((?P<author>([^\-_0-9]+)(?:,)([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+)(\s*-\s*)?(?P<series_index>[0-9.]+)\s*-\s*)?(?P<title>[^\-_0-9]+)

but it returned the exact result your original regex did.

Now I realize what it means: in a more typical regex, the contents of such a group are not stored in the numbered variables for later reuse (i.e. \1 \2 \3 or $1 $2 $3, depending on your regex flavor). But because the comma sits inside a larger group, it is still part of the text that the larger group returns under the label <author>.
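A small sketch in plain Python's re module (not Calibre's test window) shows the same behavior. The pattern is a cut-down stand-in for the author part only, and the sample name is made up:

```python
import re

# Simplified stand-in for the author portion: the comma is wrapped in a
# non-capturing group (?:,) that is nested inside the named group.
pattern = re.compile(r"(?P<author>([^,]+)(?:,)\s*([^,]+))")

m = pattern.match("Asimov, Isaac")  # made-up sample input

# The named group returns its whole span, comma included.
author = m.group("author")

# (?:,) gets no numbered slot of its own: only the outer group and the
# two name pieces appear among the numbered groups.
numbered = m.groups()
```

So (?:...) only controls whether the comma gets its own numbered slot. It does not cut the comma out of any enclosing group.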

BTW, your original regex found the Author as "Last, First", not "First Last", in the test window, so I cannot comment on its effectiveness.

m a r