08-16-2010, 08:02 AM   #3
chaley
I can get close, but not perfect.

Testing using the file name you supplied
Code:
Star Wars - [Boba Fett 01] - The Fight to Survive (by Terry Bisson).pdf
and assuming that *all* the files have this format, even if they do not have series, then the regular expression
Code:
(?P<series>(.+?))(?P<series_index>\d+)\] - (?P<title>.+) \(by (?P<author>.+)\)
almost works. The problem comes from the '[' and ']' characters. The series name should be everything up to the two digits that precede the ']', with the '[' removed ('Star Wars - Boba Fett'). Unfortunately, there is no way (that I know of) to remove characters from inside a regular expression group, so you are stuck with series names like 'Star Wars - [Boba Fett'.
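If you want to check the expression outside calibre, here is a rough Python sketch that runs it against the sample file name. The helper name parse_filename is made up for this example, and stripping whitespace from each captured value is my assumption about how the result should be tidied up, not a statement about calibre's internals.

```python
import re

# The expression from the post, split over two lines for readability.
PATTERN = re.compile(
    r"(?P<series>(.+?))(?P<series_index>\d+)\] - "
    r"(?P<title>.+) \(by (?P<author>.+)\)"
)


def parse_filename(name):
    """Return the named groups found in a file name, or None if no match.

    Hypothetical helper for this sketch; whitespace around each value is
    stripped (an assumption made here for tidiness).
    """
    match = PATTERN.search(name)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


fields = parse_filename(
    "Star Wars - [Boba Fett 01] - The Fight to Survive (by Terry Bisson).pdf"
)
for key, value in fields.items():
    print(f"{key}: {value!r}")
```

The series group comes back as 'Star Wars - [Boba Fett', with the '[' still in it, which is the leftover character described above.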

I see two ways to deal with the extra '['. The first is to rename the files before importing them into calibre. I would generate a text file containing all the book names, then edit that file into a batch/shell script that renames the books, stripping away the leading '['.
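As an alternative to a hand-edited batch/shell script, the rename could be sketched in Python along these lines. This is only a sketch: the function names cleaned_name and rename_all are invented for the example, and it assumes every book sits directly in one folder and that only the first '[' in each name should go. The ']' is left alone because the regular expression still relies on it.

```python
from pathlib import Path


def cleaned_name(name):
    """Return the file name with its first '[' removed (']' is kept)."""
    return name.replace("[", "", 1)


def rename_all(folder, dry_run=True):
    """Rename every file in `folder` whose name contains a '['.

    Returns a list of (old_name, new_name) pairs. With dry_run=True
    nothing on disk is touched, so the plan can be reviewed first.
    """
    planned = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        target = path.with_name(cleaned_name(path.name))
        if target == path:
            continue  # no '[' in this name, nothing to do
        planned.append((path.name, target.name))
        if not dry_run:
            path.rename(target)
    return planned
```

Calling rename_all on the book folder with the default dry_run=True only lists what would change. Run it again with dry_run=False once the list looks right, and work on a copy of the files if you can.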

If the batch operation isn't something you want to try, I would run the import and then use the 'manage series' dialog, available from the tag browser, to manually remove the '[' character from each series. Correcting the series this way shouldn't take too long, around 1 to 2 seconds per series.

See the attached screenshot to see the regexp in action.
Attachment: re.jpg