MobileRead Forums - View Single Post

chaley · 08-16-2010, 08:02 AM

I can get close, but not perfect.

Testing using the file name you supplied

Code:

Star Wars - [Boba Fett 01] - The Fight to Survive (by Terry Bisson).pdf

and assuming that *all* the files have this format, even if they do not have series, then the regular expression

Code:

(?P<series>(.+?))(?P<series_index>\d+)\] - (?P<title>.+) \(by (?P<author>.+)\)

almost works. The problem comes from the '[' and ']' characters. The series name should be everything before the two digits that precede the ']', with the '[' removed ('Star Wars - Boba Fett'). Unfortunately, there is no way (that I know of) to remove characters from within a regular expression group, so you are stuck with series names like 'Star Wars - [Boba Fett'.
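To see what each group captures, here is a minimal sketch that runs the same pattern through Python's `re` module against the sample file name. This is only an approximation of what calibre does on import: the `parse` helper is a made-up name, and the explicit `strip()` is an assumption added because plain `re` leaves a trailing space on the series group.

```python
import re

# The pattern from the post, unchanged (split over two lines for readability).
PATTERN = re.compile(
    r"(?P<series>(.+?))(?P<series_index>\d+)\] - "
    r"(?P<title>.+) \(by (?P<author>.+)\)"
)

FILENAME = "Star Wars - [Boba Fett 01] - The Fight to Survive (by Terry Bisson).pdf"


def parse(name):
    """Hypothetical helper: return the named groups as a dict, or None on no match."""
    match = PATTERN.search(name)
    if match is None:
        return None
    # Assumption: surrounding whitespace is not wanted in any field.
    return {key: value.strip() for key, value in match.groupdict().items()}


fields = parse(FILENAME)
for key, value in fields.items():
    print(f"{key}: {value!r}")
```

The series group comes back as 'Star Wars - [Boba Fett', which is the leftover '[' described above. The other three groups (index, title, author) come out clean.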

I see two ways to deal with the extra '['. The first is to rename the files before importing them into calibre. I would generate a text file containing all the book names, then edit that file into a batch/shell script that renames the books, stripping away the leading '['.
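If editing a list of names by hand is too tedious, the rename could also be scripted. The following is a rough sketch in Python rather than a batch file; `cleaned_name` and `rename_books` are hypothetical names, and it assumes all the books sit in one folder and that only the first '[' in each name should go. It defaults to a dry run so the plan can be checked before anything is touched.

```python
from pathlib import Path


def cleaned_name(name):
    """Drop the first '[' from a file name; names without one come back unchanged."""
    return name.replace("[", "", 1)


def rename_books(folder, dry_run=True):
    """Rename every file in `folder` whose name contains '['.

    Returns a list of (old_name, new_name) pairs.
    With dry_run=True (the default) nothing on disk is changed.
    """
    plan = []
    for path in sorted(Path(folder).iterdir()):
        new_name = cleaned_name(path.name)
        if not path.is_file() or new_name == path.name:
            continue  # not a file, or nothing to strip
        target = path.with_name(new_name)
        if target.exists():
            continue  # skip rather than overwrite an existing file
        plan.append((path.name, new_name))
        if not dry_run:
            path.rename(target)
    return plan
```

Calling `rename_books("/path/to/books")` only lists the planned renames; pass `dry_run=False` to apply them. The ']' is left in place on purpose, since the regular expression above still needs it to find the end of the series index.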

If the batch operation isn't something you want to try, then I would run the import, then use the 'manage series' dialog available from the tag browser to manually remove the '[' character from each series. Manually correcting the series this way shouldn't take too long, around 1 to 2 seconds per series.

The attached screenshot shows the regexp in action.
