06-30-2013, 10:52 PM   #1
parkher
Evangelist
Posts: 467
Karma: 369018
Join Date: Nov 2010
Device: BL Alita/Mimas/Ares, OB Note2/Note, KA One/H2O/HD, S PRS T2/T1, PB 902
Importing books - getting all metadata

Sorry for a newbie question:

I am importing books into Calibre.

Most of them have metadata, but some do not.
Those are imported with Unknown author and Unknown title, and even their files are renamed to unknown.
(Yes, I know that a list of such books is shown.)


In the import settings I see this option:

- Read metadata from file contents rather than file name.

Unfortunately, a slightly different option is needed:

- Read metadata from the file name only when metadata from the file content is not available

BTW, this option would be even better:
- Take from the file name those metadata fields that are not present in the file content.

Or perhaps, if the metadata in the file content is often wrong or in the wrong fields, the opposite option would be useful: give priority to the file name, but take everything else from the file content.

But I can find no such settings.
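
To make the wished-for behaviour concrete, here is a minimal Python sketch of the merge logic. It is not a Calibre setting or Calibre API; `merge_metadata`, the `prefer` parameter, and the plain dicts are all hypothetical, and it assumes that a missing field shows up as an empty value or as "Unknown".

```python
def is_present(value):
    """Treat None, empty strings and 'Unknown' as missing (an assumption)."""
    return bool(value) and str(value).strip().lower() != "unknown"


def merge_metadata(from_content, from_filename, prefer="content"):
    """Merge two metadata dicts field by field.

    The preferred source wins for every field it actually has;
    the other source only fills in the gaps.
    """
    if prefer == "content":
        primary, fallback = from_content, from_filename
    else:
        primary, fallback = from_filename, from_content

    # Start from the fallback, then let the preferred source overwrite it.
    merged = {k: v for k, v in fallback.items() if is_present(v)}
    merged.update({k: v for k, v in primary.items() if is_present(v)})
    return merged


content = {"title": "Unknown", "author": "Some Author", "publisher": None}
filename = {"title": "Some Title", "author": "S. Author", "series": "Some Series"}

print(merge_metadata(content, filename))                     # content has priority
print(merge_metadata(content, filename, prefer="filename"))  # file name has priority
```

With `prefer="content"` the title and series come from the file name because the file content does not have them; with `prefer="filename"` it is the other way round.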


Anyway, I tried to get the metadata from the file names.
Unfortunately, the ready-made regular expressions do not even cover the most popular e-book file naming convention.

So I made an improved regular expression; maybe it will be useful for others too:

(?P<author>[^_-]+) -?\s*(\[|\()?\s*(?P<series>[^_0-9\[\(-]*)(?P<series_index>[0-9]*)\s*(\]|\))?\s*-\s*(?P<title>[^_\[\(]+)\s*\[?\s*(?P<publisher>[^\]]+)\s* ?
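
For anyone who wants to try the expression outside Calibre, here is a small sketch using Python's `re` module. The helper `parse_name` and the sample file name are made up for illustration, and it assumes the file extension is stripped before matching.

```python
import re
from pathlib import Path

# The expression from this post, split over three lines only for readability.
PATTERN = re.compile(
    r"(?P<author>[^_-]+) -?\s*(\[|\()?\s*(?P<series>[^_0-9\[\(-]*)"
    r"(?P<series_index>[0-9]*)\s*(\]|\))?\s*-\s*(?P<title>[^_\[\(]+)"
    r"\s*\[?\s*(?P<publisher>[^\]]+)\s* ?"
)


def parse_name(filename):
    """Return the named groups for a file name, with whitespace trimmed."""
    stem = Path(filename).stem  # assumption: match without the extension
    match = PATTERN.search(stem)
    if match is None:
        return {}
    return {key: (value or "").strip() for key, value in match.groupdict().items()}


parsed = parse_name("Some Author - [Some Series 01] - Some Title (epub).rar")
for field, value in parsed.items():
    print(f"{field}: {value!r}")
```

In plain Python, this sample gives author `Some Author`, series `Some Series`, series_index `01` and title `Some Title`. Note that the trailing `(epub)` tag ends up in the `publisher` group here, because that group takes whatever follows the title.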

It correctly handles most books that follow the naming convention, and even tolerates quite a lot of variation.
But there are cases that it does not cover yet:

1. when a range is given in a series:

Some Author - [Some Series 01-05] - Some Title Boxed Set (epub).rar

I don't know what to do here: a series number can only be a single number.
Perhaps leave the number empty and try to move 01-05 to the end of the title?
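
As a rough illustration of that idea, here is a Python sketch that would run as a separate step on the text found inside the brackets. It is not something a single Calibre import expression does; `split_series` and both patterns are hypothetical.

```python
import re

# Hypothetical patterns for the text inside [ ] or ( ).
RANGE_RE = re.compile(r"^(?P<name>.+?)\s+(?P<range>[0-9]+\s*-\s*[0-9]+)$")
SINGLE_RE = re.compile(r"^(?P<name>.+?)\s+(?P<index>[0-9]+)$")


def split_series(bracket_text, title):
    """Return (series, series_index, title).

    A range such as 01-05 is not stored as the index;
    it is appended to the title instead.
    """
    text = bracket_text.strip()
    ranged = RANGE_RE.match(text)
    if ranged:
        return ranged.group("name"), "", f"{title} {ranged.group('range')}"
    single = SINGLE_RE.match(text)
    if single:
        return single.group("name"), single.group("index"), title
    return text, "", title  # no number found at all


print(split_series("Some Series 01-05", "Some Title Boxed Set"))
print(split_series("Some Series 01", "Some Title"))
```

The first call leaves the series index empty and returns the title `Some Title Boxed Set 01-05`; the second keeps `01` as the index and leaves the title alone.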

2. when the name of a series begins with a number (this is a fake example):

Angel Devil - [666 Devil Street 01] - Devil Moves In (mobi).mobi

That is a serious complication.
A possible solution is to stop supporting this format:

author - series nr - title
and to support only these:

author - [series nr] - title
author - (series nr) - title

That would probably make it easy to allow a series name that starts with a number.
And besides, the naming convention calls for the [series name] in brackets anyway.
Still, the expression above covers almost everything else.
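
A bracket-only variant could look roughly like the sketch below. It is an untested-in-Calibre assumption, not a replacement for the expression above: it only matches names that have a bracketed series ending in a number, so names without a series would need a separate fallback. `BRACKETED` is a hypothetical name.

```python
import re

# Hypothetical variant: the series must be inside [ ] or ( ),
# and the last number before the closing bracket is the index.
BRACKETED = re.compile(
    r"(?P<author>[^_-]+?)\s*-\s*[\[(]\s*(?P<series>[^\])]+?)\s+"
    r"(?P<series_index>[0-9]+)\s*[\])]\s*-\s*(?P<title>[^_\[(]+)"
)

name = "Angel Devil - [666 Devil Street 01] - Devil Moves In (mobi)"
match = BRACKETED.match(name)
fields = {key: value.strip() for key, value in match.groupdict().items()}
print(fields)
```

Because the series group no longer excludes digits, `666 Devil Street` is kept as the series name and `01` becomes the index.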

Any publisher info or publisher series numbers in square brackets are moved to the Publisher field.
That way they are at least saved somewhere.
Again, a completely fake example:

Reeva Shmadams - The Smashing Dreamcrasher [HT-429, NHK-824] (v1.1) (epub).rar

So, in this example, the result is:

Publisher: HT-429, NHK-824

It could be changed to use a different field, but which one?
Because these are usually the publisher's series with their numbers, Publisher is perhaps OK.
The point is just to preserve this info somewhere.
Besides, sometimes there is only the publisher's name in [], and then Publisher fits perfectly.

Anyway, maybe somebody would like to improve this regular expression further, or could suggest what to do in those complicated cases.
Or maybe somebody can find other unsupported cases.

Any suggestions are welcome.

Last edited by parkher; 07-01-2013 at 09:31 AM.