metadata download grinder

repudi8or · 05-04-2013, 08:14 AM

Hi Folks,

Calibre is great, but I find the thing I do most is cleaning up my metadata manually after bulk importing new files.

The "download metadata and covers" action gives me varied results. I would say that after the first pass I still have 50% remaining with no metadata, mostly for the following reasons:
1. author and title backwards
2. title has series in it as well
3. author and title backwards and series is hyphenated after author
4. author and title backwards and author is in some variant of url syntax
5. author name is a slight variant on that found in amazon/google
6. title field has author, title and some other crud (like file type) hyphenated in the one string

I am wondering if there is any existing plugin that will grind away at these variants until it finds a metadata match on amazon/google (etc) ??

If not, I would be willing to have a go at creating this... I have reasonable python skills and basic java skills.

My idea would be to parse the author and title fields as cleverly as I could, then create a search matrix based on the most common reasons (as above), and then just grind away with the amazon and google plugins until a likely match was found. I did think that generating a replacement suggestion report requiring user approval before updating the db might be a good idea. Maybe just store the isbn of each grind-match, pull those into the db after user approval, and then use the normal "download metadata and covers" by isbn to complete the job.
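A rough sketch of what the "search matrix" might look like in plain Python, with no calibre calls at all. The names `clean`, `parts` and `candidates` and the `JUNK` token list are made up for illustration; the idea is only to turn one messy title/author pair into an ordered list of guesses to try.

```python
import re
from itertools import product

# Hypothetical list of file-type crud; extend it to match your own library.
JUNK = re.compile(r"\b(epub|mobi|pdf|azw3?)\b", re.IGNORECASE)


def clean(text):
    """Turn url-style separators into spaces and drop file-type crud."""
    text = re.sub(r"%20|[_+.]", " ", text)
    text = JUNK.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip(" -")


def parts(text):
    """Split a cleaned, hyphenated string into its non-empty pieces."""
    pieces = (p.strip() for p in clean(text).split(" - "))
    return [p for p in pieces if p]


def candidates(title, author):
    """Yield (title, author) guesses, most literal first, without repeats."""
    # Start with the fields as given, then the straight swap.
    guesses = [(clean(title), clean(author)), (clean(author), clean(title))]
    # Then every pairing of the hyphen-separated pieces from both fields.
    pool = parts(title) + parts(author)
    guesses += [(t, a) for t, a in product(pool, pool) if t != a]
    seen = set()
    for guess in guesses:
        if guess[0] and guess not in seen:
            seen.add(guess)
            yield guess
```

Calling `list(candidates("Frank_Herbert_-_Dune.epub", ""))` would include the pair `("Dune", "Frank Herbert")` among its guesses. Note that `clean` also turns dots in initials into spaces, which is probably fine for searching but not for writing back to the db.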

Some hints on the best way to approach this (i.e. a sensible code hook point, somewhere between "download metadata" and the amazon and google plugin calls) would help.
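Until the right hook point is known, the grind loop itself can be sketched with the metadata source left abstract. In the sketch below `lookup` is an assumed stand-in for whatever ends up calling the amazon/google plugins, and `grind`, `make_queries` and the tuple layout are hypothetical names, not calibre API. The result is only a set of suggestions for the user to approve, nothing is written to the db.

```python
def grind(books, make_queries, lookup):
    """Return {book_id: isbn} suggestions for the user to approve.

    books        -- iterable of (book_id, title, author) tuples
    make_queries -- callable(title, author) -> iterable of (title, author)
    lookup       -- callable(title, author) -> ISBN string, or None if no match
    """
    suggestions = {}
    for book_id, title, author in books:
        for q_title, q_author in make_queries(title, author):
            isbn = lookup(q_title, q_author)
            if isbn:
                suggestions[book_id] = isbn
                break  # stop at the first likely match for this book
    return suggestions


# Usage with a fake in-memory "catalogue" standing in for amazon/google.
# The ISBN below is a made-up placeholder, not a real identifier.
catalogue = {("Dune", "Frank Herbert"): "0000000000000"}


def swap_only(title, author):
    """Simplest possible query generator: as given, then swapped."""
    return [(title, author), (author, title)]


report = grind(
    [(1, "Frank Herbert", "Dune"), (2, "Nothing", "Nobody")],
    swap_only,
    lambda t, a: catalogue.get((t, a)),
)
```

Keeping `lookup` as a plain callable means the loop can be tested offline first and wired to the real plugins later.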

all thoughts welcome

Regards Rep

DoctorOhh · 05-04-2013, 08:34 AM

Quote:

Originally Posted by repudi8or

Calibre is great, but I find the thing I do most is cleaning up my metadata manually after bulk importing new files.

The "download metadata and covers" action gives me varied results. I would say that after the first pass I still have 50% remaining with no metadata, mostly for the following reasons:
1. author and title backwards
2. title has series in it as well
3. author and title backwards and series is hyphenated after author
4. author and title backwards and author is in some variant of url syntax
5. author name is a slight variant on that found in amazon/google
6. title field has author, title and some other crud (like file type) hyphenated in the one string

I have no hints or suggestions, but the things you list need to be addressed at the add books phase and there are quite a few tools that exist in calibre to help get the adding books phase correct.

With data as messed up as you say it is after importing books, you can't expect any download metadata plugin to correct these gross errors.

I'll be glad to discuss specifics about adding books techniques but that is a separate discussion outside of the Development forum. Since this forum is for development I will step aside and let folks that can help you in that area take the lead.

Update: There is an extract ISBN plugin that might be of help at the metadata download phase.

Good Luck.

domxch · 05-14-2013, 08:36 AM

Hi Rep,

I'm in pretty much exactly the same position.

I have a large library with some messy formats, so it's difficult to parse these into Author and Title in a single pass.

My concept is as follows:-
1. I need to identify those ebooks that successfully downloaded metadata from e.g. Google or Amazon. In my opinion this is the only way to decide whether the metadata is valid: it has to come from a reputable 3rd party. Even if an ebook comes with metadata, we cannot be sure that metadata is good, as it may have been e.g. incorrectly entered manually.
2. Ebooks that haven't been verified as above are the ones that need to be cleaned and verified. The first step is easy: attempt to download metadata via the normal method. If successful, these books can be excluded from further processing as in 1.
3. Run some iterative cleanup process on the titles/authors. This would include having a "Verified Authors" list: authors attached to books that have had metadata downloaded successfully. We can therefore be confident these are valid names. This list could then be used in the cleanup process to check e.g. whether Title/Author should be swapped because the title matches a VerifiedAuthor. Various other logic can be implemented here to get the data into the best state for another attempt to download metadata.
4. Download metadata for the "cleaned" books.
5. For any failures, clean the metadata manually or apply some further cleanup scripts, then run again...
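As an illustration of the swap check in step 3, here is a minimal sketch in plain Python. `normalise` and `fix_swapped` are hypothetical helper names, and the verified list is assumed to be a simple collection of author strings taken from books whose download succeeded.

```python
import re


def normalise(name):
    """Fold case, punctuation and 'Last, First' order into one comparable key."""
    if "," in name:
        last, _, first = name.partition(",")
        name = f"{first} {last}"
    return " ".join(re.findall(r"\w+", name.lower()))


def fix_swapped(title, author, verified_authors):
    """Swap the fields only when the title, not the author, is a known author."""
    known = {normalise(a) for a in verified_authors}
    if normalise(title) in known and normalise(author) not in known:
        return author, title
    return title, author


verified = ["Frank Herbert"]
fixed = fix_swapped("Herbert, Frank", "Dune", verified)
```

Here `fixed` comes back as `("Dune", "Herbert, Frank")`, while a record that is already the right way round is returned unchanged. Requiring that the author field is not itself a verified name avoids swapping books whose title happens to be a person's name.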

All of this can be achieved by writing a plugin except for Step 1, where we need to detect whether a successful metadata download was made for each ebook. For this I raised an enhancement suggestion to add 2 new fields to the ebook table: a timestamp and a status for a metadata download request. Unfortunately Kovid wasn't keen on the idea, as there are some workarounds currently possible using virtual libraries and moving books between libraries once the metadata is clean, but I'm still keen to pursue my idea above in a branch of the code.

My problem is that while I've got quite a long programming background, I'm completely new to Python. I'm currently trying to get a Python Calibre project set up under PyDev in Eclipse.

If you're interested in collaborating on this, let me know...

cheers

Dom

repudi8or · 05-15-2013, 06:13 AM

Hi Dom, sure, I'd be happy to work with you on this... Unless you are an Eclipse addict, you might want to look at PyScripter instead of PyDev. I find it much easier to use. I will PM you my email address.

Regards Rep



