Old 05-04-2013, 08:14 AM   #1
repudi8or
Junior Member
repudi8or began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Aug 2010
Device: kobo
metadata download grinder

Hi Folks,

Calibre is great, but I find the thing I do most is cleaning up my metadata manually after bulk importing new files.

The "Download metadata and covers" action gives me mixed results. After the first pass, roughly 50% of my books still have no metadata, mostly for the following reasons:
1. author and title backwards
2. title has series in it as well
3. author and title backwards and series is hyphenated after author
4. author and title backwards and author is in some variant of URL syntax
5. author name is a slight variant on that found in Amazon/Google
6. title field has author, title and some other crud (like file type) hyphenated in the one string

I am wondering if there is any existing plugin that will grind away at these variants until it finds a metadata match on Amazon/Google (etc.)?

If not, I would be willing to have a go at creating this. I have reasonable Python skills and basic Java skills.

My idea would be to parse the author and title fields as cleverly as I can, build a search matrix from the most common failure modes (as above), and then grind away with the Amazon and Google plugins until a likely match is found. It might be a good idea to generate a report of suggested replacements that requires user approval before the db is updated. One option would be to store just the ISBN of each grind-match, pull it into the db after user approval, and then use the normal "download metadata and covers" by ISBN to complete the job.
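
To make the "search matrix" idea concrete, here is a minimal sketch in plain Python. It only generates candidate (title, author) pairs from the messy fields; the actual lookup against Amazon/Google is left out. The names `JUNK`, `clean` and `candidates` are hypothetical helpers made up for illustration, not part of calibre.

```python
import re

# Hypothetical helpers for illustration only; none of this is calibre API.
# File-type crud such as ".epub" or "(mobi)" that ends up in the title field.
JUNK = re.compile(r"[(\[]?\.?\b(?:epub|mobi|pdf|azw3|azw)\b[)\]]?", re.IGNORECASE)


def clean(text):
    """Strip file-type crud and URL-style separators, collapse whitespace."""
    text = JUNK.sub(" ", text)
    text = re.sub(r"%20|[_+]", " ", text)
    return re.sub(r"\s+", " ", text).strip(" -")


def candidates(title, author):
    """Return distinct (title, author) guesses, in the order they should be tried."""
    title, author = clean(title), clean(author)
    guesses = [(title, author), (author, title)]  # as given, then swapped
    for field in (title, author):
        # A hyphenated string may hold author, series and title in any order.
        parts = [p.strip() for p in re.split(r"\s+-\s+", field) if p.strip()]
        if len(parts) > 1:
            for i, t in enumerate(parts):
                for j, a in enumerate(parts):
                    if i != j:
                        guesses.append((t, a))
            guesses.extend((p, "") for p in parts)  # title-only searches
    seen, ordered = set(), []
    for guess in guesses:
        if guess[0] and guess not in seen:
            seen.add(guess)
            ordered.append(guess)
    return ordered


for guess in candidates("Stephen_King - The Gunslinger.epub", "Unknown"):
    print(guess)
```

Each pair would then be handed, in order, to whatever does the metadata lookup until one returns a likely match. That step is deliberately not shown, since it depends on how the plugin hooks into the existing download code.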

Some hints on the best way to approach this (i.e. a sensible code hook point between "download metadata" and the Amazon and Google plugin calls) would help.

All thoughts welcome.

Regards Rep
Old 05-04-2013, 08:34 AM   #2
DoctorOhh
US Navy, Retired
DoctorOhh ought to be getting tired of karma fortunes by now.
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by repudi8or
Calibre is great, but I find the thing I do most is cleaning up my metadata manually after bulk importing new files.

The "Download metadata and covers" action gives me mixed results. After the first pass, roughly 50% of my books still have no metadata, mostly for the following reasons:
1. author and title backwards
2. title has series in it as well
3. author and title backwards and series is hyphenated after author
4. author and title backwards and author is in some variant of URL syntax
5. author name is a slight variant on that found in Amazon/Google
6. title field has author, title and some other crud (like file type) hyphenated in the one string

I have no hints or suggestions, but the things you list need to be addressed at the add books phase, and there are quite a few tools in calibre to help get that phase right.

With data as messed up as you describe after importing, you can't expect any download metadata plugin to correct these gross errors.

I'll be glad to discuss specific techniques for adding books, but that is a separate discussion outside the Development forum. Since this forum is for development, I will step aside and let folks who can help you in that area take the lead.

Update:
There is an Extract ISBN plugin that might be of help at the metadata download phase.

Good Luck.

Last edited by DoctorOhh; 05-04-2013 at 08:37 AM.
Old 05-14-2013, 08:36 AM   #3
domxch
Junior Member
domxch began at the beginning.
 
Posts: 1
Karma: 10
Join Date: May 2013
Device: Android
Like to collaborate

Hi Rep,

I'm in pretty much exactly the same position.

I have a large library with some messy formats, so it's difficult to parse these into author and title in a single pass.

My concept is as follows:
1. I need to identify those ebooks that successfully downloaded metadata from e.g. Google or Amazon. In my opinion this is the only way to decide whether the metadata is valid: it has to come from a reputable 3rd party. Even if an e-book comes with metadata, we cannot be sure that metadata is good, as it may e.g. have been entered incorrectly by hand.
2. The ebooks that haven't been verified as above are the ones that need to be cleaned and verified. The first step is easy: attempt to download metadata via the normal method. If successful, these books can be excluded from further processing, as in 1.
3. Run some iterative cleanup process on the titles/authors. This would include a "Verified Authors" list: the authors attached to books that have had metadata downloaded successfully. We can therefore be confident these are valid names. The list could then be used in the cleanup process, e.g. to check whether Title/Author should be swapped because the title matches a Verified Author. Various other logic can be implemented here to get the data into the best state for another attempt to download metadata.
4. Download metadata for the "cleaned" books.
5. Any failures then need their metadata cleaned manually, or some further cleanup scripts applied, before running again...
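
As a rough illustration of step 3, the "Verified Authors" check could look something like the sketch below. It is plain Python with no calibre calls. The function names (`normalize`, `maybe_swap`, `closest_verified`) and the 0.85 similarity cutoff are assumptions chosen for the example, and the fuzzy match uses the standard library's `difflib`, which is just one possible way to catch slight name variants.

```python
import difflib

# Hypothetical helpers for illustration only; none of this is calibre API.


def normalize(name):
    """Lower-case a name and turn 'Last, First' into 'first last'."""
    name = name.strip()
    if "," in name:
        last, _, first = name.partition(",")
        name = f"{first.strip()} {last.strip()}"
    return " ".join(name.lower().split())


def maybe_swap(title, author, verified_authors):
    """Swap title and author if the title field holds a verified author name."""
    known = {normalize(a) for a in verified_authors}
    if normalize(title) in known and normalize(author) not in known:
        return author, title
    return title, author


def closest_verified(author, verified_authors, cutoff=0.85):
    """Return the verified author closest to a slightly misspelled name, or None."""
    by_norm = {normalize(a): a for a in verified_authors}
    hits = difflib.get_close_matches(normalize(author), list(by_norm), n=1, cutoff=cutoff)
    return by_norm[hits[0]] if hits else None


verified = ["Stephen King", "Terry Pratchett"]
print(maybe_swap("King, Stephen", "The Gunslinger", verified))
print(closest_verified("Stephen Kng", verified))
```

The cutoff is a trade-off: set it too low and distinct authors get merged, too high and minor typos are missed, so any suggested change would probably still want user approval before being written back.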

All of this can be achieved by writing a plugin, except for Step 1, where we need to detect whether a successful metadata download was made for each ebook. For this I raised an enhancement suggestion to add 2 new fields to the ebook table: a timestamp and a status for a metadata download request. Unfortunately Kovid wasn't keen on the idea, as there are workarounds currently possible using virtual libraries and moving books between libraries once the metadata is clean, but I'm still keen to pursue my idea above in a branch of the code.

My problem is that while I've got quite a long programming background, I'm completely new to Python. I'm currently trying to get a Python Calibre project set up under PyDev in Eclipse.

If you've got interest to collaborate on this let me know...

cheers

Dom
Old 05-15-2013, 06:13 AM   #4
repudi8or
Junior Member
repudi8or began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Aug 2010
Device: kobo
Hi Dom, sure, I'd be happy to work with you on this... You might want to look at PyScripter instead of PyDev, though, unless you are an Eclipse addict. I find it much easier to use. I will PM you my email address.

Regards Rep
All times are GMT -4.