Old 05-14-2013, 08:36 AM   #3
domxch
Like to collaborate

Hi Rep,

I'm in pretty much exactly the same position.

I have a large library with some messy formats, so it's difficult to parse these into authors and titles in a single pass.

My concept is as follows:
1. I need to identify those ebooks that successfully downloaded metadata from a third party such as Google or Amazon. In my opinion, the only way to decide whether metadata is valid is to check that it comes from a reputable third party. Even if an ebook comes with metadata, we cannot be sure it is good, as it may have been entered incorrectly by hand.
2. Ebooks that haven't been verified as above are the ones that need to be cleaned and verified. The first step is easy: attempt to download metadata via the normal method. If that succeeds, these books can be excluded from further processing, as in 1.
3. Run some iterative cleanup process on the titles/authors. This would include a "Verified Authors" list: authors attached to books that have had metadata downloaded successfully, so we can be confident these are valid names. The list could then be used in the cleanup process, e.g. to check whether Title and Author should be swapped because the title matches a verified author. Various other logic can be implemented here to get the data into the best state for another attempt to download metadata.
4. Download metadata for the "cleaned" books.
5. Any failures then need their metadata cleaned manually, or some further cleanup scripts applied, before running again...
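
To make the swap check in step 3 a bit more concrete, here is a minimal Python sketch. The names normalize and clean_record are made up for illustration, the matching rule (case-insensitive, whitespace-collapsed exact match) is just one assumption about how it could work, and nothing here touches calibre's actual API.

```python
def normalize(name):
    """Lower-case and collapse whitespace so comparisons ignore formatting."""
    return " ".join(name.lower().split())


def clean_record(title, author, verified_authors):
    """Return (title, author), swapped when the fields look transposed.

    Hypothetical rule: swap only if the title matches a verified author
    and the author field does not, so already-correct records are left alone.
    """
    verified = {normalize(a) for a in verified_authors}
    title_is_author = normalize(title) in verified
    author_is_author = normalize(author) in verified
    if title_is_author and not author_is_author:
        return author, title
    return title, author
```

For example, clean_record("Terry Pratchett", "Mort", ["Terry Pratchett"]) would hand back the two fields the other way round, while a record whose author already matches the list comes back unchanged.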

All of this can be achieved by writing a plugin, except for Step 1, where we need to detect whether a successful metadata download was made for each ebook. For this I raised an enhancement suggestion to add two new fields to the ebook table: a timestamp and a status for a metadata download request. Unfortunately Kovid wasn't keen on the idea, as there are some workarounds currently possible using virtual libraries and moving books between libraries once the metadata is clean, but I'm still keen to pursue my idea above in a branch of the code.
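
Roughly what I have in mind for the two fields, as a small Python sketch kept outside the database. DownloadRecord, record_attempt, needs_cleanup and the status values "never_tried", "ok" and "failed" are all hypothetical names of my own; this is not calibre's schema, just an in-memory stand-in keyed by book id.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DownloadRecord:
    # Hypothetical status values: "never_tried", "ok", "failed".
    status: str = "never_tried"
    # When the last download request was made; None if never attempted.
    timestamp: Optional[datetime] = None


def record_attempt(records, book_id, succeeded, when=None):
    """Store the outcome and time of a metadata download request."""
    when = when or datetime.now(timezone.utc)
    records[book_id] = DownloadRecord("ok" if succeeded else "failed", when)


def needs_cleanup(records, book_ids):
    """Return the ids whose last download did not succeed or was never tried."""
    return [b for b in book_ids
            if records.get(b, DownloadRecord()).status != "ok"]
```

With something like this, the books returned by needs_cleanup would be the ones fed into the cleanup pass, and everything marked "ok" would be skipped.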

My problem is that while I've got quite a long programming background, I'm completely new to Python. I'm currently trying to get a Python Calibre project set up under PyDev in Eclipse.

If you're interested in collaborating on this, let me know...

cheers

Dom