MobileRead Forums - View Single Post

kiwidude · 02-07-2011, 02:27 AM

Quote:

Originally Posted by Calliastra

Starson, I really appreciate you taking the time to help!! I have a similar problem and am not sure how to tackle it. I started out with a large number of ebooks, probably about 12-15K. I imported them into Calibre and now I have almost 40K and loads of duplicates. The problem with going down the list and deleting the one or two (or more!) extras is that the DB is really bogging down. I am a very new user, but am a programmer/software tester so I understand the lingo. Can you give me a short set of instructions and then perhaps I can techwrite them into a more complete help item? From what I've googled up, it looks like this is a common question.

I will butt in since this is a topic of great interest to me currently. Firstly, have you read the Duplicate Detection thread in this forum? That discusses some changes and additions to Calibre we are in the process of making. Feedback on that thread as to what sounds useful or not is always welcome (particularly as the plugin which will "find" duplicates has not been written yet and there are a few ways we can approach it).

As to "instructions", from a Calibre perspective Starson has given you what you need to do if you decide to try that approach. You just need to be aware of the implications:
- It will only find duplicates where the authors exactly match. There is no "fuzzy matching" on authors.
- You really have very little control over which version will be kept if you have duplicates of a format. As Starson says above, it is done by order of "selection", but if you are doing a bulk library all at once that "selection order" won't mean much. You could maybe sort by date or something, but unless you investigate each book one by one you won't know which version to keep, and it could be pot luck. And if you were doing it one by one, controlling selections, you wouldn't need Starson's approach and would just use Merge instead.

Quote:

Part of what I am wondering if it would be worth organizing the books properly (author, title, series) or downloading metadata or any other prework that one could do that would make the duplicate matching process more effective or streamlined.

There are a few other threads in the forum, if you look around, on approaches people have taken. At the moment I have my own tool outside of Calibre that does fuzzy matches of authors and/or titles, doing direct SQL queries against the Calibre database. Other people have their own tools/scripts, some of which were made available. Hopefully we will have a Calibre plugin soon (I've offered to write it but anyone is welcome to beat me to it), but we need to make decisions about it before I start and that discussion should be kept to the other thread.
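To give a rough idea of what "fuzzy matching against the database" can look like, here is a minimal Python sketch. It is not the tool mentioned above, just an illustration. It assumes the library's metadata.db is a SQLite file with an `authors` table that has a `name` column (check your own copy before relying on that), and the function names and the 0.85 threshold are made up for the example.

```python
import sqlite3
from difflib import SequenceMatcher
from itertools import combinations


def load_author_names(conn):
    """Read author names from an open SQLite connection.

    Assumption: the database has an `authors` table with a `name` column.
    """
    return [row[0] for row in conn.execute("SELECT name FROM authors")]


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_pairs(names, threshold=0.85):
    """Return (name_a, name_b, score) for every pair scoring at or above threshold.

    The threshold is an arbitrary starting point; tune it on your own data.
    """
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs
```

You would open a copy of the database with `sqlite3.connect(...)`, pass the connection to `load_author_names`, and feed the result to `fuzzy_pairs`. Comparing every pair gets slow on a very large authors list, and the output is only a list of candidates to review by eye, not something to merge blindly.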

Certainly the 1.0 version may "only" have the same comparison logic as Starson's automerge functionality: exact match on author, fuzzy on title. In that case, in terms of cleanup preparation, getting any author dups sorted is going to greatly increase the success of any dup search on top. If you don't want to resort to SQL, just use the tag browser on the left to look down your authors list; with its alphabetical sorting you can hopefully spot a lot of the common issues like typos, initials, spacings, abbreviations of names etc. Stuff like "E.E.Doc Smith", "E. E. 'Doc' Smith", "E. E. Smith" etc. Rename the "wrong" author variations and get them down to one.
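If the authors list is too long to eyeball, a small script can flag the spacing and punctuation variants for you. The sketch below is only an illustration with made-up function names: it reduces each name to a key with case, spaces and punctuation stripped, then reports keys shared by more than one spelling.

```python
import re
from collections import defaultdict


def author_key(name):
    """Collapse a name to lowercase letters/digits only (no spaces or punctuation)."""
    return re.sub(r"[\W_]+", "", name.casefold())


def group_variants(names):
    """Map each key to its distinct spellings, keeping only keys with 2+ spellings."""
    groups = defaultdict(set)
    for name in names:
        groups[author_key(name)].add(name)
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}


variants = group_variants(["E.E.Doc Smith", "E. E. 'Doc' Smith", "E. E. Smith"])
# "E.E.Doc Smith" and "E. E. 'Doc' Smith" share the key "eedocsmith";
# "E. E. Smith" becomes "eesmith" and is NOT grouped with them.
```

Note the limitation: this only catches differences in spacing, case and punctuation. A variant that drops a nickname or spells out an initial still has to be spotted by hand or with a fuzzy comparison.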

Then at least if you decide to try Starson's described method above (not caring for instance about which EPUB to keep if you have two of them) you are in the best position to do so.

That's my 2p for what it's worth.