MobileRead Forums - View Single Post

Starson17 · 02-01-2011, 01:36 PM

Quote:

Originally Posted by kiwidude

Plan B would be to do it in a popup window as part of a GUI plugin.

I haven't had time to look at GUI plugins, so without any familiarity with them, I'd have planned to build a dialog, like the Fetch Metadata popup dialog where the results of all the searches are combined for the user to select.

Quote:

The advantage is that you could happily add columns and right-clicks all related to just the task at hand (resolving duplicates)

Exactly.

Quote:

safely encapsulated within a plugin that Kovid doesn't have to worry about

Dialog window or GUI plugin - I haven't enough experience with the latter to know if one is better or not. I find plugins to be sort of a pain to find and install.

Quote:

I would presume you must already be doing what to me is the "hard part" of using the Calibre model/db to identify duplicates for a given book.

Yes. It's just an SQL query.
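To give a feel for how small such a query can be, here is a minimal sketch. It is not the actual query used in Calibre; the in-memory `books` table with a single `title` column and the helper name `find_duplicate_titles` are assumptions made for illustration only.

```python
import sqlite3


def find_duplicate_titles(conn):
    """Return (lowercased title, count) for each title that occurs more than once."""
    cursor = conn.execute(
        "SELECT lower(title), COUNT(*) "
        "FROM books "
        "GROUP BY lower(title) "
        "HAVING COUNT(*) > 1 "
        "ORDER BY lower(title)"
    )
    return cursor.fetchall()


# Hypothetical stand-in for a library database (assumption, not Calibre's schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO books (title) VALUES (?)",
    [("Dune",), ("dune",), ("Emma",)],
)

dupes = find_duplicate_titles(conn)
```

A real match would probably also compare authors and tolerate small title differences; this sketch only does a case-insensitive title match.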

Quote:

So presumably rather than iterating over a collection of "adding" books you instead iterate over "all" books.

Yes.

Quote:

Could be very slow

It seems fast enough, even on libraries of more than 15K books.

Quote:

the next step could be to "loosen the reins" of that automerge option by adding the three sub-options I proposed and hence allowing the duplicate rows to be created when formats are duplicated.

I think you have the order wrong. Automerge is easier to play with than duplicate detection. In automerge, you have one book at a time being considered. Currently, it just checks if the automerge option is on, then does the automerge thing for each book, checking to see if there are any near dupes.

You could just as easily check one of three options stored near the automerge option and handle all incoming books according to it (ignore, overwrite, or add as a new dupe record). Alternatively, you could ask that question for each book, preferably with an option to apply the choice to all the remaining books. It's not too hard, as each book is being handled individually.
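The per-book handling described here could be sketched roughly as follows. This is not Calibre's code: the `DupeAction` enum, the `add_book` function, and the list-of-dicts library are all hypothetical names chosen for illustration, and "near dupe" is reduced to a case-insensitive title match.

```python
from enum import Enum


class DupeAction(Enum):
    """The three choices for an incoming format that already exists (hypothetical names)."""
    IGNORE = "ignore"
    OVERWRITE = "overwrite"
    NEW_RECORD = "new_record"


def add_book(library, title, fmt, data, action):
    """Add one incoming book to `library`, a list of {"title", "formats"} dicts."""
    match = next(
        (book for book in library if book["title"].lower() == title.lower()),
        None,
    )
    if match is None:
        # No near dupe found: add a normal new record.
        library.append({"title": title, "formats": {fmt: data}})
        return "added"
    if fmt not in match["formats"]:
        # Near dupe found, but this format is new: plain automerge.
        match["formats"][fmt] = data
        return "merged"
    # Near dupe found and the format is duplicated: apply the chosen option.
    if action is DupeAction.IGNORE:
        return "ignored"
    if action is DupeAction.OVERWRITE:
        match["formats"][fmt] = data
        return "overwritten"
    library.append({"title": title, "formats": {fmt: data}})
    return "duplicated"
```

Because each incoming book is handled on its own, asking the user per book would only mean choosing `action` inside the loop instead of reading it once from the stored option.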

Duplicate detection seems to me to be the harder case. All books are compared against all other books. You have to make groups of duplicates.

You may have three copies of book 1, two copies of book 2, and four copies of book 3, but one of the four copies of book 3 isn't really a dupe and needs to be excluded from the merge, etc. I suppose you could do duplicate detection the same way, individually checking each book against the entire dataset, but that would be comparable to adding the entire library to itself, and that does take a lot of time.
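One way to build the groups without checking every book against every other book is to bucket books by a normalized key in a single pass. This is a different technique from the per-book check described above, offered only as a sketch: the `(id, title, author)` tuples and the names `group_duplicates` and `exclude_from_group` are assumptions, and exact matching on a normalized key will miss fuzzier near dupes.

```python
from collections import defaultdict


def group_duplicates(books):
    """Group book ids sharing a normalized (title, author) key.

    `books` is an iterable of (book_id, title, author) tuples.
    Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for book_id, title, author in books:
        key = (title.strip().lower(), author.strip().lower())
        groups[key].append(book_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def exclude_from_group(group, book_id):
    """Return a copy of `group` without a book the user says is not really a dupe."""
    return [b for b in group if b != book_id]


books = [
    (1, "Dune", "Frank Herbert"),
    (2, "dune ", "frank herbert"),
    (3, "Emma", "Jane Austen"),
    (4, "DUNE", "Frank Herbert"),
]
groups = group_duplicates(books)
```

The groups are only candidates; the user would still review each one and drop false matches (for example with `exclude_from_group`) before anything is merged.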