Duplicate Detection - Page 3

kiwidude · 01-28-2011, 10:05 PM

Actually just had another thought on the "scope" of the duplicates search. Presumably you could have an option allowing you to choose "books added today", "this week", "this month", "all books" and use that as your "start set" for comparison, rather than comparing every book in your database against every other book every time...
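A minimal sketch of that "start set" idea, for illustration only. The scope names, the `added` field and the book dictionaries are assumptions made up for this example, not calibre's data model, and "today" is approximated as the last 24 hours.

```python
from datetime import datetime, timedelta

# Hypothetical scope names mapped to how far back the start set reaches.
SCOPES = {
    "today": timedelta(days=1),       # approximated as the last 24 hours
    "this week": timedelta(days=7),
    "this month": timedelta(days=30),
    "all books": None,                # no cutoff: every book is in the start set
}

def start_set(books, scope, now):
    """Return the books whose date added falls inside the chosen scope."""
    window = SCOPES[scope]
    if window is None:
        return list(books)
    cutoff = now - window
    return [b for b in books if b["added"] >= cutoff]

def candidate_pairs(books, scope, now):
    """Pair each book in the start set with every other book in the library."""
    pairs = []
    seen = set()
    for a in start_set(books, scope, now):
        for b in books:
            if a["id"] == b["id"]:
                continue
            key = frozenset((a["id"], b["id"]))
            if key in seen:
                continue  # the same pair was already produced in the other order
            seen.add(key)
            pairs.append((a["id"], b["id"]))
    return pairs

now = datetime(2011, 1, 28, 22, 5)
books = [
    {"id": 1, "added": now - timedelta(days=40)},
    {"id": 2, "added": now - timedelta(days=3)},
    {"id": 3, "added": now - timedelta(hours=1)},
]
```

With a small start set the number of comparisons grows with (recent books x library size) rather than with the square of the library size.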

theducks · 01-28-2011, 10:55 PM

I worry about a title "Only" matching. I have 2 or 3 'duplicate titles' that Are Not (different authors, different books).
Then there is the case of different 'Editions' of a book, when it changes publisher and gets an edit job.

I prefer an 'always ask' option (toss, make new entry, merge), with the issues held in a queue so as not to interrupt the rest of the batch and presented to the user near the end (like the current problem status, only allowing you to browse the library before marking what to do).

DoctorOhh · 01-28-2011, 11:27 PM

Quote:

Originally Posted by theducks

I worry about a title "Only" matching. I have 2 or 3 'duplicate titles' that Are Not (different authors, different books).

No need to worry. IIRC the "title only" match has been a default of calibre for years, but if a match is found it pops up a dialog and asks you if you want to add it anyway. Of course if you have the auto merge option checked, "title only" matching is not an option and you'll never see the pop-up dialog I referred to.

What they're talking about here is well past the "title only" matching level.

theducks · 01-29-2011, 08:25 AM

Quote:

Originally Posted by dwanthny

What they're talking about here is well past the "title only" matching level.

TNX

tponzo · 01-30-2011, 10:56 AM

Quote:

Originally Posted by cybmole

ps isn't having over 30k books a tad extreme.

reading at 3 complete books per day, t'would take you over 30 years to read them all, and that's assuming you have no interest in reading anything that gets published in the next 30 years !

seems as daft as those folks who insist on putting 3000+ books onto their new Kindles then bitch about battery life & primitive collection management facilities - why would any sensible person do that...

now I will confess to having several thousand MP3 files but those songs have all been listened to, once at least. Several thousand books makes far less sense ???

My physical library consists of several thousand books. I've been reading for over 40 years and purchasing books for over 30. It's one of the reasons I purchased an e-reader. Carrying all those books around was ruining my back.

My goal is to eventually have every one of those books plus whatever I buy new on my reader so 30,000 does not seem unreasonable to me.

Starson17 · 02-01-2011, 01:36 PM

Quote:

Originally Posted by kiwidude

Plan B would be to do it in a popup window as part of a GUI plugin.

I haven't had time to look at GUI plugins, so without any familiarity with them, I'd have planned to build a dialog, like the Fetch Metadata popup dialog where the results of all the searches are combined for the user to select.

Quote:

The advantage is that you could happily add columns and right-clicks all related to just the task at hand (resolving duplicates)

Exactly.

Quote:

safely encapsulated within a plugin that Kovid doesn't have to worry about

Dialog window or GUI plugin - I haven't enough experience with the latter to know if one is better or not. I find plugins to be sort of a pain to find and install.

Quote:

I would presume you must already be doing what to me is the "hard part" of using the Calibre model/db to identify duplicates for a given book.

Yes. It's just an SQL query.
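For illustration only, a query of that kind might look like the sketch below. The table and column names are assumptions for a toy in-memory database; this is not calibre's actual schema, nor the query Starson17 refers to.

```python
import sqlite3

# Toy in-memory table; the schema is an assumption for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO books (title, author) VALUES (?, ?)",
    [
        ("Dune", "Frank Herbert"),
        ("dune", "Frank Herbert"),
        ("Emma", "Jane Austen"),
    ],
)

def find_duplicates(conn):
    """Return (title, author, copies) for each title/author pair stored more than once."""
    query = """
        SELECT lower(title), lower(author), COUNT(*)
        FROM books
        GROUP BY lower(title), lower(author)
        HAVING COUNT(*) > 1
        ORDER BY 1, 2
    """
    return conn.execute(query).fetchall()

rows = find_duplicates(conn)
```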

Quote:

So presumably rather than iterating over a collection of "adding" books you instead iterate over "all" books.

Yes.

Quote:

Could be very slow

It seems fast enough, even on libraries of more than 15K books.

Quote:

the next step could be to "loosen the reins" of that automerge option by adding the three sub-options I proposed and hence allowing the duplicate rows to be created when formats are duplicated.

I think you have the order wrong. Automerge is easier to play with than duplicate detection. In automerge, you have one book at a time being considered. Currently, it just checks if the automerge option is on, then does the automerge thing for each book, checking to see if there are any near dupes.

You could just as easily check one of three options stored near the automerge option, and handle all incoming books according to that option (ignore, overwrite, or add as new dupe record) or you can present that question for each book (preferably with an option to do the selected thing for all the rest of the books). It's not too hard, as each book is being handled individually.

Duplicate detection seems to me to be the harder case. All books are compared against all other books. You have to make groups of duplicates.

You may have 3 copies of book 1, two copies of book 2, 4 copies of book 3, but one of the 4 copies of book 3 isn't really a dupe and needs to be excluded from the merge, etc. I suppose you could do duplicate detection the same way - individually check each book against the entire dataset, but that would be comparable to adding the entire library to itself - that does take a lot of time.
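One possible way to build such groups without comparing every pair is to bucket books by a comparison key in a single pass. The sketch below is only an illustration under assumed data: the book dictionaries, the exact-match key and the `not_duplicates` set (ids the user has marked as not really dupes) are all hypothetical.

```python
from collections import defaultdict

def group_duplicates(books, not_duplicates=frozenset()):
    """Bucket books by (title, author) and return the buckets holding 2+ ids."""
    groups = defaultdict(list)
    for book in books:
        if book["id"] in not_duplicates:
            continue  # the user excluded this copy from any merge
        key = (book["title"].strip().lower(), book["author"].strip().lower())
        groups[key].append(book["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# 3 copies of book 1, 2 copies of book 2, 4 copies of book 3.
books = (
    [{"id": i, "title": "Book 1", "author": "A"} for i in (1, 2, 3)]
    + [{"id": i, "title": "Book 2", "author": "B"} for i in (4, 5)]
    + [{"id": i, "title": "Book 3", "author": "C"} for i in (6, 7, 8, 9)]
)

# Copy 9 of book 3 "isn't really a dupe", so it is left out of the groups.
groups = group_duplicates(books, not_duplicates={9})
```

This only finds exact matches on the chosen key; anything fuzzier needs a looser key or a second comparison step inside each bucket.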

kiwidude · 02-01-2011, 04:20 PM

Quote:

Originally Posted by Starson17

Dialog window or GUI plugin - I haven't enough experience with the latter to know if one is better or not. I find plugins to be sort of a pain to find and install.

It wasn't so much a "dialog window or plugin" choice as "popup dialog window or library view" one. What I had in mind was a right-click action called something like "Find duplicates" - which means it must be implemented as a plugin. Of course if Kovid liked the plugin enough eventually he might include it with Calibre which removes any find/install issues you allude to.

Personally I think a popup dialog window will be the way to go, to focus the dialog on the task at hand, with custom colouring to indicate the groups of duplicate books, a few columns more useful to duplicate resolution etc. However my point was that doing that will mean a lot of functionality users may take for granted on the library view (such as customisable column displays, right-clicks for other actions etc) will not be available, initially at least.

Quote:

It seems fast enough, even on libraries of more than 15K books.

Sorry I meant that comparing every book in the entire library as a possible duplicate will be slow, which you agreed to later in the post, not that comparing one book at a time was. As I put in a later post rather than comparing "all books" all of the time, the user could be prompted to compare only a subset such as those added today, this week, month etc. Once they do an initial "all books" cleanup it could be done incrementally.

Quote:

I think you have the order wrong. Automerge is easier to play with than duplicate detection. In automerge, you have one book at a time being considered. Currently, it just checks if the automerge option is on, then does the automerge thing for each book, checking to see if there are any near dupes.

In terms of the "order" I was thinking about the find duplicates plugin as "first" for a number of reasons.
(1) If people wanted it (and Kovid etc. were too busy with other things) I could develop it completely independently of any changes to Calibre source, unlike the changes automerge would require.
(2) There will be many users out there who have never found or intentionally not used the automerge option and have a library with duplicates they want help with identifying
(3) Once (if) the automerge suboptions get added and a user chooses the "duplicate format" suboption, they will be creating duplicates and not have a tool to help them identify them.

Of course if you and Kovid happened to like the proposal enough to implement the automerge changes so they appeared in Calibre first, that would be just marvellous. As you say those changes are far less work to implement.

Quote:

You could just as easily check one of three options stored near the automerge option, and handle all incoming books according to that option (ignore, overwrite, or add as new dupe record) or you can present that question for each book (preferably with an option to do the selected thing for all the rest of the books). It's not too hard, as each book is being handled individually.

Totally agree, that is what I would like to see. If automerge is off, you get prompted with a dialog per book with the three options and an ability to "apply to all". If automerge is on, it silently applies whatever suboption you specified in preferences.

Quote:

Duplicate detection seems to me to be the harder case. All books are compared against all other books. You have to make groups of duplicates.

You may have 3 copies of book 1, two copies of book 2, 4 copies of book 3, but one of the 4 copies of book 3 isn't really a dupe and needs to be excluded from the merge, etc. I suppose you could do duplicate detection the same way - individually check each book against the entire dataset, but that would be comparable to adding the entire library to itself - that does take a lot of time.

Agree again it is the harder case. It will be a fair bit of development work.

And quite frankly if it is just you and me showing any interest in the idea here it won't be very high in my priority list to implement it. I would love more people to comment on whether they think it is a flawed/bad idea, or they would love to see it in Calibre. I won't be offended if they think it's a rubbish idea - on the contrary it would save me many hours of wasted effort.

There is always "another way" - but today with Calibre your only choice for ensuring you don't accidentally throw away a better format of a book when adding is to either have automerge off (with various issues that creates) or intentionally give it a different name (requiring you to "know" it was a duplicate first).

kovidgoyal · 02-01-2011, 04:37 PM

Just so you know, my current development priorities are unlikely to include working on automerge/duplicates development, so don't wait for me.

And I vote for a separate dialog for duplicate detection, but my vote is not a veto for doing it in the book list, I just think it will be cleaner to code and have more functionality in a separate dialog.

kacir · 02-01-2011, 04:44 PM

Quote:

Originally Posted by kiwidude

And quite frankly if it is just you and me showing any interest in the idea here it won't be very high in my priority list to implement it. I would love more people to comment on whether they think it is a flawed/bad idea, or they would love to see it in Calibre. I won't be offended if they think it's a rubbish idea - on the contrary it would save me many hours of wasted effort.

I think it is an absolutely fantastic idea.

I can imagine that an initial search of each book against all other books would take forever and a bit, but with large libraries users could do this in batches (marking all already checked ebooks) or let it run overnight.

The real challenge would be inventing a user interface that offers the user groups of identified duplicates and lets the user accept or reject merging.

Also "fuzziness" of the search would have to be carefully balanced so it finds duplicates where author name and title differ somewhat (e.g. Stephen_King_-_Pet_cemetery_The vs. King_s._-_The_pet_cemetery) and yet it doesn't come up with too many false positives.
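A rough sketch of one way to get that kind of fuzziness, using the two file names from the post above. Everything here is an assumption for illustration: the `_-_` separator between author and title, the article list and the rule that a single letter matches any name starting with it. It is not how calibre matches books.

```python
import re

ARTICLES = {"the", "a", "an"}

def tokens(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def title_key(title):
    """Drop articles so 'Pet cemetery The' and 'The pet cemetery' agree."""
    return " ".join(t for t in tokens(title) if t not in ARTICLES)

def parts_match(x, y):
    """Two name parts match if equal, or if one is an initial of the other."""
    if x == y:
        return True
    return (len(x) == 1 and y.startswith(x)) or (len(y) == 1 and x.startswith(y))

def same_author(a, b):
    """True if every part of the shorter name matches a distinct part of the other."""
    short, long_ = sorted((tokens(a), tokens(b)), key=len)
    remaining = list(long_)
    for part in short:
        match = next((c for c in remaining if parts_match(part, c)), None)
        if match is None:
            return False
        remaining.remove(match)
    return True

def looks_like_duplicate(name1, name2):
    """Compare two 'Author_-_Title' style names."""
    author1, _, title1 = name1.partition("_-_")
    author2, _, title2 = name2.partition("_-_")
    return title_key(title1) == title_key(title2) and same_author(author1, author2)
```

Loosening any of these rules (for example ignoring word order in titles) would catch more variants at the cost of more false positives, which is exactly the balance described above.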

kiwidude · 02-01-2011, 07:22 PM

Quote:

Originally Posted by kovidgoyal

Just so you know, my current development priorities are unlikely to include working on automerge/duplicates development, so don't wait for me.

Great to know, thanks Kovid. One of my fears was that you had a cunning masterplan on all this that would make the changes redundant shortly afterwards or not worth merging.

Quote:

And I vote for a separate dialog for duplicate detection, but my vote is not a veto for doing it in the book list, I just think it will be cleaner to code and have more functionality in a separate dialog.

Agreed.

If you have any further thoughts on what you would/would not want to see on this feel free to drop me an email or PM here if not on the thread. I'll be looking for further comments and feedback before I start coding anything anyways.

Starson17 · 02-03-2011, 07:57 AM

Quote:

Originally Posted by kovidgoyal

Just so you know, my current development priorities are unlikely to include working on automerge/duplicates development, so don't wait for me.

I'm going to write some new automerge code. I looked at it, and it should be easy to give a basic global user selection automerge setting - a pulldown list, instead of the current on/off automerge option, will allow a choice of one of three basic modes - Kovid's legacy duplicates code (checks only title), the current default automerge (ignore dupe formats) and the way I originally wrote it (overwrite existing with new dupe formats).

Any comments?

kovidgoyal · 02-03-2011, 10:03 AM

Fine by me

kiwidude · 02-03-2011, 03:54 PM

Quote:

Originally Posted by Starson17

I'm going to write some new automerge code. I looked at it, and it should be easy to give a basic global user selection automerge setting - a pulldown list, instead of the current on/off automerge option, will allow a choice of one of three basic modes - Kovid's legacy duplicates code (checks only title), the current default automerge (ignore dupe formats) and the way I originally wrote it (overwrite existing with new dupe formats).

Any comments?

ummm.. I have a few questions...

(1) Kovid's legacy code is an interactive prompted option - which I think it has to be if you are only matching on title. Personally I would never use it due to all the false positives from not comparing authors, but fair enough if others find it useful. However my question is: are you saying it will be an "automerge" option to automatically merge on title, or an "automerge" option to not actually automerge and instead be interactively prompted?

(2) So am I right in saying your list will *not* (as yet at least) include the option that sparked this thread and several others of creating a duplicate book entry for when a duplicate format is encountered, but merge formats where they are missing?

Starson17 · 02-03-2011, 04:54 PM

Quote:

Originally Posted by kiwidude

ummm.. I have a few questions...

OK

Quote:

(1) Kovid's legacy code is an interactive prompted option - which I think it has to be if you are only matching on title. Personally I would never use it due to all the false positives from not comparing authors, but fair enough if others find it useful. However my question is: are you saying it will be an "automerge" option to automatically merge on title, or an "automerge" option to not actually automerge and instead be interactively prompted?

I decided to leave the current option that switches between Kovid's legacy code and a newer form of automerge. Legacy code is untouched and remains the default. I've added a combo choice box of Ignore/Overwrite/New Record for globally dealing with a duplicate incoming format if automerge is on. Everything works as it did before, except you now have global control for what should be done with duplicate incoming formats. You can have them all overwrite or all be added as a new record or all ignored. It only applies to duplicate formats. Non-duplicate formats are added to existing records.
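The behaviour described above can be sketched roughly as follows. This is a toy model, not the calibre code: the `library` dictionary (book id mapped to its formats) and the `add_format` helper are hypothetical, and only the three mode names come from the post.

```python
def add_format(library, book_id, fmt, data, mode):
    """Apply one incoming format to a matched record.

    Returns the id of the record that received the format, or None if ignored.
    """
    formats = library[book_id]
    if fmt not in formats:
        # Non-duplicate formats are always added to the existing record.
        formats[fmt] = data
        return book_id
    if mode == "ignore":
        return None  # keep the existing format, drop the incoming one
    if mode == "overwrite":
        formats[fmt] = data
        return book_id
    if mode == "new record":
        new_id = max(library) + 1
        library[new_id] = {fmt: data}  # duplicate format becomes its own record
        return new_id
    raise ValueError("unknown automerge mode: %r" % (mode,))

library = {1: {"EPUB": "old"}}
```

Because each incoming book is handled on its own, the mode is just a global switch consulted once per duplicate format.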

(The interface is done, and overwrite and ignore are done - I've still got some work to do on New Record creation for duplicate records.)

Quote:

(2) So am I right in saying your list will *not* (as yet at least) include the option that sparked this thread and several others of creating a duplicate book entry for when a duplicate format is encountered, but merge formats where they are missing?

I'm not sure I understand that last part - but perhaps it's clear from the above. It doesn't offer individual book by book control, but does (will) create duplicate records.

Starson17 · 02-03-2011, 05:01 PM

Quote:

Originally Posted by kovidgoyal

Fine by me

I'm amazed you can follow all these threads and still write code!

A question: If I write the automerge option box this way:

Code:

choices = [(_('Ignore'), 'ignore'), (_('Overwrite'), 'overwrite'),
           (_('New Record'), 'new record')]
r('automerge', gprefs, choices=choices)

Will non-English language users be able to change the options, and will this code:

Code:

if gprefs['automerge'] == 'overwrite':

work correctly?
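A standalone sketch of the distinction the question turns on. It assumes that only the first element of each pair is translated and displayed, and that the second element is what gets stored; that is an assumption about how the choices are handled, not something confirmed in this thread. The `_` stand-in, its German strings and the `stored_value` helper are all hypothetical.

```python
def _(text):
    """Stand-in for the translation function, with a made-up German catalogue."""
    catalogue = {
        "Ignore": "Ignorieren",
        "Overwrite": "Überschreiben",
        "New Record": "Neuer Eintrag",
    }
    return catalogue.get(text, text)

choices = [(_('Ignore'), 'ignore'), (_('Overwrite'), 'overwrite'),
           (_('New Record'), 'new record')]

def stored_value(choices, selected_label):
    """Return the untranslated value paired with the label the user picked."""
    for label, value in choices:
        if label == selected_label:
            return value
    raise KeyError(selected_label)

# The user sees and picks the translated label; the preference keeps the value.
prefs = {'automerge': stored_value(choices, 'Überschreiben')}
```

Under that assumption the comparison against 'overwrite' does not depend on the interface language, because the translated text never reaches the stored preference.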

