Old 04-10-2011, 02:12 PM   #1
kiwidude
Calibre Plugins Developer
Duplicate detection plugin

Call me a masochist, but I can't resist the challenge of getting something up and running for this. I've tried to say no, but I can't. It is a disease.

We had a bunch of really good discussions on the Duplicate Detection thread - Starson17 & Kovid implemented the automerge changes, which occupied the first four pages, then the posts more relevant to the plugin began in earnest with some good stuff being thrown around. I considered continuing this post on that thread, but it is clear it is now all about plugin design decisions and has no place in the Library Management forum.

So having reviewed the thread, I have had a few thoughts.
  • Re-using the library view is a no-brainer for me now - let's keep the plugin as simple as possible and let the user retain all the custom columns, sorting, right-clicks etc. they have today to decide how to resolve a merge.
  • There was much discussion about reviewing results by "duplicate set" versus "by book that has duplicates". There seemed to be pros and cons to both. The immediate issue I have with the latter is how you can visually indicate in the GUI which book is the "root" that you are considering, since you cannot rely on sorting. The only way I could think of would be a search restriction on that subset of books, then using highlighting to mark the root book in the results. But that could be messy for the user to exit out of. Unless you have other ideas I will start off with just set-based result reviewing for now and we can add the other later if desired.
  • The plugin will display one duplicate "result" at a time (be it a set or book based), and via a right-click/keyboard shortcut move to the next result.
  • When a user chooses the "Find duplicates" option they get a popup dialog letting them choose the algorithm to apply (assuming more than one available eventually). Eventually they may also get the option of whether to review the results by set or by book and other "stuff" like automerging.
  • As the user can merge results as they go, the "next result" navigation will need to be fairly dynamic. I think the duplicate result sets should be held in memory by the plugin, rather than persisted into a custom column with a group number. We now have the "marked" feature to be able to display books matching a criterion. So when a user moves to the next result, I can grab the next set from memory, check whether any of the books have been removed (as a result of a merge or delete), and if the set still has more than one book then "mark" the books and display them in the GUI (see the sketch after this list). Simples.
  • Taking this approach will mean that if they restart Calibre they must run their dup search again to rebuild the in-memory result sets. However, as we anticipate this being fairly quick, I don't foresee any issues with that.
  • I like Starson17's suggestion of a custom tags column that stores some sort of alg01-bookid1-bookid2 combination when a user right-clicks to indicate a false positive. Presumably you would only need to store this on the book with the lowest id if the dup finding algorithm also sorts by id. There is the issue of what happens if the user subsequently merges that book into bookid4 - I think in that situation the false positive record should become invalid? Without this being completely integrated into the merge code of Calibre, I think the dup checking should just take responsibility for clearing out such "dud" false positive entries when it finds them.
  • To begin with I will only run one algorithm at a time, not multiple at once. So a user would run the "exact author, fuzzy title" algorithm (i.e. the automerge logic) and have pretty high confidence that they will be able to merge the results found. Then after reviewing their results they would run the "fuzzy author, fuzzy title" algorithm or whatever, knowing that they may get a lot of results which will require far more careful identification of false positives. Mixing the results together, particularly if we do a book-based review, could get horribly confusing.
  • Due to all the complexities of choosing "who wins" I don't want to actually perform the merge or offer that as an option at this point - it is up to the user to go through each set and do the merge right-clicks, move to the next set etc. Maybe down the road there might be some kind of "automerge" suboption based on choices like "oldest is the master" or whatever, but I will avoid this initially at least.
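To make the in-memory / "marked" approach above a bit more concrete, here is a rough sketch of what the "next result" handling might look like. This is illustrative only - names like show_next_group and self.groups are made up, and I'm assuming the db.all_ids() / db.set_marked_ids() calls, the gui search box and status bar, and a marked:true search are the right way to drive the marked feature:
Code:
# Sketch only - assumes self.groups is the list of duplicate sets (sets of
# book ids) built by the last search, and self.current_idx tracks position.
def show_next_group(self):
    db = self.gui.current_db
    all_ids = set(db.all_ids())
    while self.current_idx + 1 < len(self.groups):
        self.current_idx += 1
        # Drop any books deleted or merged away since the search was run
        group = self.groups[self.current_idx] & all_ids
        if len(group) > 1:
            db.set_marked_ids(group)
            self.gui.search.set_search_string('marked:true')
            return
    self.gui.status_bar.show_message('No more duplicate groups to review')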

You guys have any further thoughts on this before I take the plunge?
Old 04-11-2011, 04:08 AM   #2
kiwidude
Calibre Plugins Developer
A couple more thoughts on the false positive stuff:
  • Why is it that it was thought necessary to include the algorithm when storing a pair as a false positive match? Surely if I say that books 1 and 2 are not duplicates, then that should be it as far as that pair are concerned. If they are found by a Fuzzy Title/Exact Author search, then the same pair will come up again when I do a Fuzzy Title/Fuzzy Author search. So I would have to exclude them again, and again with every new algorithm, which rather defeats the point of indicating a false positive in the first place. The only case that I can think of, which perhaps Starson17 mentioned, is if you added an "ISBN match" algorithm and it found that 1 and 2 had the same ISBN. This may indicate a problem with your metadata, but as this plugin is about resolving duplicates and you have already said books 1 and 2 are not duplicates, they still should not appear imho. I actually think "ISBN match" should be a feature in the "Quality Check" plugin, which would help you identify and fix a 1/2 ISBN match after you had indicated via false positives that they should not appear as duplicates.
  • If we allow book based review, you could get into a mess with false positives. So if (1,2) and (1,3) are found, you review book 1, which displays (1,2,3). The user, out of confusion, could select (2,3) and say "Mark as not duplicates". However of course (2,3) was never considered as a duplicate in the first place, and if they run the search again without resolving (1,2) or (1,3) then 2 & 3 are still going to be displayed on screen in a book based review. Perhaps the "Mark as not duplicates" menu item would need some validation over being enabled, to make sure that the selection only covered a set the plugin identified as a duplicate. If the root (1 in this case) was visually highlighted then the user would have a better visual cue that they should only mark a pair as not duplicates if that visually highlighted root item is one of them. Again, if doing set based review this issue is not relevant.
Old 04-11-2011, 08:22 AM   #3
chaley
Grand Sorcerer
not duplicates

We must first decide what it means to say two books are not duplicates. My position is that books marked as not duplicates are not duplicates, independent of what algorithm is used. Note that the 'not duplicate' relationship is not transitive: if books 1 & 2 are not duplicates and books 2 & 3 are not duplicates, we can say nothing about whether books 1 and 3 are duplicates.

I think that the natural implementation is to have pairs of books that are known to be 'not duplicates'. When sets of duplicates are built, they must be partitioned so that pairs of 'not duplicates' are never in the same set. Using the above example, if the algorithm produces a set [1, 2, 3], then partitioning will produce a set [1, 3]. If we have a duplicate set [1, 2, 3, 4] with not-duplicate [1, 2] and [3, 4], then the result will be four sets [1, 3], [1, 4], [2, 3], [2, 4]. The algorithm I posted before does this partitioning.
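A brute-force sketch of that partitioning (illustrative only, not necessarily the exact algorithm I posted before) is to enumerate the subsets of a candidate group that contain no 'not duplicate' pair and keep the maximal ones. Here nd_map maps a book id to the set of ids it must never share a group with:
Code:
def partition(group, nd_map):
    # Returns the maximal subsets of 'group' (a candidate duplicate set)
    # containing no pair the user marked as 'not duplicates'. Exponential
    # in the worst case, but duplicate groups are small in practice.
    results = []
    def extend(candidate, remaining):
        if not remaining:
            if len(candidate) > 1:
                results.append(candidate)
            return
        book, rest = remaining[0], remaining[1:]
        if not (nd_map.get(book, set()) & set(candidate)):
            extend(candidate + [book], rest)   # include book if compatible
        extend(candidate, rest)                # also try leaving it out
    extend([], sorted(group))
    # Keep only the maximal subsets
    return [s for s in results if not any(set(s) < set(t) for t in results)]
For the [1, 2, 3, 4] example above with not-duplicate pairs [1, 2] and [3, 4], this returns [1, 3], [1, 4], [2, 3], [2, 4].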

I am not convinced that a custom column is the best place to keep the 'not duplicate' pairs. First, the data would be duplicated in both books, creating a maintenance problem (change one means the other must change). Second, the data must be 'multiple' because a book can participate in multiple pairs. I think I am saying that the plugin must accept a selection of books that are not dups and write that information to its own storage.

As for whether or not to use a persistent column for the results: I tend toward persistence. I see three advantages of using a custom column. 1) I can process a group, then delete the group marker to indicate I am done. Running the algorithm again could find these books again, unless I am meticulous about marking not-dups. 2) I can do the work over time, knowing where I stopped. 3) I can select all the groups that a given book is in, giving a books view.

For this last one, the plugin could help by writing the transitive group information into a custom column. This would solve the sorting problem because the values written could be sortable: the base book could be #1, with the rest of the books being some other number. The problem you raise regarding books 2 and 3 'not duplicates' is also real, but I think it is a consequence of using book mode. Book-based browsing says something about a relationship between the base book and any other book, but nothing about the relationship between books that are not the base.

Have fun!
Old 04-11-2011, 09:42 AM   #4
kiwidude
Calibre Plugins Developer
Hey Charles, thanks for the replies.

I don't think we differ at all on how the data should be partitioned, or on the view that a set-based approach to reviewing the results has a number of advantages over book-based in terms of simplicity of the UI etc.

Where to store the duplicate pair exemptions is an interesting one. Sounds like we are agreed that the algorithm is not relevant to the pairing. The maintenance aspect is a point I touched on in a slightly different fashion - I don't believe you would need to store it on both books if you process in book id order, but there is certainly an issue when one of the books in the pair gets merged or deleted. Now that we have a library uuid it is more feasible to store library-specific ids in a config file, so I'll add that option to the mix.
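Just to sketch what I mean by that (illustrative only - I'm assuming calibre's JSONConfig helper and the db.library_id uuid property, and the file/function/key names here are made up, not real plugin code):
Code:
from calibre.utils.config import JSONConfig
# One JSON file for the plugin, with exemption pairs stored per library uuid
plugin_prefs = JSONConfig('plugins/find_duplicates')
def save_exemptions(db, pairs):
    # pairs: list of (book_id, book_id) tuples the user marked as not duplicates
    library_config = plugin_prefs.get(db.library_id, {})
    library_config['exemptions'] = [list(pair) for pair in pairs]
    plugin_prefs[db.library_id] = library_config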

I'm not sure I understand all your persistence reasons, as surely similar justification for not using a custom column to store exemptions applies to storing groups? Take for instance the example of (1,2) and (1,3) being found. We display (1,2). Now if the user goes "next group" without doing anything, we will want to display (1,3). If the user merges book 2 into book 1 before going next group, we will still want to display (1,3). However if they merge book 1 into book 2, then the second group (1,3) is invalidated. If that information has been persisted into a custom column, the plugin potentially will be doing a lot of repeated work - finding the next group, validating all the ids, clearing the persisted column if the group is not valid, moving to the next group, etc. I would have thought that might be a little easier to do when not persisted in the database?

For point (3) of viewing groups a book is in, you can still do that. You would always have to rely on some kind of in-memory mapping the plugin kept of groups and their members. So I don't see that reason as a differentiator - in fact it could well be easier just using in-memory results because you don't have to re-read all the custom group ids and re-build the mappings as you would if relying on the persisted values?

I think the point (2) of doing the resolving across Calibre sessions is absolutely valid, though I think it comes down to the workflow again and how long it takes to run the duplicate search. For me I either resolve the pair in some way or else it will come up again the next time I run the search. Whether I do that now or the next time I open Calibre (and trigger another search) to me doesn't matter, that pair still needs to be resolved in some way. But of course that's just my opinion.

I think at this point I will still go for the "marked" memory based approach and see how well it works. Converting it to use persisted custom columns at a later point should only involve adding to the logic I would have thought, not having to bin everything. If I also store the false positives in the config file then that means I can avoid custom columns completely. I would like to avoid them if I can, as it just avoids the issues of column visibility, deletions, attempted sorting, naming conflicts, users fiddling with values etc.

It is a good point about using a custom column for transitive based sorting. Though unless you forced that column visible on the view, one stray click by the user and the sorting is lost; they would likely have to do "next group, previous group" to get it restored. However, at least for the first cut, since transitive groups are not being displayed I think I can avoid this issue for now anyway.
Old 04-11-2011, 10:17 AM   #5
chaley
Grand Sorcerer
Quote:
Originally Posted by kiwidude
I'm not sure I understand all your persistence reasons, as surely similar justification for not using a custom column to store exemptions applies to storing groups? Take for instance the example of (1,2) and (1,3) being found. We display (1,2). Now if the user goes "next group" without doing anything, we will want to display (1,3). If the user merges book 2 into book 1 before going next group, we will still want to display (1,3). However if they merge book 1 into book 2, then the (1,3) 2nd group is invalidated. If that information has been persisted into a custom column, the plugin potentially will be doing a lot of repeated querying to find a next group, validate all the ids, clearing the persisted column if group is not valid, move to next group etc. I would have thought that might be a little easier to do when not persisted in the database?
I assume that one stores the group ID in the custom column, not pair relationships. I also assume that if one book is merged into another, then one of those two books is deleted.

If both of these assumptions are true, then your examples don't create a problem. In the first case (2 -> 1), book 1 will still have all its groups, so next_group will show (1,3). If 1 is merged into 2, then 1 no longer exists and next_group will show only book 3.
Quote:
For point (3) of viewing groups a book is in, you can still do that. You would always have to rely on some kind of in-memory mapping the plugin kept of groups and their members. So I don't see that reason as a differentiator - in fact it could well be easier just using in-memory results because you don't have to re-read all the custom group ids and re-build the mappings as you would if relying on the persisted values?
I think we must have a different mental model of what is written in the custom column. I am assuming that the CC is a multiple column that is filled in with the group ID of each group the book is a member of. I don't see why I need to rebuild anything. If I search for group 34 (which I can do from the tag browser), I get all the books in group 34. I can resolve duplicates manually by removing a book from a group. The plugin could help with that, if it knows what group I am working on.

I also assume that telling the plugin that two books are not duplicates does not rerun the partitioning algorithm. The new non-dup info would be used next time I run it.
Quote:
I think the point (2) of doing the resolving across Calibre sessions is absolutely valid, though I think it comes down to the workflow again and how long it takes to run the duplicate search. For me I either resolve the pair in some way or else it will come up again the next time I run the search.
I agree
Quote:
Whether I do that now or the next time I open Calibre (and trigger another search) to me doesn't matter, that pair still needs to be resolved in some way.
I can easily imagine running the check, scanning the results over time, and merging/fixing any problems without doing any other resolution. Next time I run the check (which could be in months), I see the same results. If I get annoyed enough, I would go to the trouble of indicating that books in a group are not in fact duplicates.
Quote:
It is a good point about using a custom column for transitive based sorting. Though unless you forced that column visible on the view, one stray click by the user and the sorting is lost; they would likely have to do "next group, previous group" to get it restored. However, at least for the first cut, since transitive groups are not being displayed I think I can avoid this issue for now anyway.
I assumed the column was visible.
Old 04-11-2011, 10:35 AM   #6
kiwidude
Calibre Plugins Developer
Hey Charles, no we are on the same page when it comes to what is stored in the CC.

What I thought would not be useful as a user though is to get shown a duplicate group that had only one entry in it when I go "next result". So just displaying book 3 by itself wouldn't give me a high level of confidence the plugin was "working right", even though we know the reasons why there is now just one book in the group. It is that extra processing behind the scenes to invalidate that group etc. that was my concern with the CC versus the in-memory approach. For instance it is going to be impossible to give the user an updated number of how many more "real" duplicate groups there are, unless you go through and invalidate them every time they hit next result, to remove all deleted books and reset the group counts. That is something which, I would argue, could be done less expensively and more easily in memory than via a CC. Because there are no hooks for this plugin to know that a user has deleted or merged a book, it must check each id in the duplicates space every time you hit next to give that count.
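For what it's worth, with the in-memory groups that recount is cheap - something along these lines (sketch only, assuming db.all_ids() and made-up names):
Code:
def remaining_group_count(groups, db):
    # groups: the list of duplicate sets (sets of book ids) from the search.
    # Ignore groups where merges/deletes have left fewer than two books.
    current_ids = set(db.all_ids())
    return sum(1 for group in groups if len(group & current_ids) > 1)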

You are of course correct that you wouldn't need to rebuild the mappings to display the book view so thanks for correcting me on that. Brain fart.

And about making a column visible: yeah, that's where you get into the problem of a column being there that you only want while you are resolving duplicates, and the myriad of issues about ways to hide it etc. If you get sidetracked while resolving and want to do other queries, restart Calibre etc., you have this column in your view that "annoys" when not being used. If I can avoid it then I would prefer to.
Old 04-11-2011, 10:43 AM   #7
chaley
Grand Sorcerer
Quote:
Originally Posted by kiwidude
What I thought would not be useful as a user though is to get shown a duplicate group that had only one entry in it when I go "next result". So just displaying book 3 by itself wouldn't give me a high level of confidence the plugin was "working right", even though we know the reasons why there is now just one book in the group.
If the user is using the plugin's next group function, then all you need to do is a search_getting_ids and display the group iff the count is greater than 1. If the user is using the tag browser, then they see the counts change as they manually resolve issues.

The above aside, as you are doing the work, your vote is worth more than mine. In addition, given the quality of all your other work, I am sure that whatever you do will do the job admirably. Have at it!
Old 04-11-2011, 10:56 AM   #8
kiwidude
Calibre Plugins Developer
Haha, dunno about the quality part mate - I've had a couple of crappy bugs with all the plugin rewrites recently, trying to context switch too often, and Eclipse/Pydev has decided not to be of any use to me any more at indicating missing imports etc. May as well be writing in Notepad again. Might be time to try Aptana, which John says has fewer issues.

I enjoy the debate to verify I'm not wandering too far down the wrong path. You guys inevitably suggest things I haven't considered. For instance just now you mentioned the tag browser. Using a custom column would of course let you "navigate" your groups that way, and as you say give you an ongoing count of the groups, which is something I hadn't figured out how to do other than via the status bar. I think that the group count is invalid unless you go through and "prune" (clear) group ids for books whose group no longer has more than one member though, and there has to be some action by the user to trigger that unless it is hooked into delete events (which, as I discovered with a previous plugin and discussed here before, don't get fired by database2.py anyway with that notify stuff).

But a starting point is better than nothing, and I reckon 90% of it should be reusable should we steer down another path once people start feeding back on usability etc.
Old 04-11-2011, 08:18 PM   #9
kiwidude
Calibre Plugins Developer
v0.1.1 beta

Ok, here is the progress so far. The big ticket item missing at this point is handling false positives; however the rest of it is pretty much there. So far I've put in five different algorithms:
  • ISBN exact match
  • Ignore author, exact title (Calibre's default add logic)
  • Ignore author, similar title
  • Exact author, similar title (Calibre automerge logic)
  • Similar author, similar title (Author using Kovid's new metadata logic)

It returns the matches, and you can navigate through them either by clicking on the toolbar button (ctrl+click on the button for previous) or using keyboard shortcuts (ctrl+\ for next result, ctrl+shift+\ for previous). Groups are dynamically removed every time you move to the next result.

False positives are my next issue. The first part which perhaps Chaley has thoughts on is deciding how to store them. For instance the simple case is that a user marks (1,2) as not dups, then (1,3) as not dups. So I could store those as tuples in a list for the library like [(1,2),(1,3)]

However in theory we could allow the user to select more than 2 rows. For instance they select (1,2,3). Now what does that actually "mean". Presumably the net effect is indicating [(1,2),(1,3),(2,3)]. Do we break it down and store it that way, or do we store it as [(1,2,3)]?

Now what happens if the user then selects (1,2,3,4)? As obviously that is a superset. Or what if they select (1,2,5)? I haven't got my head around whether we should flatten this out yet etc.

Presumably we also need to cater for the user screwing up and allow them to "undo" a not-duplicate set. And how could we show the user visually what duplicate exemptions they have in place? That might need a special screen built for it I think (and let them delete exemptions from that screen).

Feedback as always appreciated, be it on the code, usability or whatever. So far it is fairly simple I think.

Last edited by kiwidude; 04-12-2011 at 11:25 AM. Reason: Removed attachment, see later posts
Old 04-11-2011, 09:11 PM   #10
tomd365
Junior Member
I used this tonight on my collection and it worked very well. If the program found more than two matching books, it would hang after merging. Other than that it worked great - what a timesaver.
Old 04-12-2011, 02:55 AM   #11
kiwidude
Calibre Plugins Developer
@tomd365 - thanks for trying this and the feedback. The "hanging" sounds like a concern though. I have made two changes - one was a fix to the logic doing the cleaning up when you move to the next book, and the second is an improvement to the "xxx author, similar title" algorithms to strip whitespace from the fuzzied result, so that titles of "Manichu" and "Manichu -" would end up in the same group.
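For the curious, the kind of normalisation I mean is roughly this (just an illustrative sketch, not the exact code in the plugin):
Code:
import re
def similar_title_key(title):
    # Illustrative only: lower-case, drop punctuation, collapse and strip
    # whitespace so 'Manichu' and 'Manichu -' produce the same key.
    key = re.sub(r'[^a-z0-9\s]', '', title.lower())
    return re.sub(r'\s+', ' ', key).strip()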

If you can give me some exact steps to replicate any hanging you experienced it would be much appreciated. It wasn't very clear from your post. Did you mean more than two matching groups, or more than two books in a group? If the latter - how many groups, just one or multiple? What merges did you do - all of them required to get down to one record or just a couple? When did it actually "hang" - when you did next record, while you were merging or when you ran the duplicate check again?

Last edited by kiwidude; 04-12-2011 at 03:15 AM. Reason: Fix some shocking grammar.
Old 04-12-2011, 07:18 AM   #12
chaley
Grand Sorcerer
Quote:
Originally Posted by kiwidude
False positives are my next issue. The first part which perhaps Chaley has thoughts on is deciding how to store them. For instance the simple case is that a user marks (1,2) as not dups, then (1,3) as not dups. So I could store those as tuples in a list for the library like [(1,2),(1,3)]

However in theory we could allow the user to select more than 2 rows. For instance they select (1,2,3). Now what does that actually "mean". Presumably the net effect is indicating [(1,2),(1,3),(2,3)]. Do we break it down and store it that way, or do we store it as [(1,2,3)]?

Now what happens if the user then selects (1,2,3,4)? As obviously that is a superset. Or what if they select (1,2,5)? I haven't got my head around whether we should flatten this out yet etc.
The answer to the question "how do I store" will depend on how you process them and how you do the partitioning. For example, the algorithm that I posted a while back wants a map of sets, mapping an ID to a set of IDs that are known not to be duplicates. This map would be a good way to save the information, in which case, when given a selection of not-dups, you would iterate over that selection, adding all the members to each set.
Code:
  # not_dups: the set of book ids selected together; nd_map: a defaultdict(set)
  for bk in not_dups:
    nd_map[bk] |= not_dups
Whether or not you trim bk out of not_dups depends (again) on whether the algorithm is sensitive to the set containing itself.

Editing the not_dups list is mostly a presentation issue. The easiest way to do it (but perhaps not the easiest way to use it) would be to select a book and see the list of not_dups for that book (the value of nd_map[bk]). The user could check (or del or something) any book from that set, say with id bk2. The code would do:
Code:
# Remove the pairing in both directions so the map stays symmetric
for bk2 in checked_books:
  nd_map[bk] -= set([bk2])
  nd_map[bk2] -= set([bk])
The tricky thing is transitivity. If I say (1,2) are not_dups, then I say (1,3) are not_dups, the algorithm cannot generate (2,3) as a consequence. The map approach respects this restriction, as it never does a closure. To use your example and the above approach: (1,2,3,4) would generate
{ 1:(1,2,3,4), 2:(1,2,3,4), 3:(1,2,3,4), 4:(1,2,3,4) }
Then adding (1,2,5) would result in:
{ 1:(1,2,3,4,5), 2:(1,2,3,4,5), 3:(1,2,3,4), 4:(1,2,3,4), 5:(1,2,5) }
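Putting that together, a quick check of the behaviour (assuming nd_map is a collections.defaultdict(set) and each selection comes in as a set):
Code:
from collections import defaultdict
nd_map = defaultdict(set)
for selection in [{1, 2, 3, 4}, {1, 2, 5}]:
    for bk in selection:
        nd_map[bk] |= selection
# nd_map now holds 1: {1,2,3,4,5}, 2: {1,2,3,4,5}, 3: {1,2,3,4},
#                  4: {1,2,3,4}, 5: {1,2,5}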
Old 04-12-2011, 10:32 AM   #13
kiwidude
Calibre Plugins Developer
Hey Charles, thanks for the comments.

Yeah I was rather tired last night when I posted but wanted to ask for input before I crashed. I had come to the same conclusion of a dictionary by book of exclusions but wanted to make sure I wasn't missing some obvious alternative.

Part of my hesitation was thinking about the ways users could mark not duplicates. For instance if I let it be a free-for-all, then it could be a lot of data. Say for instance I allowed a user to mark any books they liked together as not duplicates. So I have a library of 1,000 books as a starting point, and I am confident I have no duplicates. If I selected the whole lot and said mark as not duplicates, then I would ensure that any duplicates that came up in future were not books I had considered. I could skip through the book groups with a very "loose" duplicate matching algorithm, then rather than resolving each group I could just resolve the ones that were dups, then select my whole library and mark it as not containing duplicates again.

From a user perspective this is kind of simple. From a data perspective it is a bit of a disaster, storing a cross-map table of every id with every other id.

So to prevent that scenario, I would instead only allow a user to mark as not duplicates books in the found sets - if the selection you make is not found together in the same set by the search algorithm, you can't mark them as not duplicates. That would keep the data volumes down, at the expense of more clicks required by the user.

Perhaps a compromise would be an additional menu option of something like "mark all groups as not duplicates". So a user could still skip through their duplicate results found, and then if they are happy they never want to see those again they can with one click add the various pairings within those groups as marked pairs. That would be dramatically less data while minimising the user effort. Does that sound useful?
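As a sketch of what that menu option could do with the exclusion dictionary discussed above (names are illustrative only, not actual plugin code):
Code:
def mark_all_groups_as_not_duplicates(groups, nd_map):
    # groups: the duplicate sets from the current search results.
    # nd_map: defaultdict(set) of book id -> ids it is exempted against.
    # Only pairings actually found together in a group are recorded,
    # so no cross-group closure is ever created.
    for group in groups:
        for book in group:
            nd_map[book] |= group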

As for the presentation, yeah, I need to think about that one. It has the same issues (I think) as a book-based view. You have a root book, and you have the related books marked as not duplicates. From a user perspective you perhaps initially want to see the set of all books marked as not duplicates, then when you click on one be able to see which books it is marked as not duplicates with. As you say, the whole nasty issue of transitivity comes up when you visually display that information to consider the pairings.

Maybe the easiest way is to only allow the user to remove all exemptions for a book. So we have a menu option of "Show marked not duplicates" which displays every book that has some kind of a not duplicate relationship. Then a user can choose a book and choose "Remove from not duplicates", which breaks all associations other books have with it. So if I have (1,2) and (1,3) marked as not duplicates, I choose "Show" which will display 1, 2, 3. If I choose "Remove" on book 1, then both its exclusion relationship with 2 and with 3 are deleted. If 2 & 3 have no exclusions with other books then they also get removed from the exclusion dictionary, so when it refreshes the view it would display no results.

So if a user wanted to remove just the (1,2) relationship, they would need to do the remove on book 2. Then when it refreshes it would still be showing book 1 and book 3. Without having some visual indication of the relationship pair graph between the books I think it is the simplest approach.
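A sketch of that "Remove from not duplicates" behaviour against the same exclusion dictionary (again, names are illustrative, and I'm following the convention above where a book's set includes itself):
Code:
def remove_all_exemptions(book_id, nd_map):
    # Break every exemption the book participates in, then drop any partner
    # book that is left with no exemptions of its own.
    for other in nd_map.pop(book_id, set()):
        if other in nd_map:
            nd_map[other].discard(book_id)
            if nd_map[other] <= {other}:
                del nd_map[other]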

Any thoughts/objections?
Old 04-12-2011, 10:45 AM   #14
Starson17
Wizard
Quote:
Originally Posted by kiwidude
Feedback as always appreciated, be it on the code, usability or whatever. So far it is fairly simple I think.
FYI, all options in Find Duplicates give me a zero division error.
calibre, version 0.7.53
ERROR: Unhandled exception: ZeroDivisionError: integer division or modulo by zero

Traceback (most recent call last):
File "calibre_plugins.find_duplicates.action", line 101, in find_duplicates
File "calibre_plugins.find_duplicates.duplicates", line 215, in run_duplicate_check
File "calibre_plugins.find_duplicates.duplicates", line 102, in run_duplicate_check
ZeroDivisionError: integer division or modulo by zero

This is my small test library and has a lot of funky data from hundreds of odd tests.
Old 04-12-2011, 11:08 AM   #15
kiwidude
Calibre Plugins Developer
Hi Starson17 - ahhh, it must be a very small library (less than 20 books perhaps?). I borrowed some code from my quality check plugin and it would suffer from the same issue. I will stick a new version up shortly.

