Duplicate Detection - Page 6

Starson17 · 02-10-2011, 09:38 AM

Quote:

Originally Posted by chaley

It is impossible for books in multiple sets

I know. The reason I didn't consider the possibility that a book could be in multiple sets was because I assumed we'd use the same process used for automerge. It processes the author and/or title to produce a character string. Then identical character strings are grouped. You only get one character string per book, so a book can only be in one set of identical books. When you run a different function against the author/title, you get a different character string and different matching sets, but again, you get only one.
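A minimal sketch of the key-based grouping described above, for illustration only. The `match_key` normalizer is a hypothetical stand-in, not calibre's actual automerge code, and the sample books are made up.

```python
from collections import defaultdict

def match_key(author, title):
    """Hypothetical normalizer: lowercase, keep only letters and digits."""
    text = f"{author} {title}".lower()
    return "".join(ch for ch in text if ch.isalnum())

def group_by_key(books):
    """Group book ids whose keys are identical; keep groups of two or more."""
    groups = defaultdict(list)
    for book_id, (author, title) in books.items():
        groups[match_key(author, title)].append(book_id)
    return [ids for ids in groups.values() if len(ids) > 1]

# Made-up sample data: id -> (author, title)
books = {
    1: ("Stephen King", "The Stand"),
    2: ("stephen king", "The Stand."),
    3: ("Stephen King", "It"),
}
duplicate_sets = group_by_key(books)
print(duplicate_sets)
```

Because each book yields exactly one key, it can land in only one group. A different `match_key` gives different groups, but still one group per book.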
I'll admit, I haven't quite grokked how you will generate multiple sets for a single book. Are you thinking of different runs at different times and storing the results or multiple matching functions run during the matching process ... or what?

Quote:

There clearly has to be a plugin involved here. I am assuming that books are viewed on the library view, rather than reinventing another view in the plugin.

Agreed

Quote:

I see no reason why the plugin cannot handle the 'show members of next duplicate set' through a context menu entry or through a keyboard shortcut. It would remember the last one it looked at, get the next one (easy if they are numbers), and do the search for you. You can use the highlight option if you want to see them in a larger context, or turn highlight off to see only that set.

Yes, although there are two sorts of "next" sets to be shown. The first is the next set for the next group of matched identical books in the next set that has no relation to the previous set. The other is the "next" set for the current book where the current book is a member of more than one identical book set.

The latter type of "next" set only occurs if the matching process permits books to be members of more than one set. I'm still not convinced that we need to allow that at a single point in time. Clearly we need it for different runs (Run 1 match author/title using the automerge function and show duplicate sets, Run 2 do soundex matching of title only, Run 3 do soundex matching of author and exact match title, etc.) but do we need to do all three runs and store the results at the same time?

Would it not be sufficient to do the runs individually for each matching function?

Starson17 · 02-10-2011, 09:50 AM

Quote:

Originally Posted by chaley

false positives ... removing edges between nodes that are known not to be duplicates.

It's worth considering how a duplicate finder is likely to be used. Will it be used only to find and permanently merge or eliminate duplicates? Or will it also be used as some sort of pseudo search extension?

If the search functions for duplicates include soundex functionality (similar sounding names - fuzzy matching) that isn't implemented in the search bar, we may want to be able to disable the false positive removal, or implement the duplicate finding functions in the search bar.

I know that at some point I'm going to find a group of near duplicates that I don't want to merge and do want to eliminate from further duplicate searches, but which I later want to find as a group simply because I remember I found that group once before and I want to see it again.

chaley · 02-10-2011, 09:53 AM

Quote:

Originally Posted by Starson17

Yes, although there are two sorts of "next" sets to be shown. The first is the next set for the next group of matched identical books in the next set that has no relation to the previous set. The other is the "next" set for the current book where the current book is a member of more than one identical book set.

I hadn't considered the first one. It isn't clear to me what it means, unless you are talking about sets generated by different tests. I am not convinced of the usefulness of that, assuming that we have a way of removing known false positives.

Quote:

The latter type of "next" set only occurs if the matching process permits books to be members of more than one set. I'm still not convinced that we need to allow that at a single point in time. Clearly we need it for different runs (Run 1 match author/title using the automerge function and show duplicate sets, Run 2 do soundex matching of title only, Run 3 do soundex matching of author and exact match title, etc.) but do we need to do all three runs and store the results at the same time?

Would it not be sufficient to do the runs individually for each matching function?

Clearly we have different mental models here.

Mine is, roughly speaking, that the user requests that some tests be run. These are all run together, producing sets of candidate duplicates. Depending on the fuzziness of the matches, a book can be in more than one set because fuzzy matching isn't transitive (if we have (a matches b) and (b matches c), there is nothing that says that (a matches c)). I don't think that we should force transitivity, so by extension I don't think we should disallow books in multiple sets.

The next step is to ensure that known/declared not-duplicates are removed from the sets. This removes known false positives. This will by necessity produce new sets. For example, assume that the test returns books (1,2,3). Assume further that books (2,3) are known to not be duplicates. To remove the false positive but keep the information the test produced, we must partition (1,2,3) into (1,2) and (1,3).
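The (1,2,3) example can be sketched as follows. This is only one possible reading of the post-pass: it keeps the largest subsets of a candidate set that contain no declared not-duplicate pair. The function name and the brute-force approach are assumptions, and the approach is only reasonable for small sets.

```python
from itertools import combinations

def split_by_exemptions(candidate, not_duplicates):
    """Return the maximal subsets of `candidate` that contain no exempted pair."""
    exempt = {frozenset(pair) for pair in not_duplicates}
    members = sorted(candidate)
    kept = []
    # Try larger subsets first, so smaller ones already covered can be skipped.
    for size in range(len(members), 1, -1):
        for subset in combinations(members, size):
            has_exempt_pair = any(
                frozenset(pair) in exempt for pair in combinations(subset, 2)
            )
            is_covered = any(set(subset) <= set(k) for k in kept)
            if not has_exempt_pair and not is_covered:
                kept.append(subset)
    return kept

# The test returned (1,2,3); books 2 and 3 are known not to be duplicates.
result = split_by_exemptions({1, 2, 3}, [(2, 3)])
print(result)
```

Book 1 ends up in both resulting sets, which is the second route to "a book in more than one set".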

Thus, we have two ways to get the same book into different duplicate sets: non-transitive operations and removal of known non-duplicates.

You have introduced a third: the kind of test. I am not sure about the usefulness of this. Do I really care how the potential duplicate was found, again assuming I can remove false positives? If the answer is yes, then I suggest that the different tests use different custom columns, thereby separating the results.

chaley · 02-10-2011, 10:01 AM

Quote:

Originally Posted by Starson17

It's worth considering how a duplicate finder is likely to be used. Will it be used only to find and permanently merge or eliminate duplicates? Or will it also be used as some sort of pseudo search extension?

If the search functions for duplicates include soundex functionality (similar sounding names - fuzzy matching) that isn't implemented in the search bar, we may want to be able to disable the false positive removal, or implement the duplicate finding functions in the search bar.

This is a good idea, and not disallowed by the schemes being discussed. A search would produce a set. I don't see any necessity to do known-duplicate processing.

I should point out that as it is, search is not capable of comparing a given book against all books in the library. Some serious work would be required to be able to ask the question "find all books that are like this one". If the fuzzy searches are invertible (can be determined from book data), then I can see generating a fuzzy-search expression that produces a list of matches. However, if the fuzzy searches are one way, where some algorithm is applied and some number of books 'win', then things are much more interesting.

Quote:

I know that at some point I'm going to find a group of near duplicates that I don't want to merge and do want to eliminate from further duplicate searches, but which I later want to find as a group simply because I remember I found that group once before and I want to see it again.

It seems that you are saying that you want the option to not do known-duplicate processing. That should be easy enough for the GUI-man.

Starson17 · 02-10-2011, 12:48 PM

Quote:

Originally Posted by chaley

I hadn't considered the first one. It isn't clear to me what it means, unless you are talking about sets generated by different tests. I am not convinced of the usefulness of that, assuming that we have a way of removing known false positives.
Clearly we have different mental models here.

Getting matching mental models does help.

Quote:

Mine is, roughly speaking, that the user requests that some tests be run. These are all run together, producing sets of candidate duplicates. Depending on the fuzziness of the matches, a book can be in more than one set because fuzzy matching isn't transitive (if we have (a matches b) and (b matches c), there is nothing that says that (a matches c)).

I was discussing multiple models (and doing a lousy job separating them).

The first was the current automerge matching model, which is transitive. An incoming title is processed by the matching function to produce a match pattern. A candidate matching title is processed by the same matching function. If the result for that title matches the match pattern exactly, they are duplicates. a=b and b=c implies all three produce the same match pattern, so a=c. Implementing this easily allows global simultaneous review of all sets with dividers or highlighting to separate groups. A book is only in one set - the set that matches the match pattern for that book.

Quote:

I don't think that we should force transitivity, so by extension I don't think we should disallow books in multiple sets.

In the automerge-based model above, I was forcing transitivity and by extension disallowing books in multiple sets. I was unduly influenced by thinking about automerge, which is book-based, while you are thinking of a set based approach.

Quote:

The next step is to ensure that known/declared not-duplicates are removed from the sets. This removes known false positives. This will by necessity produce new sets. For example, assume that the test returns books (1,2,3). Assume further that books (2,3) are known to not be duplicates. To remove the false positive but keep the information the test produced, we must partition (1,2,3) into (1,2) and (1,3).

Thus, we have two ways to get the same book into different duplicate sets: non-transitive operations and removal of known non-duplicates.

You have introduced a third: the kind of test. I am not sure about the usefulness of this. Do I really care how the potential duplicate was found, again assuming I can remove false positives? If the answer is yes, then I suggest that the different tests use different custom columns, thereby separating the results.

A second model is closer to yours, but is still book-based. Instead of looking at match sets in sequence, one looks at individual books in sequence and considers each set. Book 1 matched book 2 in your partitioned set (1, 2). It also matched book 3 in partitioned set (1,3).

I was thinking I'd ask to see a first list of books to review for possible dupes:

First approach - book based:
I'd see the set (1,2) and decide if 1 and 2 were dupes. I'd then press "next set," see (1,3) and decide if 1 and 3 were dupes. Now I'd press "next set" and see (4, 8). Note that I'm looking at match sets in the order of the books - 1, 2, 3, 4, etc, skipping books that have no matches and any sets previously considered. Note also the step between showing books in the sets that include book 1 versus the step to another totally unrelated set (4, 8) that I thought might be useful to signal. - The two types of "next set" I mentioned.

Second approach - still book based:
When asking to see the first set of matching books, why not show me the set (1, 2, 3)? Yes, 2 and 3 are not dupes, but I'm not sure if that's useful when showing the books that match book 1. I still need to see if 1 matches 2 and 3. Is it better to do it in two stages or in one?

In the first (automerge) transitive model and the second of the two other approaches above, there is only one set per book. In all three, one is doing a book-based review: "Show me books that match this book." In a set-based review, one must consider all the cross links. The number of decisions for set-based review is the number of combinations of two books selected from the match set. For example: in a set of 8 members, I need to make 28 decisions (is the third book the same as the last book, is the fifth book the same as the sixth, etc.). With a book-based approach, I need to make only seven decisions (is each of the other books in the set a duplicate of the book under review).

In set based you have a large number of combinations to consider for each set and you have multiple sets for each book, but fewer total sets to analyze.

In book based, you have fewer decisions for each set, and you can collapse the sets for that book, if you wish, but you have many more sets to review.

As usual, it's just random thoughts - I've got no certainty as to what would work best.

chaley · 02-11-2011, 02:56 AM

Quote:

Originally Posted by Starson17

Second approach - still book based:
When asking to see the first set of matching books, why not show me the set (1, 2, 3)? Yes, 2 and 3 are not dupes, but I'm not sure if that's useful when showing the books that match book 1. I still need to see if 1 matches 2 and 3. Is it better to do it in two stages or in one?

Doesn't that mean that you must spend brain cycles deciding again that 2 and 3 are not dupes? I can see why you might want that, but then one must work through rather carefully the notion of 'false positive'.

Regarding transitivity, consider the following. Assume:
- a test that matches if two books contain one title word in common and 1 author in common.
- a book 'Ectoplasm' by Joe Blogs (book 1)
- a book 'Auras' by Patricia Posts (book 2)
- A book 'Ectoplasm and Auras' by Joe Blogs and Patricia Posts (book 3). This is an omnibus edition.

The test will identify books (1,3) and (2,3) as potential dupes. Transitivity would give us (1,2,3), which is clearly wrong, as 1 and 2 are definitely not dupes of each other. I am ignoring further levels of transitivity, which would expand the set even more.

The question then becomes which is better, showing all three which might help identifying the omnibus but requiring some thought to ignore the (1,2) pair, or showing (1,3) (2,3) which shows the information the test actually found (and avoids the transitive closure problem). I don't have an answer. My guess is that this will come to personal preference. Joy to the GUI man.
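The omnibus example can be run as a small sketch. The data layout and function names here are made up for illustration. The matching rule is the one stated above: one title word in common and one author in common.

```python
from itertools import combinations

# id -> (title, set of authors), taken from the example above
books = {
    1: ("Ectoplasm", {"Joe Blogs"}),
    2: ("Auras", {"Patricia Posts"}),
    3: ("Ectoplasm and Auras", {"Joe Blogs", "Patricia Posts"}),
}

def matches(a, b):
    """True if the two books share a title word and an author."""
    title_a, authors_a = books[a]
    title_b, authors_b = books[b]
    shared_words = set(title_a.lower().split()) & set(title_b.lower().split())
    return bool(shared_words) and bool(authors_a & authors_b)

# What the test actually found: one pair per match.
pairs = [p for p in combinations(sorted(books), 2) if matches(*p)]

def transitive_closure(pairs):
    """Merge pairs that share a book into connected groups."""
    groups = []
    for pair in pairs:
        merged = set(pair)
        rest = []
        for group in groups:
            if group & merged:
                merged |= group
            else:
                rest.append(group)
        groups = rest + [merged]
    return groups

closed = transitive_closure(pairs)
print(pairs)   # the raw matches
print(closed)  # the same matches after forcing transitivity
```

The raw pairs keep books 1 and 2 apart. The closure puts them in one group even though the test never matched them.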

kacir · 02-11-2011, 03:05 AM

Quote:

Originally Posted by chaley

Regarding transitivity, consider the following. Assume:
- a test that matches if two books contain one title word in common and 1 author in common.
- a book 'Ectoplasm' by Joe Blogs (book 1)
- a book 'Auras' by Patricia Posts (book 2)
- A book 'Ectoplasm and Auras' by Joe Blogs and Patricia Posts (book 3). This is an omnibus edition.

The test will identify books (1,3) and (2,3) as potential dupes. Transitivity would give us (1,2,3), which is clearly wrong, as 1 and 2 are definitely not dupes of each other. I am ignoring further levels of transitivity, which would expand the set even more.

The question then becomes which is better, showing all three which might help identifying the omnibus but requiring some thought to ignore the (1,2) pair, or showing (1,2) (1,3) which shows the information the test actually found (and avoids the transitive closure problem). I don't have an answer. My guess is that this will come to personal preference. Joy to the GUI man.

I am a proponent of the theory that we should start with something that is not perfect but works and is relatively easy to implement; then we can use it and discuss how to improve the result. This is how Calibre is developed ;-)

So, at the moment I would be extremely happy if I got result (1,2,3). I would have to go through results anyway and this would be *much* quicker than going through entire collection author after author (checking for the fuzziness in the author name (that is King Stephen; Stephen King; S. King; King, S.; S KING ...))

chaley · 02-11-2011, 04:44 AM

Quote:

Originally Posted by kacir

I am a proponent of the theory that we should start with something that is not perfect but works and is relatively easy to implement; then we can use it and discuss how to improve the result. This is how Calibre is developed ;-)

So, at the moment I would be extremely happy if I got result (1,2,3). I would have to go through results anyway and this would be *much* quicker than going through entire collection author after author (checking for the fuzziness in the author name (that is King Stephen; Stephen King; S. King; King, S.; S KING ...))

Assuming the test I described is used, the natural result would be (1,3) and (2,3). Extending it to (1,2,3) requires more work, the amount of which depends on whether it does one-away or n-away closure.

Also, my experience has shown me that, counter to the RAD religions, having an idea of where one wants to go simplifies life. It is usually hard to unwind a choice, especially ones that have UI and architecture consequences, so thinking a bit about end points and trajectory at this point is good.

Starson17 · 02-11-2011, 09:12 AM

Quote:

Originally Posted by chaley

Doesn't that mean that you must spend brain cycles deciding again that 2 and 3 are not dupes? I can see why you might want that, but then one must work through rather carefully the notion of 'false positive'.

Not if you think of this as "Show me all books that may be duplicates of Book 1." I don't have to think about anything except possible matches to Book 1. There's a possible duplicate set (1, 3) and another (2, 3), so if I work through the books in book order, and I'm working on Book 1 matches, I only have to decide if Book 1 is a match of 3 and not if Book 2 matches 3.

When examining Book 2, I would have to decide if 2 matches 3. There is no (1,2) match, so it doesn't show on Book 2. With luck, there are no other matches of Book 2 and nothing more appears for that book. We're also done for Book 3, since we've finished the (1, 3) and (2, 3) checks when doing Books 1 and 2 (assuming no other matches for Book 3).

Quote:

Regarding transitivity, consider the following. Assume:
- a test that matches if two books contain one title word in common and 1 author in common.
- a book 'Ectoplasm' by Joe Blogs (book 1)
- a book 'Auras' by Patricia Posts (book 2)
- A book 'Ectoplasm and Auras' by Joe Blogs and Patricia Posts (book 3). This is an omnibus edition.

The test will identify books (1,3) and (2,3) as potential dupes. Transitivity would give us (1,2,3), which is clearly wrong, as 1 and 2 are definitely not dupes of each other. I am ignoring further levels of transitivity, which would expand the set even more.

I had in mind review in Book order (or selected order, but still by book) showing only (1, 3) for Book 1, and marking that as not a match or merging them. Book 2 is nowhere a match of Book 1 so it would not be shown when examining Book 1. Book 2 is a possible match of Book 3, so we see (2,3) for Book 2. For Book 3, we see nothing, as the two possible matches (1, 3) and (2, 3) have already been resolved. If we started with Book 3 ('Ectoplasm and Auras' by Joe Blogs and Patricia Posts), however, then we would have seen (1, 2, 3) and resolved the matches for only Book 3. That would have resolved the (1,3) and the (2,3) matches in a single shot. There are no other matches and we're done. Nothing shows up for Book 1 or Book 2. I agree, for this to work well, we need to know what book is under consideration.

Your model is to do this set by set, instead of book by book. In set by set, there is no "book under consideration" so we can't (shouldn't) show (1, 2, 3) for the reasons you elegantly explained.

Quote:

The question then becomes which is better, showing all three which might help identifying the omnibus but requiring some thought to ignore the (1,2) pair, or showing (1,2) (1,3) which shows the information the test actually found (and avoids the transitive closure problem). I don't have an answer. My guess is that this will come to personal preference. Joy to the GUI man.

I see an advantage to showing what was actually found - (you wrote "(1,2) (1,3)" but from the example it should have been "(1,3) (2,3)"), but it only makes sense to me to show that if we're doing a book by book review and have a specified "book under consideration." If we're doing set by set, then it can explode if Book 2 matches 4 and 4 matches 7,8,9 and ....etc.

Without having played with it, or actually used any code, I lean towards your set-by-set approach, but I was just throwing out what was in my mind from the transitive model (which is also book-by-book) based on the automerge code. Perhaps both could be tested or even added as options.

Also, we've barely discussed what to do with multiple matching functions, which I suspect will need to be handled. If one matching function is author/title based and I mark (2, 3) as "Not Duplicates", then later use a "Find all identical ISBN numbers" as a new matching function, should a (2, 3) match be ignored, even if they have identical ISBN numbers?

I ran into some of these problems when writing my personal duplicate finder SQL code. There's a reason we haven't gotten a good duplicate finder yet.

As you said - Joy to the GUI man!

Starson17 · 02-11-2011, 09:35 AM

Quote:

Originally Posted by kacir

So, at the moment I would be extremely happy if I got result (1,2,3).

I lean towards Charles on this, provided we're looking at match sets, and not books. I'd rather see the exact results.

Quote:

I would have to go through results anyway and this would be *much* quicker than going through entire collection author after author (checking for the fuzziness in the author name (that is King Stephen; Stephen King; S. King; King, S.; S KING ...))

Book by book doesn't imply looking at all books - only books that have been found to have at least one other matching book. It means you look at all sets having the book under consideration (knowing that book has been found to have at least one duplicate), removing that book (and all books merged into it) from all other sets, then looking at all remaining sets having the next book that has at least one match, etc.

In set by set, you have to go through the number of sets that have been found. In book by book you have to go through the number of books that appear in any set. The first has fewer items in the list to review (match sets), but more decisions for each item. The second has more items to review (books in at least one set), but fewer decisions for each item. In the first, (IMHO) you can't collapse items as effectively as in the second.

My example was for a set of 8 books. In set by set, you have to make 28 decisions to decide if any book in that set is a match of any other book in the set. In book by book, you look at the first book, then decide if any of the remaining 7 books match. You don't have to decide if the second book is a match of the fifth book, but you can, if you wish. If you merge any books, that merge takes the merged book out of the book list to review, by removing that book from all remaining sets.
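The arithmetic behind the 8-book example, as a quick check. The function names are made up for illustration.

```python
from math import comb

def set_by_set_decisions(n):
    """Every pair in an n-book set must be judged: n choose 2."""
    return comb(n, 2)

def book_by_book_decisions(n):
    """The book under review is compared against each of the other n - 1 books."""
    return n - 1

for n in (3, 5, 8):
    print(n, set_by_set_decisions(n), book_by_book_decisions(n))
```

For n = 8 this gives 28 pairwise decisions against 7 comparisons for a single book under review. The gap widens quickly as sets grow.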

Starson17 · 02-11-2011, 09:44 AM

Quote:

Originally Posted by Starson17

I lean towards Charles on this, provided we're looking at match sets, and not books. I'd rather see the exact results.... (IMHO) you can't collapse items as effectively as in the second.

To be clear, I'll expand on the problems of collapsing sets. Say you have found a match set of (1,2,3,4,5). If I'm collapsing sets and reviewing set by set, I have to add to this set all sets that have any of those 5 books in it. There might be 20 books that matched 5, even though there's no chance that any of those 20 books matched 1-4. You'd have 80 useless decisions to make - Do any of them match book 1 - 20 decisions. Do any match Book 2 - 20 more decisions, etc.

However, for book by book review, when reviewing Book 1, you only add in books that have been found to match Book 1. Every decision you make is a decision about a match that was actually found by the match function. You only collapse match sets that include Book 1 and only review Book 1 before moving on to Book 2. Of course, while reviewing Book 1, you are very likely to merge books (1, 3, 4) and when you do that, you automatically remove the need to ever review Books 3 and 4. They are removed from all match sets.
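A rough sketch of the "removed from all match sets" step, with made-up sets and a hypothetical function name. It simply drops the merged ids. Whether a leftover partner of a merged book should be re-pointed at the surviving book is not settled here, so that is left out.

```python
def remove_merged(merged_ids, match_sets):
    """Drop merged books from every match set; discard sets left with < 2 books."""
    merged_ids = set(merged_ids)
    remaining = []
    for match_set in match_sets:
        reduced = match_set - merged_ids
        if len(reduced) >= 2:
            remaining.append(reduced)
    return remaining

match_sets = [{1, 3, 4}, {4, 5, 6}, {8, 9}]
# Books 3 and 4 were merged into book 1 while reviewing book 1.
after_merge = remove_merged([3, 4], match_sets)
print(after_merge)
```

After the merge, the set that only linked books 1, 3 and 4 disappears, and books 3 and 4 never come up for review again.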

chaley · 02-11-2011, 09:50 AM

Quote:

Originally Posted by Starson17

Not if you think of this as "Show me all books that may be duplicates of Book 1." I don't have to think about anything except possible matches to Book 1. There's a possible duplicate set (1, 3) and another (2, 3), so if I work through the books in book order, and I'm working on Book 1 matches, I only have to decide if Book 1 is a match of 3 and not if Book 2 matches 3.

Finally I understand you, I think.

To test my understanding, does the following make sense? Assume that I have done a set-oriented test, and now have a bunch of sets. If I was viewing by book, then when I ask 'show me matches for book X', I would show at one go all the sets containing X. This is a set union, not a transitive closure. In the example above, asking for matches of book 1, I would see book 3. Asking for matches of book 2, I would see book 3. Asking for matches of book 3, I would see books 1 and 2.

This is probably a very useful alternate visualization of the data. I don't think it would be hard. All we would need would be a book -> set map.
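A minimal sketch of such a book -> set map and the union lookup, using the sets from the example. The names are illustrative, and nothing here is calibre API.

```python
from collections import defaultdict

def build_book_map(match_sets):
    """Map each book id to the indices of the sets that contain it."""
    book_map = defaultdict(list)
    for index, match_set in enumerate(match_sets):
        for book in match_set:
            book_map[book].append(index)
    return book_map

def matches_for(book, match_sets, book_map):
    """Union of every set containing `book`, excluding the book itself."""
    union = set()
    for index in book_map.get(book, []):
        union |= match_sets[index]
    union.discard(book)
    return union

match_sets = [{1, 3}, {2, 3}]
book_map = build_book_map(match_sets)
for book in (1, 2, 3):
    print(book, matches_for(book, match_sets, book_map))
```

The lookup only follows sets that contain the requested book. Asking for book 1 therefore never pulls in book 2, which is the difference from a transitive closure.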

I also think that duplicate processing would happen when building the sets, but would not happen when merging the sets (doing the union). I think that this would give the answers very close to what you describe in your second post (the more detailed example).

Quote:

Also, we've barely discussed what to do with multiple matching functions, which I suspect will need to be handled. If one matching function is author/title based and I mark (2, 3) as "Not Duplicates", then later use a "Find all identical ISBN numbers" as a new matching function, should a (2, 3) match be ignored, even if they have identical ISBN numbers?

Who knows?

My guess is that duplicate processing must be optional. Fortunately, this isn't hard. Just don't do the post-pass.

Starson17 · 02-11-2011, 09:55 AM

It's worth commenting that none of this relates to what kiwidude said might be in version 1 - use the existing automerge matching function code.

Automerge produces results where you can't get any of the problems discussed here.

Quote:

- a book 'Ectoplasm' by Joe Blogs (book 1)
- a book 'Auras' by Patricia Posts (book 2)
- A book 'Ectoplasm and Auras' by Joe Blogs and Patricia Posts (book 3). This is an omnibus edition.

None of these would ever match in automerge.

Starson17 · 02-11-2011, 10:04 AM

Quote:

Originally Posted by chaley

To test my understanding, does the following make sense? Assume that I have done a set-oriented test, and now have a bunch of sets. If I was viewing by book, then when I ask 'show me matches for book X', I would show at one go all the sets containing X. This is a set union, not a transitive closure. In the example above, asking for matches of book 1, I would see book 3. Asking for matches of book 2, I would see book 3. Asking for matches of book 3, I would see books 1 and 2.

Yes, you have it exactly right.

I was vaguely thinking of some sort of "Next Book" function with a display (Highlighted? Appears in dialog popup window? Is listed first in a list with other match members?) of the "Book under consideration."

At some point I have to think about every book that's a possible match of every other book. It just seemed to me that doing them in some order was easier than trying to think about all possible cross matches within any given match set. Obviously, I am going to think about those cross matches, and if I do any merges (or we can think of some way to mark some cross matches I'm sure are not duplicates) then we can use that info to reduce the number of additional review steps by removing those books or known non-matches from the remaining books to be reviewed.

chaley · 02-11-2011, 10:24 AM

Quote:

Originally Posted by Starson17

I was vaguely thinking of some sort of "Next Book" function with a display (Highlighted? Appears in dialog popup window? Is listed first in a list with other match members?) of the "Book under consideration."

One way to do this that would live conveniently within calibre's existing capabilities would be to have (yet another) custom column (integer) that is used only for displaying duplicates. When you say 'next book', the CC would first be cleared, then filled in with a number generated from the union of the sets. Number 0 would be the book under examination. The remaining numbers could be derived from several places, such as the number of sets a given book and the book under examination both appear in, or could simply be a counter. This implementation permits easily showing the books in the library view, and also permits sorting and searching while being able to recover the original view.
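One way the numbers for that column might be computed, as a sketch only. It uses the "number of shared sets" option. The function name is made up, and writing the values into a custom column is left out, since no calibre API calls are assumed here.

```python
from collections import Counter

def display_numbers(book, match_sets):
    """Return {book_id: number}, with 0 for the book under examination.

    Every other book gets the count of sets it shares with `book`.
    """
    shared = Counter()
    for match_set in match_sets:
        if book in match_set:
            for other in match_set:
                if other != book:
                    shared[other] += 1
    numbers = {book: 0}
    numbers.update(shared)
    return numbers

match_sets = [{1, 3}, {2, 3}, {1, 3, 5}, {7, 8}]
numbers = display_numbers(3, match_sets)
print(numbers)
```

Books that appear in no set with the book under examination get no number at all, so a search on the column would show only the union for that book.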

Hmmm... Perhaps making it a search restriction would be even better. That would facilitate subsearches and subsorts while permitting going back to the original order.

I am suspicious of building a navigation dialog box. My feeling is that there would be a lot of pressure to make it as capable as the library view, including all the metadata edit features. On the other hand, if the dialog shows minimal information but is capable of scrolling the library view to the book (in effect, searching for id:nnnn with highlight turned on), then the pressure might be avoided.
