MobileRead Forums - View Single Post

Starson17 · 02-10-2011, 12:48 PM

Quote:

Originally Posted by chaley

I hadn't considered the first one. It isn't clear to me what it means, unless you are talking about sets generated by different tests. I am not convinced of the usefulness of that, assuming that we have a way of removing known false positives.
Clearly we have different mental models here.

Getting matching mental models does help.

Quote:

Mine is, roughly speaking, that the user requests that some tests be run. These are all run together, producing sets of candidate duplicates. Depending on the fuzziness of the matches, a book can be in more than one set because fuzzy matching isn't transitive (if we have (a matches b) and (b matches c), there is nothing that says that (a matches c)).
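To make the non-transitivity concrete, here is a small sketch. The fuzzy_match test is purely hypothetical (same length, at most one differing character) and is not the test calibre uses; it only shows that a can match b and b can match c without a matching c.

```python
def fuzzy_match(a: str, b: str, max_diff: int = 1) -> bool:
    """Hypothetical fuzzy test: same length, at most max_diff differing characters."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= max_diff

print(fuzzy_match("cat", "cot"))  # True  (a matches b)
print(fuzzy_match("cot", "dot"))  # True  (b matches c)
print(fuzzy_match("cat", "dot"))  # False (a does not match c)
```

With a test like this, "cot" ends up in a candidate set with "cat" and in another with "dot".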

I was discussing multiple models (and doing a lousy job separating them).

The first was the current automerge matching model, which is transitive. An incoming title is processed by the matching function to produce a match pattern. A candidate matching title is processed by the same matching function. If the result for that title matches the match pattern exactly, they are duplicates. a=b and b=c implies all three produce the same match pattern, so a=c. Implementing this easily allows global simultaneous review of all sets with dividers or highlighting to separate groups. A book is only in one set - the set that matches the match pattern for that book.
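A rough sketch of that transitive model, assuming a made-up match_pattern function (lowercase, drop a leading article, strip punctuation). The real automerge matching function is not reproduced here; the point is only that grouping by an exact key puts each book in exactly one set.

```python
import re
from collections import defaultdict

def match_pattern(title: str) -> str:
    """Hypothetical normaliser, standing in for the real matching function."""
    t = title.lower()
    t = re.sub(r"^(the|a|an)\s+", "", t)   # drop a leading article
    return re.sub(r"[^a-z0-9]", "", t)     # keep letters and digits only

def group_by_pattern(titles):
    """Group titles whose match patterns are identical; singletons are dropped."""
    groups = defaultdict(list)
    for title in titles:
        groups[match_pattern(title)].append(title)
    return [g for g in groups.values() if len(g) > 1]

titles = ["The Hobbit", "hobbit", "Dune", "DUNE!", "Emma"]
print(group_by_pattern(titles))
# [['The Hobbit', 'hobbit'], ['Dune', 'DUNE!']]
```

Because equality of keys is transitive, the groups never overlap, which is what makes a single global review list with dividers easy.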

Quote:

I don't think that we should force transitivity, so by extension I don't think we should disallow books in multiple sets.

In the automerge-based model above, I was forcing transitivity and by extension disallowing books in multiple sets. I was unduly influenced by thinking about automerge, which is book-based, while you are thinking of a set based approach.

Quote:

The next step is to ensure that known/declared not-duplicates are removed from the sets. This removes known false positives. This will by necessity produce new sets. For example, assume that the test returns books (1,2,3). Assume further that books (2,3) are known to not be duplicates. To remove the false positive but keep the information the test produced, we must partition (1,2,3) into (1,2) and (1,3).
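One way to sketch that partitioning step. The function name and the representation (book ids as integers, known not-duplicates as pairs) are assumptions for illustration, not anything that exists in calibre.

```python
def split_on_exemptions(candidates, not_duplicates):
    """Split a candidate set so that no result contains a known not-duplicate pair.

    Returns the maximal surviving sets as sorted tuples.
    """
    pending = [frozenset(candidates)]
    clean = set()
    while pending:
        s = pending.pop()
        # Find a known not-duplicate pair that is still wholly inside this set.
        bad = next((p for p in not_duplicates if set(p) <= s), None)
        if bad is None:
            if len(s) > 1:          # a single book is not a duplicate set
                clean.add(s)
            continue
        a, b = bad
        pending.append(s - {a})     # one copy without a
        pending.append(s - {b})     # one copy without b
    # Drop any set that is a strict subset of another surviving set.
    maximal = [s for s in clean if not any(s < other for other in clean)]
    return sorted(tuple(sorted(s)) for s in maximal)

print(split_on_exemptions({1, 2, 3}, [(2, 3)]))
# [(1, 2), (1, 3)]
```

Book 1 ends up in both results, which is the second route to multi-set membership described above.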

Thus, we have two ways to get the same book into different duplicate sets: non-transitive operations and known duplicate removal.

You have introduced a third: the kind of test. I am not sure about the usefulness of this. Do I really care how the potential duplicate was found, again assuming I can remove false positives? If the answer is yes, then I suggest that the different tests use different custom columns, thereby separating the results.

A second model is closer to yours, but is still book-based. Instead of looking at match sets in sequence, one looks at individual books in sequence and considers each set containing that book. Book 1 matched book 2 in your partitioned set (1, 2). It also matched book 3 in partitioned set (1, 3).

I was thinking I'd ask to see a first list of books to review for possible dupes:

First approach - book based:
I'd see the set (1, 2) and decide if 1 and 2 were dupes. I'd then press "next set," see (1, 3) and decide if 1 and 3 were dupes. Now I'd press "next set" and see (4, 8). Note that I'm looking at match sets in the order of the books (1, 2, 3, 4, etc.), skipping books that have no matches and any sets previously considered. Note also the difference between stepping to another set that includes book 1 and stepping to a totally unrelated set (4, 8); I thought it might be useful to signal that difference. These are the two types of "next set" I mentioned.
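A sketch of that ordering, with the flag for the two kinds of "next set". Treating the lowest-numbered book in a set as the book it is reviewed under is my assumption; it gives the same order as walking books 1, 2, 3, ... and skipping sets already shown.

```python
def review_order(sets):
    """Yield (set, starts_new_book) pairs in book order.

    starts_new_book is True when the set belongs to a different book
    than the previous set, i.e. the jump to an unrelated set.
    """
    previous_anchor = None
    # De-duplicate and sort so sets are visited in order of their lowest book.
    for s in sorted({tuple(sorted(x)) for x in sets}):
        anchor = s[0]
        yield s, anchor != previous_anchor
        previous_anchor = anchor

for s, new_book in review_order([(1, 3), (4, 8), (1, 2)]):
    print(s, "new book" if new_book else "same book")
# (1, 2) new book
# (1, 3) same book
# (4, 8) new book
```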

Second approach - still book based:
When asking to see the first set of matching books, why not show me the set (1, 2, 3)? Yes, 2 and 3 are not dupes, but I'm not sure if that's useful when showing the books that match book 1. I still need to see if 1 matches 2 and 3. Is it better to do it in two stages or in one?
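The one-stage version amounts to taking the union of every set that contains the book under review. A minimal sketch, with a hypothetical helper name:

```python
def collapse_for_book(book, sets):
    """Merge every candidate set containing `book` into one review set."""
    merged = set()
    for s in sets:
        if book in s:
            merged.update(s)
    return tuple(sorted(merged))

sets = [(1, 2), (1, 3), (4, 8)]
print(collapse_for_book(1, sets))  # (1, 2, 3)
print(collapse_for_book(4, sets))  # (4, 8)
```

The known not-duplicate pair (2, 3) reappears side by side in the collapsed set, which is the trade-off being asked about.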

In the first (automerge) transitive model and the second of the two other approaches above, there is only one set per book. In all three, one is doing a book based review: "Show me books that match this book." In a set based review, one must consider all the cross links. The number of decisions for set-based review is the number of combinations of two books selected from the match set. For example: in a set of 8 members, I need to make 28 decisions (is the third book the same as the last book, is the fifth book the same as the sixth, etc.). With a book based approach, I need to make only eight decisions (is any book in the set a duplicate of the book under review).
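The counts can be checked quickly. This sketch assumes the book under review is itself one of the 8 members, in which case the book based count comes to seven comparisons; the figure of eight corresponds to eight candidates besides the book under review. Either way it grows linearly, while the set based count grows with the number of pairs.

```python
from itertools import combinations
from math import comb

members = list(range(1, 9))                 # a match set of 8 books

# Set based: every pair of books is a separate decision.
pairs = list(combinations(members, 2))
print(len(pairs), comb(len(members), 2))    # 28 28

# Book based: compare the book under review with each other member.
under_review = 1
book_based = [(under_review, other) for other in members if other != under_review]
print(len(book_based))                      # 7
```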

In set based you have a large number of combinations to consider for each set and you have multiple sets for each book, but fewer total sets to analyze.

In book based, you have fewer decisions for each set, and you can collapse the sets for that book, if you wish, but you have many more sets to review.

As usual, it's just random thoughts - I've got no certainty as to what would work best.