MobileRead Forums - View Single Post

chaley · 02-10-2011, 09:53 AM

Quote:

Originally Posted by Starson17

Yes, although there are two sorts of "next" sets to be shown. The first is the next set for the next group of matched identical books in the next set that has no relation to the previous set. The other is the "next" set for the current book where the current book is a member of more than one identical book set.

I hadn't considered the first one. It isn't clear to me what it means, unless you are talking about sets generated by different tests. I am not convinced of the usefulness of that, assuming that we have a way of removing known false positives.

Quote:

The latter type of "next" set only occurs if the matching process permits books to be members of more than one set. I'm still not convinced that we need to allow that at a single point in time. Clearly we need it for different runs (Run 1 match author/title using the automerge function and show duplicate sets, Run 2 do soundex matching of title only, Run 3 do soundex matching of author and exact match title, etc.) but do we need to do all three runs and store the results at the same time?

Would it not be sufficient to do the runs individually for each matching function?

Clearly we have different mental models here.

Mine is, roughly speaking, that the user requests that some tests be run. These are all run together, producing sets of candidate duplicates. Depending on the fuzziness of the matches, a book can be in more than one set because fuzzy matching isn't transitive (if we have (a matches b) and (b matches c), there is nothing that says that (a matches c)). I don't think that we should force transitivity, so by extension I don't think we should disallow books in multiple sets.
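To make the non-transitivity concrete, here is a toy sketch. The matching rule is invented purely for illustration (equal-length titles that differ in at most one character); it is not any existing matching function, and `fuzzy_match` is a hypothetical name.

```python
from itertools import combinations


def fuzzy_match(a, b, max_diff=1):
    """Hypothetical fuzzy test: equal-length titles that differ
    in at most max_diff character positions."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= max_diff


titles = ["cat", "cot", "dot"]

# One candidate set per matching pair; nothing is merged transitively.
candidate_sets = [set(pair) for pair in combinations(titles, 2)
                  if fuzzy_match(*pair)]

print(candidate_sets)
```

Here "cat" matches "cot" and "cot" matches "dot", but "cat" does not match "dot", so "cot" ends up in two candidate sets. Forcing transitivity would merge all three titles into a single set.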

The next step is to ensure that known/declared not-duplicates are removed from the sets. This removes known false positives. This will by necessity produce new sets. For example, assume that the test returns books (1,2,3). Assume further that books (2,3) are known to not be duplicates. To remove the false positive but keep the information the test produced, we must partition (1,2,3) into (1,2) and (1,3).
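A minimal sketch of that splitting step, assuming the known not-duplicates are stored as pairs of book ids. The function name and the data layout are hypothetical, not anything that exists in the code today.

```python
def split_on_exemptions(candidates, not_duplicates):
    """Split one candidate set so that no resulting set contains
    a pair that has been declared not-duplicate."""
    candidates = frozenset(candidates)
    for a, b in not_duplicates:
        if a in candidates and b in candidates:
            # The pair cannot share a set: try dropping each member in turn.
            results = (split_on_exemptions(candidates - {a}, not_duplicates)
                       | split_on_exemptions(candidates - {b}, not_duplicates))
            # Keep only maximal sets (drop any proper subset of another).
            return {s for s in results if not any(s < t for t in results)}
    # No exempt pair remains; a single book is not a duplicate set.
    return {candidates} if len(candidates) > 1 else set()


result = split_on_exemptions({1, 2, 3}, [(2, 3)])
print(sorted(sorted(s) for s in result))
```

For the example above this yields (1,2) and (1,3), with book 1 in both sets. If the only two books in a set are declared not-duplicates, nothing is left to show and the result is empty.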

Thus, we have two ways to get the same book into different duplicate sets: non-transitive matching and the removal of known not-duplicates.

You have introduced a third: the kind of test. I am not sure about the usefulness of this. Do I really care how the potential duplicate was found, again assuming I can remove false positives? If the answer is yes, then I suggest that the different tests use different custom columns, thereby separating the results.