Quote:
Originally Posted by kiwidude
I am less concerned at this point about the "identification algorithms" as they can be added/tweaked over time.
I agree we don't need to think about any specific "identification of duplicates" algorithm now. However, we probably do need to think about whether there will be multiple different algorithms (presumably selected by the user) or a single one selected by the code author(s). If there's only one algorithm, then avoiding false positives in multiple runs is probably easier than if there are multiple algorithms.
If there are multiple algorithms, one approach is to use Charles' idea about multiple columns, one for each algorithm, to track and avoid false positives when/if that algorithm is run again. Another approach would be to store is_multiple tag keys for each book:
algorithm1#-book2id-book3id-book4id,
algorithm2#-book2id-book5id,
algorithm3#-book2id-book3id
For this book (call it book1), three duplicate-matching algorithms have been run. When the first (identified as algorithm1#) was run, it found book2, book3 and book4 as matches, but the user said they were not matches, so that false-positive information was stored against algorithm1# for book1.
When algorithm2# was run, it found book2 and book5 as false positives (any other dupes it found would have been merged into book1). Presumably this algorithm did not think that book3 or book4 were dupes of book1; if it had, the user would presumably have marked them as false positives, too.
When algorithm3# was run, it found book2 and book3 (but not book4 or book5) as false positives.
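To make the scheme concrete, here is a minimal Python sketch of encoding and checking those exemption tags. The helper names and numeric book ids are my own illustration (the examples above use placeholders like book2id), not anything from calibre's API:

```python
def build_exemption_tag(algorithm_id, false_positive_ids):
    """Encode one algorithm's false positives for a book as a single tag string,
    e.g. build_exemption_tag("algorithm1", [2, 3, 4]) -> "algorithm1#-2-3-4"."""
    return algorithm_id + "#-" + "-".join(str(i) for i in false_positive_ids)

def parse_exemption_tag(tag):
    """Decode a tag back into (algorithm_id, set of exempted book ids)."""
    algo, _, tail = tag.partition("#-")
    ids = {int(part) for part in tail.split("-") if part}
    return algo, ids

def is_exempt(tags, algorithm_id, candidate_id):
    """True if candidate_id was already marked a false positive
    for this algorithm against this book, so a re-run can skip it."""
    for tag in tags:
        algo, ids = parse_exemption_tag(tag)
        if algo == algorithm_id and candidate_id in ids:
            return True
    return False

# book1's tags after the three runs described above:
book1_tags = [
    build_exemption_tag("algorithm1", [2, 3, 4]),
    build_exemption_tag("algorithm2", [2, 5]),
    build_exemption_tag("algorithm3", [2, 3]),
]
```

On a later run of algorithm1#, a match against book4 would be suppressed by `is_exempt(book1_tags, "algorithm1", 4)`, while algorithm2# would still be free to flag book4.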
I'm inclined to think that offering multiple search algorithms is a necessary feature. Avoiding false positives on multiple runs of each algorithm would be nice, but could be added later, provided we structure things in a way that doesn't exclude adding that feature.