View Single Post
Old 04-12-2011, 07:18 AM   #12
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.
 
Posts: 12,449
Karma: 8012886
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Originally Posted by kiwidude View Post
False positives are my next issue. The first part which perhaps Chaley has thoughts on is deciding how to store them. For instance the simple case is that a user marks (1,2) as not dups, then (1,3) as not dups. So I could store those as tuples in a list for the library like [(1,2),(1,3)]

However in theory we could allow the user to select more than 2 rows. For instance they select (1,2,3). Now what does that actually "mean". Presumably the net effect is indicating [(1,2),(1,3),(2,3)]. Do we break it down and store it that way, or do we store it as [(1,2,3)]?

Now what happens if the user then selects (1,2,3,4)? As obviously that is a superset. Or what if they select (1,2,5)? I haven't got my head around whether we should flatten this out yet etc.
The answer to the question "how do I store" will depend on how you process them and how you do partitioning. For example, the algorithm that I posted a while back wants a map of sets, mapping an ID to a set of IDs that are known not duplicates. This map would be a good way to save the information, in which case when given a list of not-dups, you would iterate over that list adding all the members to each set.
Code:
  for bk in not_dups:
    nd_map[bk] |= not_dups
Whether or not you trim bk out of not_dups depends (again) on whether the algorithm is sensitive to the set containing itself.

Editing the not_dups list is mostly a presentation issue. The easiest way to do it (but perhaps not the easiest way to use it) would be to select a book and see the list of not_dups for that book (the value of nd_map[bk]). The user could check (or del or something) any book from that set, say with id bk2. The code would do:
Code:
for bk2 in checked_books:
  nd_map[bk] -= set([bk2])
  nd_map[bk2] -= set([bk])
The tricky thing is transitivity. If I say (1,2) are not_dups, then I say (1,3) are not_dups, the algorithm cannot generate (2,3) as a consequence. The map approach respects this restriction, as it never does a closure. To use your example and the above approach: (1,2,3,4) would generate
{ 1:(1,2,3,4), 2:(1,2,3,4), 3:(1,2,3,4), 4:(1,2,3,4) }
Then adding (1,2,5) would result in:
{ 1:(1,2,3,4,5), 2:(1,2,3,4,5), 3:(1,2,3,4), 4:(1,2,3,4), 5:(1,2,5) }
chaley is offline   Reply With Quote