View Single Post
Old 02-10-2011, 07:47 AM   #72
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.
 
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Eyes glazing over ...

Quote:
Originally Posted by chaley View Post
In further conversations with kiwidude, the question of false positives came up. My suggestion would be to permit the user to say that two (or more?) given books are not duplicates. This information would be used by the duplicate detector to ensure that those books never appear together in a duplicate-book partition. The performance of this check would be very good if set arithmetic is used. Something like (in pseudo-code)

snip
After a bit of research and thought, I realized that a) the above algorithm doesn't work, and b) this is really a graph theory problem. The duplicate sets are equal to the number of different paths through a graph of nodes (books) in the test result, removing edges between nodes that are known not to be duplicates.

An example implementation is under the spoiler.
Spoiler:
Code:
from collections import defaultdict

# Construct map of books that are not duplicates
dups = [(3, 7), (10, 66, 11)]
print 'books known not to be duplicates', dups
not_duplicate_of_map = defaultdict(set)
for t in dups:
  s = set(t)
  for b in t:
    not_duplicate_of_map[b] = s

# Simulate a test
initial_dups =  [2, 3, 66, 7, 10, 11, 12]
initial_dups.sort()
print 'candidate duplicates', initial_dups

# Walk the nodes as a directed graph, refusing to visit nodes that 
# are declared to be not duplicates of a node already visited. This
# algorithm depends on the lists being sorted.
found_paths = []

def walk(node, path, remaining_nodes):
  path.append(node)
  rn = [n for n in remaining_nodes if n != node and n not in not_duplicate_of_map[node]]
  if len(rn) == 0:
    found_paths.append(sorted(path))
  for n in rn:
    if n > node:
      walk(n, path, rn)
  path.pop()


for k in initial_dups:
  walk(k, [], initial_dups)
print 'After partioning', found_paths

This implementation depends on the fact that the items in the graph are numbers and can be fully ordered, so it is easy to prune duplicate paths simply by never traversing an edge to a node less than the one in hand. As the list is sorted, such an edge would already have been traversed in the other direction, so the algorithm does not need to deal with discovering both (1,2,3) and (2,3,1).
chaley is offline   Reply With Quote