04-25-2011, 05:16 AM   #151
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.
 
Posts: 12,450
Karma: 8012886
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Originally Posted by kiwidude
@chaley - actually I think I found another issue with that algorithm. I don't think it actually "works".

For instance, if I set
dups = [(3,4), (3,5)]
initial_dups = [1,2,3,4,5,6]

The results it gives me are:
[1,2,3,4,5],[1,2,4,5,6]

Look at the first group - it has 3 and 4 together. Yet they are specifically exempted from appearing together in a group, so why has 6 been removed instead?
The problem you found has to do with how the non-dup sets were built. The code assumed that each book's exclusion set was already complete, but in your example it is not (3 appears in two pairs), so the sets needed to be unioned. With that one-character change to the dup_map building code (sketched after the output below), the partitioning algorithm produces the right answer.
Code:
books known not to be duplicates [(3, 4), (3, 5)]
candidate duplicates [1, 2, 3, 4, 5, 6]
After partitioning [[1, 2, 3, 6], [1, 2, 4, 5, 6]]
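For reference, here is the one-character change in isolation (a minimal sketch; I am assuming the broken version used plain assignment `=`). The fix is to use `|=` (set union) so that overlapping pairs such as (3,4) and (3,5) merge into one exclusion set:
Code:
from collections import defaultdict

dups = [(3,4), (3,5)]
not_duplicate_of_map = defaultdict(set)
for t in dups:
    s = set(t)
    for b in t:
        # broken (assumed): not_duplicate_of_map[b] = s
        # -- the (3,5) pair would overwrite the (3,4) pair for book 3
        not_duplicate_of_map[b] |= s  # fixed: union the sets

print(not_duplicate_of_map[3])  # {3, 4, 5}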
That said, the partitioning algorithm I used was stupid. You have already written your own and therefore don't need this, but for completeness here is a much better one that avoids the exponential behavior of the last one. It is still not linear, but I don't think it can be, given the nature of partitioning. This code also contains the one-character change to correctly build the non-dup sets.
Spoiler:
Code:
from collections import defaultdict

# Construct the map of books that are known not to be duplicates of each
# other. Using |= (set union) rather than plain assignment is the
# one-character fix: a book can appear in more than one non-dup pair.
dups = [(3,4), (3,5), (3, 6), (7, 8), (8, 9)]
print('books known not to be duplicates', dups)
not_duplicate_of_map = defaultdict(set)
for t in dups:
    s = set(t)
    for b in t:
        not_duplicate_of_map[b] |= s

# Simulate a test
initial_dups = sorted(range(1, 10))
print('candidate duplicates', initial_dups)

# Initial condition -- the result contains one set of all elements
results = [set(initial_dups)]
partitioning_ids = [None]
# Loop through the candidate duplicates, checking to see if the entry is in
# a non-dup set. Note that partitioning appends to results while we iterate
# over it; the appended groups never contain one_dup, so they are skipped.
for one_dup in initial_dups:
    if one_dup in not_duplicate_of_map:
        # The entry is indeed in a non-dup set. We may need to partition.
        for i, res in enumerate(results):
            if one_dup in res:
                # This result group contains the item with a non-dup set. If the item
                # was the one that caused this result group to partition in the first place,
                # then we must not partition again or we will make subsets of the group
                # that split this partition off. Consider a group of (1,2,3,4) and
                # non-dups of [(1,2), (2,3)]. The first partition will give us (1,3,4)
                # and (2,3,4). Later, when we discover (2,3), if we partition (2,3,4)
                # again we will end up with (2,4) and (3,4), but (3,4) is a subset
                # of (1,3,4). All we need to do is remove 3 from the (2,3,4) partition.
                if one_dup == partitioning_ids[i]:
                    results[i] = (res - not_duplicate_of_map[one_dup]) | {one_dup}
                    continue
                # Must partition. We already have one partition, the one in our hand.
                # Remove the dups from it, then create new partitions for each of the dups.
                results[i] = (res - not_duplicate_of_map[one_dup]) | {one_dup}
                for nd in not_duplicate_of_map[one_dup]:
                    # Only partition on duplicates larger than the one we are looking
                    # at. This is safe because the non-dup set map is complete:
                    # map[2] == {2,3} and map[3] == {2,3}, so when processing
                    # the set for 3 we have already done the work for element 2.
                    if nd > one_dup and nd in res:
                        results.append((res - not_duplicate_of_map[one_dup]) | {nd})
                        partitioning_ids.append(nd)

print('After partitioning')
sr = []
for r in results:
    if len(r) > 1:
        sr.append(sorted(r))
sr.sort()
for r in sr:
    print(r)
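To convince yourself the groups are right, a quick sanity check (not part of the plugin code; check_partition is just an illustrative name) is to assert that no result group contains both members of a known non-dup pair. Running the fixed code on your original inputs gives the [[1, 2, 3, 6], [1, 2, 4, 5, 6]] shown above, and it passes:
Code:
def check_partition(groups, non_dup_pairs):
    # A valid partitioning never puts both members of a known
    # non-duplicate pair into the same group.
    return all(not (a in g and b in g)
               for a, b in non_dup_pairs
               for g in groups)

# kiwidude's original inputs and the output from the fixed code above
assert check_partition([[1, 2, 3, 6], [1, 2, 4, 5, 6]], [(3, 4), (3, 5)])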