04-25-2011, 05:16 AM   #151
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.
 
Posts: 12,450
Karma: 8012886
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Originally Posted by kiwidude
@chaley - actually I think I found another issue with that algorithm. I don't think it actually "works".

For instance, if I set
dups = [(3,4), (3,5)]
initial_dups = [1,2,3,4,5,6]

The results it gives me are:
[1,2,3,4,5],[1,2,4,5,6]

Look at the first group - it has 3 and 4 together. Yet they are specifically exempted from appearing together in a group, so why has 6 been removed instead?
The problem you found has to do with how the non-dup sets were built. The code assumed that each book's exclusion set was already complete, but in your example it is not (3 appears in two pairs), so the sets needed to be unioned. With that one-character change to the dup_map building code (sketched after the output below), the partitioning algorithm produces the right answer.
Code:
books known not to be duplicates [(3, 4), (3, 5)]
candidate duplicates [1, 2, 3, 4, 5, 6]
After partitioning [[1, 2, 3, 6], [1, 2, 4, 5, 6]]
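For reference, here is the one-character change in isolation (a minimal sketch; I am assuming the broken version used plain assignment `=`). The fix is to use `|=` (set union) so that overlapping pairs such as (3,4) and (3,5) merge into one exclusion set:
Code:
from collections import defaultdict

dups = [(3,4), (3,5)]
not_duplicate_of_map = defaultdict(set)
for t in dups:
    s = set(t)
    for b in t:
        # broken (assumed): not_duplicate_of_map[b] = s
        # -- the (3,5) pair would overwrite the (3,4) pair for book 3
        not_duplicate_of_map[b] |= s  # fixed: union the sets

print(not_duplicate_of_map[3])  # {3, 4, 5}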
That said, the partitioning algorithm I used was stupid. You have already written your own and therefore don't need this, but for completeness here is a much better one that avoids the exponential behavior of the last one. It is still not linear, but I don't think it can be, given the nature of partitioning. This code also contains the one-character change to correctly build the non-dup sets.
Spoiler:
Code:
from collections import defaultdict

# Construct the map of books that are known not to be duplicates of each
# other. Using |= (set union) rather than plain assignment is the
# one-character fix: a book can appear in more than one non-dup pair.
dups = [(3,4), (3,5), (3, 6), (7, 8), (8, 9)]
print('books known not to be duplicates', dups)
not_duplicate_of_map = defaultdict(set)
for t in dups:
    s = set(t)
    for b in t:
        not_duplicate_of_map[b] |= s

# Simulate a test
initial_dups = sorted(range(1, 10))
print('candidate duplicates', initial_dups)

# Initial condition -- the result contains one set of all elements
results = [set(initial_dups)]
partitioning_ids = [None]
# Loop through the candidate duplicates, checking to see if the entry is in
# a non-dup set. Note that partitioning appends to results while we iterate
# over it; the appended groups never contain one_dup, so they are skipped.
for one_dup in initial_dups:
    if one_dup in not_duplicate_of_map:
        # The entry is indeed in a non-dup set. We may need to partition.
        for i, res in enumerate(results):
            if one_dup in res:
                # This result group contains the item with a non-dup set. If the item
                # was the one that caused this result group to partition in the first place,
                # then we must not partition again or we will make subsets of the group
                # that split this partition off. Consider a group of (1,2,3,4) and
                # non-dups of [(1,2), (2,3)]. The first partition will give us (1,3,4)
                # and (2,3,4). Later, when we discover (2,3), if we partition (2,3,4)
                # again we will end up with (2,4) and (3,4), but (3,4) is a subset
                # of (1,3,4). All we need to do is remove 3 from the (2,3,4) partition.
                if one_dup == partitioning_ids[i]:
                    results[i] = (res - not_duplicate_of_map[one_dup]) | {one_dup}
                    continue
                # Must partition. We already have one partition, the one in our hand.
                # Remove the dups from it, then create new partitions for each of the dups.
                results[i] = (res - not_duplicate_of_map[one_dup]) | {one_dup}
                for nd in not_duplicate_of_map[one_dup]:
                    # Only partition on duplicates larger than the one we are looking
                    # at. This is safe because the non-dup set map is complete:
                    # map[2] == {2,3} and map[3] == {2,3}, so when processing
                    # the set for 3 we have already done the work for element 2.
                    if nd > one_dup and nd in res:
                        results.append((res - not_duplicate_of_map[one_dup]) | {nd})
                        partitioning_ids.append(nd)

print('After partitioning')
sr = []
for r in results:
    if len(r) > 1:
        sr.append(sorted(r))
sr.sort()
for r in sr:
    print(r)
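To convince yourself the groups are right, a quick sanity check (not part of the plugin code; check_partition is just an illustrative name) is to assert that no result group contains both members of a known non-dup pair. Running the fixed code on your original inputs gives the [[1, 2, 3, 6], [1, 2, 4, 5, 6]] shown above, and it passes:
Code:
def check_partition(groups, non_dup_pairs):
    # A valid partitioning never puts both members of a known
    # non-duplicate pair into the same group.
    return all(not (a in g and b in g)
               for a, b in non_dup_pairs
               for g in groups)

# kiwidude's original inputs and the output from the fixed code above
assert check_partition([[1, 2, 3, 6], [1, 2, 4, 5, 6]], [(3, 4), (3, 5)])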