Quote:
Originally Posted by kiwidude
My initial thought was also that you would have to store each exemption group as a list of its members, which I assume you mean the same way. So your persistence of groups would be a list of lists like:
[(1,2,3),(10,11,12,13,14,15),(1,5,6),...]
Yes
Quote:
You say that this is "compatible" with the partitioning algorithm. I don't see how without some surgery on it, unless you turn the above into an exemption map first? If that is the case, then I think we get back on the same page of the existing issue of the permutations, which only a threshold is going to resolve. Or were you thinking there is some way of doing this without an exemption map?
First, that storage mechanism is compatible with the existing not_dup_map construction because of the bug you found: the construction now generates the union of the sets, which is exactly what we want it to do.
I am not convinced that the non_dup map's memory use is a problem in normal usage, but it does get problematic when people do silly things. The space problem you mention arises because the current non_dup map is a cartesian product of all the exemption groups, which has excellent performance properties but perhaps less than ideal space properties. In our case the problem might be manageable: an exemption group of 200 elements will generate 200 sets of 200 elements, or 40,000 items, and one of 1,000 elements will generate 1,000 sets, or 1,000,000 items. Neither of these is outrageously large. To verify, I measured the memory use of my test code: an exemption group of 1,000 elements used 21MB of memory.
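To make sure we are talking about the same construction, here is a minimal sketch of that quadratic map as I understand it (the names are mine, not the plugin's):
Code:
from collections import defaultdict

# Quadratic construction: every member of an exemption group maps to a
# full copy of its group, so a group of N members stores N sets of N
# elements -- N*N items in total.
not_duplicate_of_map = defaultdict(set)
exemption_groups = [(1,2,3), (10,11,12,13,14,15), (1,5,6)]
for group in exemption_groups:
    s = set(group)
    for book in group:
        # |= unions, so overlapping groups merge: map[1] == set([1,2,3,5,6])
        not_duplicate_of_map[book] |= s
print sorted(not_duplicate_of_map[1])   # [1, 2, 3, 5, 6]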
Unfortunately, if someone builds an exemption group of 5,000 elements, things start getting dicey: a group that size requires 700MB of memory, and 8,000 elements requires 1.1GB.
What this shows is that for a reasonable number of exemptions, the memory usage isn't out of line. A 'reasonable number' seems to be around 2000 unique elements in the union of the exemption groups.
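The quadratic growth is easy to see with a trivial back-of-envelope loop (nothing plugin-specific here):
Code:
# The map stores N sets of N items each, so storage grows as N*N.
for n in (200, 1000, 5000, 8000):
    print '%5d elements -> %10d stored items' % (n, n * n)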
If you are uncomfortable with that limit and the non-graceful degradation, then another implementation suggests itself; this is the surgery you mentioned. Instead of building a cartesian product up front, we store a list of sets and compute the union of those sets whenever we need a non_dup map entry. This alternate implementation trades performance for memory stability: the memory required is only that needed to store the exemption groups themselves, and the penalty is doing the union when required instead of once at the beginning.
Considering the performance penalty: we are again faced with the 'normal usage' question. This implementation does the equivalent of building the exemption map once per duplicate group instead of once overall. If there is one duplicate group, there is no penalty. If there are 1,000 duplicate groups, then we pay the cost of building the non_dup map 1,000 times.
Thus the question: will we in fact encounter situations with both thousands of duplicate groups and exemption lists of thousands of elements? I hope not. In any event, this algorithm degrades more gracefully than the other one. One reason is that, per duplicate group, the union is done only N times, where N is the number of elements in that duplicate group. If there are thousands of duplicate groups, they should each be rather small, mitigating the performance problem.
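As a rough cost model (my sketch, not a measurement): the total number of unions is the sum of the duplicate-group sizes, and each union's cost depends on the exemption lists involved, so many small groups stay cheap:
Code:
# Total unions = sum of duplicate-group sizes. 1,000 groups of 5 books
# do 5,000 unions over small exemption lists; the pathological case
# needs both huge duplicate groups and huge exemption lists at once.
duplicate_groups = [range(i, i + 5) for i in xrange(0, 5000, 5)]
print sum(len(g) for g in duplicate_groups)   # 5000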
Sample implementation under the spoiler. There aren't many changes. The not_duplicate_of_map is now a map to a list of sets. The union of these sets is computed when required, once per item in a duplicate group. You will see that the test has one exemption group of 100,000 elements.
Spoiler:
Code:
from collections import defaultdict

# Construct map of books that are not duplicates. Each book maps to a
# list of the exemption sets that contain it; unions are done lazily.
dups = [(3,4), (3,5), (3,6), (7,8), (8,9), tuple([i for i in xrange(1,100000)])]
#print 'books known not to be duplicates', dups
not_duplicate_of_map = defaultdict(list)
for t in dups:
    s = set(t)
    for b in t:
        not_duplicate_of_map[b].append(s)

# Simulate a test
initial_dups = [i for i in xrange(1,100)]
initial_dups.sort()
print 'candidate duplicates', initial_dups

def merge_sets(list_of_sets):
    if len(list_of_sets) == 0:
        return set()
    if len(list_of_sets) == 1:
        return list_of_sets[0]
    return set().union(*list_of_sets)

# Initial condition -- the group contains 1 set of all elements
results = [set(initial_dups)]
partitioning_ids = [None]

# Loop through the set of duplicates, checking to see if the entry is
# in a non-dup set
for one_dup in initial_dups:
    if one_dup in not_duplicate_of_map:
        ndm_entry = merge_sets(not_duplicate_of_map[one_dup])
        # The entry is indeed in a non-dup set. We may need to partition.
        # Note: partitions appended below never contain one_dup, so
        # iterating while appending is safe -- new entries fail the
        # 'one_dup in res' test.
        for i, res in enumerate(results):
            if one_dup in res:
                # This result group contains the item with a non-dup set. If the
                # item was the one that caused this result group to partition in
                # the first place, then we must not partition again or we will
                # make subsets of the group that split this partition off.
                # Consider a group of (1,2,3,4) and non-dups of [(1,2), (2,3)].
                # The first partition will give us (1,3,4) and (2,3,4). Later
                # when we discover (2,3), if we partition (2,3,4) again, we will
                # end up with (2,4) and (3,4), but (3,4) is a subset of (1,3,4).
                # All we need to do is remove 3 from the (2,3,4) partition.
                if one_dup == partitioning_ids[i]:
                    results[i] = (res - ndm_entry) | set([one_dup])
                    continue
                # Must partition. We already have one partition, the one in our
                # hand. Remove the dups from it, then create new partitions for
                # each of the dups.
                results[i] = (res - ndm_entry) | set([one_dup])
                for nd in ndm_entry:
                    # Only partition if the duplicate is larger than the one we
                    # are looking at. This is necessary because the non-dup set
                    # map is complete, map[2] == (2,3) and map[3] == (2,3). We
                    # know that when processing the set for 3, we have already
                    # done the work for the element 2.
                    if nd > one_dup and nd in res:
                        results.append((res - ndm_entry) | set([nd]))
                        partitioning_ids.append(nd)

print 'After partitioning'
sr = []
for r in results:
    if len(r) > 1:
        sr.append(sorted(list(r)))
sr.sort()
for r in sr:
    print r
Quote:
In the last paragraph you lost me again. Are you saying we would store both the exemptions as groups and some pairs indicating any removals from the groups? Or that you partition the exemption groups on a removal? Or something else?
My suggestion is silly. Ignore it.
I think the last question is how to handle removal of exemptions. Not storing pairs complicates things a bit, but I believe the following works.
Assume that the user selects 'manage exemptions for this book'. First problem: construct the exemption list for the book. This is done in the same way as duplicate partitioning: construct the non_dup map, then take the union of the sets for the master book.
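A minimal sketch of that lookup (exemptions_for_book is a name I made up for illustration):
Code:
def exemptions_for_book(master, exemption_groups):
    # Union of every exemption group containing the master book,
    # minus the master book itself.
    result = set()
    for group in exemption_groups:
        if master in group:
            result |= set(group)
    result.discard(master)
    return result

groups = [(1,2,3), (10,11,12), (1,5,6)]
print sorted(exemptions_for_book(1, groups))   # [2, 3, 5, 6]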
Second problem: the user checks a book or two (or 1,000), asking that the non_dup exemption be removed for those books. What the user is really asking for is that the master book and the selected books never appear together in any exemption group. To satisfy this wish, you can:
- iterate over all the existing exemption groups, removing the master book. Exemption groups with length <= 1 are removed.
- create a new exemption group consisting of the master book and all unchecked books in the list.
This process works because what is initially displayed is the union of all the exemption groups containing the master book. After deleting the master book from all the exemption groups, that union is empty. Adding the new exemption group makes the union correct again.
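Here is a sketch of both steps together (remove_exemptions, master, and checked are illustrative names; checked is the set of books the user asked to un-exempt):
Code:
def remove_exemptions(master, checked, exemption_groups):
    # What is displayed is the union of the groups containing the master.
    displayed = set()
    for group in exemption_groups:
        if master in group:
            displayed |= set(group)
    displayed.discard(master)
    # Step 1: drop the master from every group; a group left with a
    # single member exempts nothing and is discarded.
    new_groups = []
    for group in exemption_groups:
        remaining = set(group) - set([master])
        if len(remaining) > 1:
            new_groups.append(remaining)
    # Step 2: one new group re-creates the exemptions the user kept.
    unchecked = displayed - set(checked)
    if unchecked:
        new_groups.append(unchecked | set([master]))
    return new_groups

groups = [(1,2,3), (1,5,6)]
print remove_exemptions(1, [2,5], groups)
# -> [set([2, 3]), set([5, 6]), set([1, 3, 6])]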
One possible issue is that over time the number of exemption groups could grow. This is certainly true if the user selects a zillion books and marks them as exempt, then later goes through that list unmarking them one by one. Performance will degrade in this case, but I am not sure by how much.