MobileRead Forums - View Single Post

chaley · 02-10-2011, 07:47 AM

Quote:

Originally Posted by chaley

In further conversations with kiwidude, the question of false positives came up. My suggestion would be to permit the user to say that two (or more?) given books are not duplicates. This information would be used by the duplicate detector to ensure that those books never appear together in a duplicate-book partition. The performance of this check would be very good if set arithmetic is used. Something like (in pseudo-code)

snip

After a bit of research and thought, I realized that a) the above algorithm doesn't work, and b) this is really a graph theory problem. Each duplicate set corresponds to a distinct path through a graph whose nodes are the books in the test result, after removing the edges between nodes that are known not to be duplicates.

An example implementation is under the spoiler.

Spoiler:

This implementation depends on the fact that the items in the graph are numbers and so are totally ordered, which makes it easy to prune duplicate paths: never traverse an edge to a node lower than the one in hand. Because the list is sorted, such an edge has already been traversed in the other direction, so the algorithm never has to deal with discovering both (1,2,3) and (2,3,1).
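For readers who want to experiment with the idea, here is a minimal Python sketch of the ascending-path pruning described above. It is not the code from the original post. The names `duplicate_sets`, `not_duplicates`, `compatible` and `walk` are hypothetical, and it assumes that a path may only be extended to a node that still has an edge to every node already on the path, so that two books marked as not duplicates can never land in the same set. It also assumes that a path contained in a longer path is dropped, and that a set needs at least two books to count.

```python
def duplicate_sets(books, not_duplicates):
    """Hypothetical helper: group candidate duplicates so that no group
    contains a pair listed in not_duplicates."""
    nodes = sorted(set(books))
    # Removed edges: pairs the user has marked as "not duplicates".
    banned = {frozenset(pair) for pair in not_duplicates}

    def compatible(a, b):
        return frozenset((a, b)) not in banned

    paths = []

    def walk(path, candidates):
        # candidates holds only nodes greater than the one in hand that are
        # still connected to every node on the path, so (2, 3, 1) is never
        # generated once (1, 2, 3) has been.
        if not candidates:
            if path:
                paths.append(frozenset(path))
            return
        for i, node in enumerate(candidates):
            rest = [n for n in candidates[i + 1:] if compatible(node, n)]
            walk(path + [node], rest)

    walk([], nodes)

    # Assumption: keep only paths not contained in a longer one, and only
    # those with at least two books.
    unique = set(paths)
    maximal = [p for p in unique
               if len(p) > 1 and not any(p < other for other in unique)]
    return sorted(tuple(sorted(p)) for p in maximal)


if __name__ == "__main__":
    # Books 1..5 all matched each other, but 1 and 3 are known not to be
    # duplicates.
    print(duplicate_sets(range(1, 6), [(1, 3)]))
```

In this example books 1 and 3 end up in separate sets, while books 2, 4 and 5 appear in both. The sketch walks every ascending path, so the work grows quickly with the size of a candidate group; it is meant to show the pruning rule, not to be efficient.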