Quote:
Originally Posted by kiwidude
The downside is that it is "another thing" you must run from time to time.
That was my concern that led to the 'perhaps it should be here'.
Quote:
From an implementation perspective if it was in Find Duplicates I was just thinking I could compute two hashes for authors names - one as per now, one as reversed. Store the alternate hash in a separate dictionary, then at the end do a pass through that dictionary testing to see if the reversed hash value exists in the candidate groups and if so merge the results together. Something like that anyways, not given it masses of thought.
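To make the quoted idea concrete, here is a rough sketch of how I read it (all names invented, normalization deliberately naive): compute a normal hash and a reversed hash per author, keep the alternates in a second dictionary, then do a final pass merging any alternate group whose key matches a candidate group.

```python
def author_hash(name):
    # Naive normalization: "B, A" -> "b a" (placeholder for the real hash)
    return " ".join(name.lower().replace(",", " ").split())

def reversed_hash(name):
    # Same normalization, but with the name parts reversed
    parts = name.lower().replace(",", " ").split()
    return " ".join(reversed(parts))

def find_swapped_groups(books):
    candidates = {}   # normal hash -> set of book ids
    alternates = {}   # reversed hash -> set of book ids
    for book_id, author in books:
        candidates.setdefault(author_hash(author), set()).add(book_id)
        alternates.setdefault(reversed_hash(author), set()).add(book_id)
    # Final pass: merge alternate groups into matching candidate groups
    for h, ids in alternates.items():
        if h in candidates:
            candidates[h] |= ids
    # Only groups with more than one member are duplicate candidates
    return {h: ids for h, ids in candidates.items() if len(ids) > 1}
```

Note that with books ("A B", "B, A") this produces two groups with identical membership, which is exactly the case the set pruning below has to catch.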
If we are trying to find duplicate books where the author names are swapped around, then simply adding all forms of the author to the candidate map should work. If there is a book X with author "A B", and another book X with author "B, A", then adding both permutations for X will generate two identical duplicate groups. These should be pruned, assuming you have added the set pruning stuff.
If we are doing author checks, then the same thing still works (I think). Adding all permutations of an author will generate two duplicate groups for every author that appears in both forms. Again, set pruning will take care of this.
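A minimal sketch of the simpler approach (same naive normalization as above, names hypothetical): add both permutations of each author straight into the candidate map, then let set pruning discard singletons and collapse groups with identical membership.

```python
def candidate_keys(author):
    # Both permutations of the author name: "A B" and "B A"
    parts = author.lower().replace(",", " ").split()
    return {" ".join(parts), " ".join(reversed(parts))}

def duplicate_groups(books):
    candidates = {}   # key -> set of book ids
    for book_id, author in books:
        for key in candidate_keys(author):
            candidates.setdefault(key, set()).add(book_id)
    # Set pruning: drop singletons, keep one copy of each distinct group
    seen = set()
    groups = []
    for ids in candidates.values():
        frozen = frozenset(ids)
        if len(frozen) > 1 and frozen not in seen:
            seen.add(frozen)
            groups.append(ids)
    return groups
```

Here "A B" and "B, A" both land under the keys "a b" and "b a", so the two identical groups collapse to one after pruning.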
I don't think there is a memory issue here, because the number of additional candidate sets == the number of authors.
Edit: The number of additional candidate sets is actually equal to the number of title/author pairs, not the number of authors. We will see if this is too big.