Old 02-09-2011, 04:58 AM   #63
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.
 
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Kiwidude mentioned an email I sent him on this issue in which I suggested using a custom column. Here it is, verbatim:
Spoiler:
I saw some discussion of doing some sort of duplicate detection. I confess that I didn't read it carefully, so what I say below may be meaningless.

It seems to me that the goal is to partition the library into sets of books that might be duplicates of each other. The question that immediately arises is: how does one determine that possibility? I propose a set of functions that produce comparable values.

I see four functions:
- fuzzy title. The function returns the fuzzed title. Several people have ideas on what this computation might look like; perhaps more than one fuzz algorithm might be offered.

- fuzzy author. Similar to fuzzy title.

- soundex title. Returns a soundex (sounds-like) string. Not sure whether it should operate on the straight title or the fuzzy title.

- soundex author. Similar to soundex title.

The user would pick which functions are to be used to determine if two books are duplicates. The more functions used, the more exact the match.
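
Purely as illustration, here is a minimal sketch of what two of these functions might look like in plain Python. The normalisation choices, and the names fuzzy_title and soundex, are assumptions of mine rather than a fixed design:

Code:
import re

def fuzzy_title(title):
    # Lowercase, drop a leading article, strip punctuation and whitespace
    t = title.lower().strip()
    t = re.sub(r'^(the|a|an)\s+', '', t)
    return re.sub(r'[^a-z0-9]', '', t)

def soundex(word):
    # Classic four-character soundex code
    codes = {'b': '1', 'f': '1', 'p': '1', 'v': '1',
             'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2',
             's': '2', 'x': '2', 'z': '2',
             'd': '3', 't': '3', 'l': '4', 'm': '5', 'n': '5', 'r': '6'}
    word = re.sub(r'[^a-z]', '', word.lower())
    if not word:
        return ''
    result = word[0].upper()
    last = codes.get(word[0], '')
    for ch in word[1:]:
        code = codes.get(ch, '')
        if code and code != last:
            result += code
        if ch not in 'hw':   # h and w do not reset the previous code
            last = code
    return (result + '000')[:4]
A composite key for matching could then be something like fuzzy_title(t) + '|' + soundex(fuzzy_title(t)).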


The algorithm:

Code:
from collections import defaultdict

somedict = defaultdict(set)
for book in books:   # books and chosen_functions come from the surrounding context
    # Apply the chosen functions in a fixed order and join the results
    r = '|'.join(f(book) for f in chosen_functions)
    somedict[r].add(book.id)
After running this algorithm, we have a dict of sets. If a set has more than one item in it, we have a potential duplicate. I propose that this partitioning be exposed to the user by populating an is_multiple custom column.
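
As an illustration, the groups containing potential duplicates can be pulled out of somedict (as built above) with a single comprehension:

Code:
# Keep only the keys whose set contains more than one book
dup_groups = {k: ids for k, ids in somedict.items() if len(ids) > 1}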

The algorithm:

Code:
clear_custcol()                      # placeholder: empty the custom column of all information
group_number = 1
for ids in somedict.values():
    if len(ids) > 1:
        label = 'group_' + str(group_number)
        for book_id in ids:
            add_to_custcol(book_id, label)   # placeholder: append the label for this book
        group_number += 1            # the original sketch omitted this increment
When done, each set of potential duplicates is represented by an item in the custom column, available in the tag browser and through search (e.g. #dups:group_1, assuming the column's lookup name is #dups). The user would select a group, decide whether or not any of the members are duplicates, and take the appropriate action. When done, the tag is deleted (this might be something that a plugin could do conveniently).

By using the highlight checkbox in conjunction with the search, the user can also see books 'near' the various duplicates. Not sure if this is useful.

Performance should be acceptable. Computing the dict of sets is linear in the number of books in the library; adding the tags is linear in the number of potential duplicates. My guess is that for a 10,000-book library this computation shouldn't take more than 10 to 20 seconds, especially if commit is set to false when adding the tags. Once the last_changed date is available, the computations can be cached using the per-book storage mechanism and recomputed only when the metadata has changed.
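
A minimal sketch of such a cache, assuming each book object exposes id and last_changed attributes (illustrative names, not the calibre API):

Code:
# Hypothetical per-book cache: book_id -> (last_changed, computed key)
cache = {}

def key_for(book):
    entry = cache.get(book.id)
    if entry is not None and entry[0] == book.last_changed:
        return entry[1]          # metadata unchanged, reuse the cached key
    r = '|'.join(f(book) for f in chosen_functions)
    cache[book.id] = (book.last_changed, r)
    return r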


In further conversations with kiwidude, the question of false positives came up. My suggestion would be to permit the user to say that two (or more?) given books are not duplicates. This information would be used by the duplicate detector to ensure that those books never appear together in a duplicate-book partition. The performance of this check would be very good if set arithmetic is used. Something like this (in pseudocode):

Code:
for partition_set in duplicate_partitions:     # the sets built by the dup finder
    for book_id in partition_set:
        # Drop any books the user has declared not-duplicates of this one
        result_set = partition_set - books_are_not_dups[book_id]
        if len(result_set) > 1:                # a singleton is not a duplicate group
            label = new_partition_label()      # placeholder: mint a fresh label
            for b_id in result_set:
                add_to_custcol(b_id, label)    # placeholder, as above
After running this test, we would have a custom tags-like column containing partition IDs where each ID marks a set of books that are potentially duplicates. Books that are declared not to be duplicates will never appear together in a partition.
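
The books_are_not_dups lookup used above could be as simple as a symmetric dict of sets; a sketch, with mark_not_duplicates as a hypothetical helper the plugin would call when the user declares two books distinct:

Code:
from collections import defaultdict

# Hypothetical exclusion store: book_id -> set of ids it is known NOT to duplicate
books_are_not_dups = defaultdict(set)

def mark_not_duplicates(id_a, id_b):
    # Record the exclusion both ways so either book's lookup sees it
    books_are_not_dups[id_a].add(id_b)
    books_are_not_dups[id_b].add(id_a)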