MobileRead Forums - View Single Post

chaley · 04-11-2011, 08:22 AM

We must first decide what it means to say two books are not duplicates. My position is that books marked as not duplicates are not duplicates, independent of what algorithm is used. Note that the 'not duplicate' relationship is not transitive: if books 1 & 2 are not duplicates and books 2 & 3 are not duplicates, we can say nothing about whether books 1 and 3 are duplicates.

I think that the natural implementation is to have pairs of books that are known to be 'not duplicates'. When sets of duplicates are built, they must be partitioned so that pairs of 'not duplicates' are never in the same set. Using the above example, if the algorithm produces a set [1, 2, 3], then partitioning will produce the set [1, 3]; book 2 cannot share a set with either 1 or 3, so it is left on its own and drops out. If we have a duplicate set [1, 2, 3, 4] with not-duplicate pairs [1, 2] and [3, 4], then the result will be four sets: [1, 3], [1, 4], [2, 3], [2, 4]. The algorithm I posted before does this partitioning.
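To make the partitioning concrete, here is a minimal Python sketch. It is not the algorithm referred to above, just one way to get the same results for the two examples. The function name `partition_duplicates` and the use of plain integer book ids are assumptions for illustration. It splits any set that contains a 'not duplicate' pair into two smaller sets, then keeps only the largest surviving sets that still hold at least two books.

```python
def partition_duplicates(group, not_dup_pairs):
    """Split one duplicate set so that no 'not duplicate' pair shares a set.

    group: iterable of book ids (hypothetical: plain integers).
    not_dup_pairs: iterable of (a, b) pairs known not to be duplicates.
    Returns a sorted list of sorted lists, each with at least two books.
    """
    candidates = {frozenset(group)}
    for a, b in not_dup_pairs:
        next_candidates = set()
        for s in candidates:
            if a in s and b in s:
                # The pair may not stay together: keep one copy without a
                # and one copy without b.
                next_candidates.add(s - {a})
                next_candidates.add(s - {b})
            else:
                next_candidates.add(s)
        candidates = next_candidates
    # Drop any set that is strictly contained in another surviving set.
    maximal = [s for s in candidates
               if not any(s < other for other in candidates)]
    # A single book on its own is not a duplicate group.
    return sorted(sorted(s) for s in maximal if len(s) > 1)


print(partition_duplicates([1, 2, 3], [(1, 2), (2, 3)]))
print(partition_duplicates([1, 2, 3, 4], [(1, 2), (3, 4)]))
```

The number of resulting sets can grow quickly when a large group has many 'not duplicate' pairs, so this sketch is only meant for small groups.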

I am not convinced that a custom column is the best place to keep the 'not duplicate' pairs. First, the data would be duplicated in both books, creating a maintenance problem (changing one means the other must change too). Second, the data must be 'multiple' because a book can participate in multiple pairs. I think I am saying that the plugin must accept a selection of books that are not dups and write that information to its own storage.
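As a rough illustration of plugin-owned storage, the sketch below keeps each pair exactly once, so there is nothing to keep in sync between the two books. Everything here is hypothetical: the class name `NotDuplicateStore`, its methods, and the JSON format are assumptions, not part of calibre or any existing plugin. It also assumes that marking a selection of several books means every pair within that selection is 'not duplicate'.

```python
import json
from itertools import combinations


class NotDuplicateStore:
    """Hypothetical store of 'not duplicate' pairs, one entry per pair."""

    def __init__(self, pairs=()):
        self._pairs = {self._key(a, b) for a, b in pairs}

    @staticmethod
    def _key(a, b):
        # Normalise the order so (1, 2) and (2, 1) are the same entry.
        return (a, b) if a <= b else (b, a)

    def mark_not_duplicates(self, book_ids):
        # Assumption: every pair within the selection is 'not duplicate'.
        for a, b in combinations(sorted(set(book_ids)), 2):
            self._pairs.add((a, b))

    def is_not_duplicate(self, a, b):
        return self._key(a, b) in self._pairs

    def remove_book(self, book_id):
        # One place to clean up when a book is deleted from the library.
        self._pairs = {p for p in self._pairs if book_id not in p}

    def to_json(self):
        return json.dumps(sorted(self._pairs))

    @classmethod
    def from_json(cls, text):
        return cls(tuple(p) for p in json.loads(text))


store = NotDuplicateStore()
store.mark_not_duplicates([3, 1, 2])
print(store.is_not_duplicate(2, 1))
print(store.to_json())
```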

As for whether or not to use a persistent column for the results: I tend toward persistence. I see three advantages of using a custom column. 1) I can process a group, then delete the group marker to indicate I am done. Running the algorithm again could find these books again, unless I am meticulous about marking not-dups. 2) I can do the work over time, knowing where I stopped. 3) I can select all the groups that a given book is in, giving a book-based view.

For this last one, the plugin could help by writing the transitive group information into a custom column. This would solve the sorting problem because the values written could be sortable: the base book could be #1, with the rest of the books being some other number. The problem you raise regarding books 2 and 3 'not duplicates' is also real, but I think it is a consequence of using book mode. Book-based browsing says something about a relationship between the base book and any other book, but nothing about the relationship between books that are not the base.
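Here is a small sketch of what those sortable values could look like. The function name `book_view_values` and the choice of 2 for the non-base books are assumptions; the only thing taken from the text above is that the base book gets 1 and the rest get some other number. It returns a mapping of book id to value rather than writing to a real custom column.

```python
def book_view_values(base_id, groups):
    """Return {book_id: sort_value} for every group containing base_id.

    The base book gets 1 and every other book gets 2 (an arbitrary
    'some other number'), so sorting on the value puts the base first.
    """
    values = {}
    for group in groups:
        if base_id in group:
            for book_id in group:
                values[book_id] = 1 if book_id == base_id else 2
    return values


groups = [[1, 3], [1, 4], [2, 3], [2, 4]]
print(book_view_values(1, groups))
```

With book 1 as the base, books 3 and 4 both show up in the view even though the pair [3, 4] is marked 'not duplicate', which is exactly the consequence of book mode described above.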

Have fun!
