MobileRead Forums - View Single Post

chaley · 02-09-2011, 11:42 AM

Quote:

Originally Posted by Starson17

I wonder if I've understood it correctly. The custom column would be is_multiple and populated with the book ids of potential duplicates?

No. It is populated with duplicate set names. I think these are the same as what you are calling group numbers.

Quote:

Presumably each book in the set of potential dupes would have the same content in that column, in the same order,

A book will have the set ids of every duplicate set it belongs to. Order is irrelevant. A book can be in multiple duplicate sets, especially when using the not-duplicate processing.
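
To make the column contents concrete, here is a minimal sketch in plain Python. It is not calibre or plugin code; the names `duplicate_sets` and `set_ids_by_book` and the book ids are hypothetical. It inverts a "set id to books" mapping into the "book to set ids" values the column would hold.

```python
from collections import defaultdict

# Hypothetical duplicate sets: set id -> ids of the books in that set.
duplicate_sets = {
    1: {101, 102},
    2: {102, 103, 104},
}

def set_ids_by_book(dup_sets):
    """Invert set id -> book ids into book id -> set ids."""
    membership = defaultdict(set)
    for set_id, book_ids in dup_sets.items():
        for book_id in book_ids:
            membership[book_id].add(set_id)
    # Books that are in no set never get a key, i.e. an empty column.
    return dict(membership)

column_values = set_ids_by_book(duplicate_sets)
```

Book 102 ends up with both set ids, and because the values are sets, their order carries no meaning.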

Quote:

and books with no potential dupes would have nothing in that column?

Yes, no entries.

Quote:

Highlighting would/could be used to highlight all members of a single dupe set?

Simply searching for a set id (group number) would find the potential duplicates. Use the highlight option if you want to see them in context.
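
As a rough illustration of the difference between searching and highlighting, here is a sketch in plain Python rather than calibre's own search code. The column contents and the function names `search_set` and `highlight_set` are assumptions made up for the example.

```python
# Hypothetical column contents: book id -> set ids stored in the custom column.
column_values = {
    101: {1},
    102: {1, 2},
    103: {2},
    104: set(),
}

def search_set(values, set_id):
    """Plain search: return only the books whose column contains set_id."""
    return sorted(book for book, ids in values.items() if set_id in ids)

def highlight_set(values, set_id):
    """Highlight mode: keep every book listed and flag the matching ones."""
    return {book: set_id in ids for book, ids in values.items()}
```

`search_set` narrows the list to the members of one set, while `highlight_set` leaves the whole library visible so the matches can be seen in context.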

Quote:

Since the column would be populated in known order, sorting by that column would put all dupes together?

No. The column is an is-multiple column (tags-like column). Sorting on the column is almost certainly not useful. Also, as a book can be in multiple duplicate sets, it isn't clear that sorting would say very much.

Quote:

That's very similar to my idea, except, I had in mind assigning a number for each dupe set, then highlighting odd numbered sets so that when sorted by the dupe set number, all members would be together for all dupe sets, and the highlighting would identify all dupe sets (the first set would be highlighted, the second even numbered set would not, the third odd numbered set would be highlighted, etc.)

Yes, it is similar to your numbering. However, given that a book can be in multiple sets, I am not sure about the odd/even highlighting. I think that searching for a dupe set and then sorting however you want is the way to go.

Quote:

I wasn't sure of the best way to avoid repeat false positives. Simply removing a book from any future dupe groups has problems. You might want to prevent a book from appearing in one test because it was truly a false positive, but later want to run another type of dupe test that might truly find it as a dupe of some other book in this different test mode. You might add another book later that does match a book that was previously marked as a false positive.

I am proposing that the user tell us that two (or more) books are not duplicates. This means that regardless of what any test says, these books should never be in the same duplicate group. I don't see a case where a user would say that the books are not duplicates, but later decide, based on the results of some other test, that they are.
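
One possible way to apply such exemptions, sketched in plain Python: this is an assumption about how it could work, not the plugin's algorithm, and `split_group` and `resolve_group` are hypothetical names. Whenever a candidate group contains an exempted pair, it is split into one group without the first book and one without the second, which is also how a book can end up in more than one set.

```python
def split_group(group, not_duplicates):
    """Split a candidate group until no exempted pair shares a group."""
    group = frozenset(group)
    for a, b in not_duplicates:
        if a in group and b in group:
            # Break the pair apart: one subgroup without a, one without b.
            return (split_group(group - {a}, not_duplicates)
                    | split_group(group - {b}, not_duplicates))
    # No exempted pair is left; a duplicate group needs at least two books.
    return {group} if len(group) > 1 else set()

def resolve_group(group, not_duplicates):
    """Return the split groups, dropping any that sit inside a larger one."""
    candidates = split_group(group, not_duplicates)
    return {g for g in candidates
            if not any(g < other for other in candidates)}
```

For example, `resolve_group({101, 102, 103}, [(101, 102)])` yields two groups, `{101, 103}` and `{102, 103}`, so book 103 belongs to both. The recursion can grow quickly with many exemptions, so treat it as an illustration only.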

Quote:

As usual - those are just random thoughts.

I guess Kiwidude has his work cut out for him.