MobileRead Forums - View Single Post

kiwidude · 04-12-2011, 11:32 AM

Hey Charles, thanks for the comments.

Yeah I was rather tired last night when I posted but wanted to ask for input before I crashed. I had come to the same conclusion of a dictionary by book of exclusions but wanted to make sure I wasn't missing some obvious alternative.

Part of my hesitation was thinking about ways users could mark not duplicates. For instance if I let it be a free-for-all, then it could be a lot of data. Say for instance I allowed a user to mark any books they liked together as not duplicates. So I have a library of 1,000 books as a starting point, and I am confident I have no duplicates. If I selected the whole lot and said mark as not duplicates, then I would ensure that any duplicates that came up in future were not books I had already considered. I could skip through the book groups with a very "loose" duplicate matching algorithm, then rather than resolving each group I could just resolve the ones I thought were dups, then select my whole library and mark it as not containing duplicates again.

From a user perspective this is kind of simple. From a data perspective it is a bit of a disaster, storing a cross-map table of every id with every other id.
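To put a rough number on that, here is a small sketch of how many pairings a "mark the whole library" action would imply. The function name pair_count is made up for illustration, and it assumes the cross-map stores each unordered pair once.

```python
from itertools import combinations

def pair_count(n):
    """Number of unordered pairs among n books: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# Hypothetical library of 1,000 book ids, as in the example above.
library_ids = range(1, 1001)

# Brute-force check that the formula matches enumerating every pair.
enumerated = sum(1 for _ in combinations(library_ids, 2))
assert enumerated == pair_count(1000)

print(pair_count(1000))  # 499500
```

So one click on a 1,000 book library means close to half a million pairings, and twice that many entries if each pair is stored under both ids.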

So to prevent that scenario, I would instead only allow a user to mark as not duplicates books in the found sets - if the selection you make is not found together in the same set by the search algorithm you can't mark them as not duplicates. That would keep the data volumes down, at the expense of more clicks required by the user.
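A minimal sketch of that restriction, assuming the search results are available as a list of groups of book ids. The name can_mark_not_duplicates and the shape of found_groups are assumptions for illustration, not anything the plugin actually exposes.

```python
def can_mark_not_duplicates(selection, found_groups):
    """Return True only if every selected id sits together in one found group.

    selection: iterable of book ids the user has selected (hypothetical).
    found_groups: list of groups of book ids from the duplicate search.
    """
    selected = set(selection)
    if len(selected) < 2:
        # A single book has nothing to be "not a duplicate" of.
        return False
    return any(selected <= set(group) for group in found_groups)
```

With found_groups of [[1, 2, 3], [4, 5]], selecting books 1 and 3 would be allowed, while selecting 3 and 4 would be rejected because they never appeared in the same group.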

Perhaps a compromise would be an additional menu option of something like "mark all groups as not duplicates". So a user could still skip through their duplicate results found, and then if they are happy they never want to see those again they can with one click add the various pairings within those groups as marked pairs. That would be dramatically less data while minimising the user effort. Does that sound useful?
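Roughly what "mark all groups as not duplicates" could do against a dictionary by book of exclusions. This is a sketch under assumptions: the exclusions are held as a dict mapping each book id to the set of ids it is marked as not a duplicate of, and the function name is invented for the example.

```python
from collections import defaultdict
from itertools import combinations

def mark_all_groups_not_duplicates(groups, exclusions=None):
    """Add every pairing within each found group to the exclusion dictionary.

    groups: list of groups of book ids from the duplicate search.
    exclusions: dict of book id -> set of book ids (assumed layout).
    """
    if exclusions is None:
        exclusions = defaultdict(set)
    for group in groups:
        for a, b in combinations(set(group), 2):
            # Store the pair under both ids so lookup works from either book.
            exclusions[a].add(b)
            exclusions[b].add(a)
    return exclusions

exclusions = mark_all_groups_not_duplicates([[1, 2, 3], [7, 9]])
stored_pairs = sum(len(ids) for ids in exclusions.values()) // 2
print(stored_pairs)  # 4
```

Only pairs inside a group get stored, so a group of three adds three pairs and a group of two adds one, rather than every id against every other id in the library.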

As for the presentation, yeah need to think about that one. It has the same issues (I think) as a book based view. You have a root book, and you have the related books marked as not duplicates. From a user perspective you perhaps initially want to see the set of all books marked as not duplicates, then when you click on one be able to see which books it is marked as not a duplicate of. As you say the whole nasty issue of transitivity comes up when you visually display that information to consider the pairings.

Maybe the easiest way is to only allow the user to remove all exceptions for a book. So we have a menu option of "Show marked not duplicates" which displays every book that has some kind of a not duplicate relationship. Then a user can choose a book and choose "Remove from not duplicates" which breaks all associations other books have with it. So if I have (1,2) and (1,3) marked as not duplicates, I choose "Show" which will display 1, 2, 3. If I choose "Remove" on book 1, then both its exclusion relationships, with 2 and with 3, are deleted. If 2 & 3 have no exclusions with other books then they also get removed from the exclusion dictionary, so when it refreshes the view it would display no results.

So if a user wanted to remove just the (1,2) relationship, they would need to do the remove on book 2. Then when it refreshes it would still be showing book 1 and book 3. Without having some visual indication of the relationship pair graph between the books I think it is the simplest approach.
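The show/remove behaviour described above can be sketched like this. It assumes the same dict of book id to set of ids layout for the exclusion dictionary; both function names are hypothetical stand-ins for the two menu options.

```python
def books_with_exclusions(exclusions):
    """What "Show marked not duplicates" would list: every id with a relationship."""
    return sorted(exclusions)

def remove_from_not_duplicates(exclusions, book_id):
    """Break every exclusion involving book_id, dropping ids left with none."""
    for other in exclusions.pop(book_id, set()):
        partners = exclusions.get(other)
        if partners is None:
            continue
        partners.discard(book_id)
        if not partners:
            # The other book has no remaining exclusions, so it leaves the view.
            del exclusions[other]

# (1,2) and (1,3) marked as not duplicates.
case_a = {1: {2, 3}, 2: {1}, 3: {1}}
remove_from_not_duplicates(case_a, 1)
print(books_with_exclusions(case_a))  # []

case_b = {1: {2, 3}, 2: {1}, 3: {1}}
remove_from_not_duplicates(case_b, 2)
print(books_with_exclusions(case_b))  # [1, 3]
```

Removing on book 1 empties the view, while removing on book 2 only drops the (1,2) pairing and leaves books 1 and 3 showing, which matches the two cases walked through above.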

Any thoughts/objections?
