MobileRead Forums - Duplicate detection plugin

kiwidude · 04-10-2011, 03:12 PM

Call me a masochist, but I can't resist the challenge of getting something up and running for this. I've tried to say no, but I can't. It is a disease.

We had a bunch of really good discussions on this in the Duplicate Detection thread - Starson17 & Kovid implemented the automerge changes, which occupied the first four pages, then the posts more relevant to the plugin began in earnest with some good stuff being thrown around. I considered continuing this post on that thread, but it is clear it is now all about plugin design decisions and has no place in the library management forum.

So having reviewed the thread, I have had a few thoughts.

Re-using the library view is a no-brainer for me now. Let's keep the plugin as simple as possible and let the user retain all the custom columns, sorting, right-clicks etc. they have today to decide how to resolve a merge.
There was much discussion about reviewing results by "duplicate set" versus "by book that has duplicates". There seemed to be pros and cons with both. The immediate issue I have with the latter is how you can visually indicate in the GUI which book is the "root" that you are considering, since you cannot rely on sorting. The only way I could think of would be a search restriction on that subset of books, then using highlighting to mark the root book in the results. But that could be messy for the user to exit out of. Unless you have other ideas, I will start off with just set-based result reviewing for now and we can add the other later if desired.
The plugin will display one duplicate "result" at a time (be it set or book based), and the user moves to the next result via a right-click menu option or keyboard shortcut.
When a user chooses the "Find duplicates" option they get a popup dialog letting them choose the algorithm to apply (assuming more than one is available eventually). Eventually they may also get the option of whether to review the results by set or by book, and other "stuff" like automerging.
As the user can merge results as they go, the "next result" navigation will need to be fairly dynamic. I think the duplicate result sets should be held in memory by the plugin, rather than persisted into a custom column with a group number. We now have the "marked" feature to be able to display books matching a criterion. So when a user moves to the next result, I can grab the next set from memory, check whether any of the books have been deleted (resulting from a merge or delete), and if more than one book is still left then "mark" those books and display them in the GUI. Simples.
Taking this approach will mean that if they restart Calibre they must run their dup search again to rebuild the in-memory result sets. However, as we anticipate this being fairly quick, I don't foresee any issues with that.
I like Starson17's suggestion of a custom tags column that stores some sort of alg01-bookid1-bookid2 combination when a user right-clicks to indicate a false positive. Presumably you would only need to store this on the book with the lowest id if the dup finding algorithm also sorts by ids. There is the issue of what happens if the user subsequently merges that book into bookid4. I think in that situation the false positive record should become invalid? Without this being completely integrated into the merge code of Calibre, I think the dup checking should just take responsibility for clearing out such "dud" false positive entries when it finds them.
To begin with I will only run one algorithm at a time, not multiple at once. So a user would run the "exact author, fuzzy title" algorithm (i.e. automerge) and have pretty high confidence that they will be able to merge the results found. Then after reviewing their results they would run the "fuzzy author, fuzzy title" algorithm or whatever, and know that they may get a lot of results which will require far more careful identification of false positives. Mixing the results together, particularly if we do a book-based review, could get horribly confusing.
Due to all the complexities of choosing "who wins", I don't want to actually perform the merge or offer that as an option at this point - it is up to the user to go through each set, do the merge right-clicks, move to the next set etc. Maybe down the road there might be some kind of "automerge" suboption based on choices like "oldest is the master" or whatever, but I will avoid this initially at least.
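
To make the in-memory navigation and the false positive bookkeeping above a bit more concrete, here is a rough Python sketch. It is not plugin code and does not touch Calibre at all: `DuplicateResults`, `false_positive_key`, `next_result`, `prune_false_positives` and the `book_exists` callback are all made-up names for illustration, and it assumes a false positive key lists every id in the set in ascending order after the algorithm name (with no hyphen in the algorithm name). The actual "marking" of the returned ids in the GUI is left to the caller.

```python
from typing import Callable, Iterable, Optional, Set


def false_positive_key(algorithm: str, book_ids: Iterable[int]) -> str:
    """Build an 'alg01-5-8' style key, with the ids sorted ascending."""
    return "-".join([algorithm] + [str(i) for i in sorted(book_ids)])


class DuplicateResults:
    """Hypothetical holder for duplicate sets kept in memory only."""

    def __init__(self, algorithm: str,
                 duplicate_sets: Iterable[Iterable[int]],
                 book_exists: Callable[[int], bool],
                 false_positives: Iterable[str] = ()) -> None:
        self.algorithm = algorithm
        self.pending = [set(s) for s in duplicate_sets]
        # Assumed callback: returns True if the book id is still in the library.
        self.book_exists = book_exists
        self.false_positives = set(false_positives)

    def next_result(self) -> Optional[Set[int]]:
        """Return the next set that is still worth showing, or None."""
        while self.pending:
            candidate = self.pending.pop(0)
            # Drop books removed since the search ran (merged or deleted).
            alive = {b for b in candidate if self.book_exists(b)}
            if len(alive) < 2:
                continue  # nothing left to compare
            if false_positive_key(self.algorithm, alive) in self.false_positives:
                continue  # user already said these are not duplicates
            return alive  # the caller would "mark" these ids and display them
        return None

    def prune_false_positives(self) -> None:
        """Clear "dud" keys that mention a book which no longer exists."""
        self.false_positives = {
            key for key in self.false_positives
            if all(self.book_exists(int(i)) for i in key.split("-")[1:])
        }


# Usage: book 4 has been merged away, and 5/8 were flagged as a false positive.
library = {1, 2, 3, 5, 8}
results = DuplicateResults(
    "alg01",
    [{1, 2}, {3, 4}, {5, 8}],
    book_exists=lambda book_id: book_id in library,
    false_positives={"alg01-5-8", "alg01-3-4"},
)
first = results.next_result()    # the set of books 1 and 2
second = results.next_result()   # None: {3, 4} shrank to one book, {5, 8} is skipped
results.prune_false_positives()  # drops "alg01-3-4" because book 4 is gone
```

Because nothing is persisted apart from the false positive keys, restarting would simply mean building a new `DuplicateResults` from a fresh search.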

You guys have any further thoughts on this before I take the plunge?
