MobileRead Forums - View Single Post

kiwidude · 02-07-2011, 06:09 PM

Quote:

Originally Posted by Starson17

He's fixed it up and merged it into the trunk. Feel free to test it, if you can run from code.

Awesome - I've got it now.

I wanted to bring a comment I made, and Starson17's response to it, over from a different thread so we can discuss it further here:

Quote:

Originally Posted by Starson17

Quote:

Originally Posted by kiwidude

Certainly the 1.0 version may "only" have the exact same comparison logic Starson's automerge functionality has - of exact match on author, fuzzy on title.

That's fine by me, but remember that the reason I didn't fuzzy match more aggressively is that automerge is AUTOmatic. An error in automerge meant a lost book format. If we are doing duplicate finding with manual review on a dialog screen with merging manually controlled to find the best format or not merge, we can be much more aggressive.

My emphasis was meant to be on "1.0" and "may", as that was in line with some of the ideas we have discussed on this thread. I totally agree that, at least longer term, we will want fuzzy matching on author and "fuzzier" matching on title. As you say, if the user has an interactive opportunity to view formats and reject false positives in their own time, then "fuzzy" becomes a good thing.

In my mind these are the reasons why people may be in a duplicate situation (there could well be others I haven't thought of):

1. users who have never discovered the automerge option - so this will be their first chance to try a duplicate comparison on their library, but in a non-destructive fashion.
2. users who knew about it but kept it turned off because they didn't want duplicate formats automatically being discarded before they could view them.
3. any situation in which the user edits the title or author - the result may make the book a duplicate of another record. They may have had automerge turned on when they added it, but their filename did not match the regular expression or was not "exact" enough for the logic to work at that time.
4. users who get the next version of Calibre and use the new "Create new record for duplicate format" option with automerge.
5. users who have had automerge on, but had variations in the title that exceed its current "fuzzy" matching logic. Similar to #3 above, except the user has not yet identified that the title should be edited.
6. users who have duplicated authors due to very slight variations in their names, such as spaces, variations of first names/initials/punctuation etc. As the current automerge is (very rightly) conservative in doing an exact match only on author, the slightest difference results in a duplicate.

So an initial duplicate finder release that reuses the exact same comparison logic automerge has today (exact match author, fuzzy title) will help out all but the last two situations. Which imho is the 90% scenario for users, but I have absolutely no facts to support this so feel free to disagree. It is certainly the easiest use case.
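
To make "exact match author, fuzzy title" a bit more concrete, here is a minimal sketch in plain Python. It is not the actual automerge code: the `fuzzy_title` rules (lowercase, drop punctuation, drop a leading article) and the `(book_id, author, title)` tuples are purely my assumptions for illustration.

```python
import re
from collections import defaultdict

def fuzzy_title(title):
    """Hypothetical normalisation, NOT Calibre's real rules:
    lowercase, replace punctuation with spaces, drop one leading
    article, then collapse runs of whitespace."""
    t = title.lower()
    t = re.sub(r"[^\w\s]", " ", t)
    t = re.sub(r"^\s*(a|an|the)\s+", "", t)
    return " ".join(t.split())

def find_duplicate_groups(books):
    """books: iterable of (book_id, author, title) tuples.
    Author must match exactly; title is compared after normalisation.
    Returns only the groups that contain more than one book."""
    buckets = defaultdict(list)
    for book_id, author, title in books:
        buckets[(author, fuzzy_title(title))].append(book_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

# Made-up sample library
books = [
    (1, "Iain M. Banks", "The Player of Games"),
    (2, "Iain M. Banks", "Player of Games"),
    (3, "Iain M Banks", "The Player of Games"),  # author differs by one "."
    (4, "Iain M. Banks", "Excession"),
]
groups = find_duplicate_groups(books)
```

With this sample data books 1 and 2 end up in one group, while book 3 is missed because of the author punctuation, which is exactly the kind of case that needs fuzzy author matching later on.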

Also, by keeping the same conservative matching logic, a user can have confidence that the entries in the duplicate results list are "genuine" duplicates. Well, unless their title or author is completely wrong of course, but that is something only visual inspection of the book format can identify.

The results of a duplicate search using this logic may contain some groups that need less manual visual inspection to merge than others (i.e. groups which have multiple books but no duplicate formats, versus groups that do have duplicate formats). However, there are still problems which prevent the plugin from being "automatic" in merging even those "safer" ones. As Starson17 at least is aware, the problem is the book metadata. If you have only just added a duplicate to your library, then the duplicate that is "oldest" by book date is most likely (but not always) the one which contains the metadata you want to keep (series information, comments, ISBN, cover, conversion settings etc). That behaviour would effectively match what happens today when automerge is turned on and you add new formats of a book.
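
As a rough illustration of the "oldest record wins" idea, something like the following would pick a default merge target. This is only a sketch: the `(book_id, date_added)` tuples and the `pick_merge_target` name are made up, and as noted the oldest record is not always the right choice.

```python
from datetime import date

def pick_merge_target(records):
    """records: list of (book_id, date_added) tuples for one duplicate group.
    Returns (target_id, other_ids), where the target is the oldest record
    and other_ids are the remaining books in date order."""
    ordered = sorted(records, key=lambda r: r[1])
    target = ordered[0][0]
    return target, [book_id for book_id, _ in ordered[1:]]

# Made-up group of three duplicates
target, sources = pick_merge_target([
    (42, date(2011, 2, 1)),
    (7, date(2009, 6, 15)),
    (19, date(2010, 11, 30)),
])
```

Here book 7 would be suggested as the record to merge into, with 19 and 42 as the books to merge from.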

However what if (as frequently happens) both book records have been in your library for a while, so both have metadata assigned but they differ in content? Maybe they have different series names (or one has one, the other doesn't), etc, etc. That is why there are so many "merge" submenus - as a user you have been given the power to merge to cover lots of scenarios of wanting only certain data kept with a particular merge direction.

Which (slowly, sorry) brings me back to my initial suggestion in the thread of wanting to use the power of the library view for the plugin (rather than a popup dialog displaying duplicate search results). Doing so means the user has the full set of merge menus that exist today and that they already know. They can pop open the edit metadata dialog or make changes directly in the grid before they merge. They can scroll up and down comparing covers/comments in the book view. If they have custom columns like "Read yes/no", these will be visible to help identify which version to keep. Of course they also have all the existing ability to view formats. Sure, we could duplicate most of this functionality in a popup dialog, but that's not a great long term solution imho.

What users don't have in the library view currently is a way of visually identifying duplicate groups. As Charles kindly suggested to me via email one possibility is to use a custom column, which stores a duplicate group number against all the potential candidates as a result of running the duplicate find. Perhaps you could also add a second column giving you some kind of informational message or severity (may not be needed). You would be able to use the tag browser and search capabilities of Calibre to query against/display your duplicate groups. We can also wrap that up in helper menu items in the plugin as well to make it easy to bring back your duplicates.
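
A sketch of what populating such a column could amount to is below. I am deliberately not using any Calibre API here: the "column" is just a dict of book id to group number, and both function names are hypothetical.

```python
def assign_group_numbers(groups):
    """groups: list of lists of book ids (one list per duplicate group).
    Returns a dict mapping each book id to a 1-based group number,
    standing in for the values written to the custom column."""
    column = {}
    for number, ids in enumerate(groups, start=1):
        for book_id in ids:
            column[book_id] = number
    return column

def books_in_group(column, number):
    """Equivalent of searching the library for one group number."""
    return sorted(b for b, n in column.items() if n == number)

column = assign_group_numbers([[1, 2], [7, 9, 12]])
second_group = books_in_group(column, 2)
```

The real plugin would write these numbers into the custom column, so the normal search and tag browser could do the job of `books_in_group`.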

That imho is the only way for users to have sufficient information and gui options available to make the "hard decisions" about resolving a merge group. The plugin would take responsibility for populating the duplicate groups custom column, presumably with the ability to toggle the custom columns in/out of your library view. And you could launch different types of duplicate searches with the plugin - initially the "exact author, fuzzy title" logic, but eventually other "fuzzier" searches. You can take your time resolving the duplicates across multiple Calibre sessions without rerunning the search, since there is no popup dialog. Or you can run it again across different subsets of records, with different matching algorithms etc.

Those are my long winded thoughts for now, comments appreciated as always. There are of course still issues, like clearing the duplicate group custom column values after a merge. And once you start supporting "fuzzier" matching algorithms, you have the issue of repeatedly looking at false positives unless we come up with a way of the user flagging exclusions over time.
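
For the exclusions idea, one possible shape (purely an assumption on my part, nothing is decided) is to remember pairs of book ids the user has flagged as "not duplicates" and hide any group where every pair has already been flagged:

```python
from itertools import combinations

def filter_excluded(groups, exclusions):
    """groups: list of lists of book ids from a duplicate search.
    exclusions: set of frozenset({id_a, id_b}) pairs the user has
    flagged as not being duplicates of each other.
    A group is dropped only if ALL of its pairs have been flagged."""
    kept = []
    for ids in groups:
        pairs = [frozenset(p) for p in combinations(ids, 2)]
        if not all(p in exclusions for p in pairs):
            kept.append(ids)
    return kept

# The user has previously said books 1 and 2 are not duplicates
exclusions = {frozenset((1, 2))}
remaining = filter_excluded([[1, 2], [3, 4], [1, 2, 5]], exclusions)
```

The `[1, 2]` group disappears from later searches, but `[1, 2, 5]` still shows up because book 5 has not been compared against the other two yet.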