MobileRead Forums - View Single Post

kiwidude · 05-02-2011, 08:43 AM

An interesting point. Given that I have been dishing out "this is not a duplicate issue, it should be in Quality Check" answers to others, it is only fair that I get the same suggestion made to me.

I guess what I saw as slightly different in this situation is that "Author Duplicate" searches are all about finding similar variants of the same name. We currently find A B and B, A (as well as A C. B, etc.). So my thought was that if you are showing A B and B, A, then why not also show A, B and B A? That's where I would suggest there is an argument that the check "could" have something to do with duplicates. However, as you can see from my own post edits, such thinking can easily spread into the title-based searches as well, and the line becomes very grey.

However, I completely agree that putting it in as a Quality Check function is a consistent alternative approach. After all, if we are saying that QC should try to detect titles having series info in them, and (one day) titles and authors being the wrong way around, why not also have a check for reversed author names? Particularly as it already has slight variants such as the checks for authors with/without commas. The downside is that it is "another thing" you must run from time to time.

From an implementation perspective, if it was in Find Duplicates I was thinking I could compute two hashes for each author name: one as per now, and one reversed. Store the alternate hash in a separate dictionary, then at the end do a pass through that dictionary testing whether the reversed hash value exists in the candidate groups, and if so merge the results together. Something like that anyway; I have not given it masses of thought.
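To make the idea a bit more concrete, here is a rough Python sketch of that two-hash pass. It is not the plugin's actual code: `author_hash`, `reversed_hash` and `find_author_groups` are hypothetical names, and the normalisation in `author_hash` is only a simplified stand-in for whatever Find Duplicates really does today (it just lowercases, drops full stops and flips "B, A" into "a b").

```python
from collections import defaultdict


def author_hash(name):
    """Hypothetical stand-in for the current hash: 'A B' and 'B, A' both give 'a b'."""
    name = name.strip().lower()
    if "," in name:
        # Treat 'Last, First' as 'First Last'
        last, _, first = name.partition(",")
        name = first + " " + last
    return " ".join(name.replace(".", " ").split())


def reversed_hash(name):
    """Assumed alternate hash: the same tokens in reverse order."""
    return " ".join(reversed(author_hash(name).split()))


def find_author_groups(authors):
    groups = defaultdict(set)   # normal hash -> candidate group of author names
    alternates = {}             # normal hash -> reversed hash, kept separately
    for author in authors:
        h = author_hash(author)
        groups[h].add(author)
        alternates[h] = reversed_hash(author)

    # Final pass: if the reversed hash also has a candidate group, merge the two.
    for h, rh in alternates.items():
        if h != rh and h in groups and rh in groups:
            groups[h] |= groups.pop(rh)

    # Only groups with more than one name are of interest as duplicates.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Fed "John Smith", "Smith, John", "Smith John" and "John, Smith", this would put all four in a single group, where the normal hash alone would give two separate groups of two. Single-word names are skipped in the merge pass because their reversed hash equals the normal one.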
