MobileRead Forums - View Single Post

kiwidude · 04-16-2011, 08:34 AM

Quote:

Originally Posted by ldolse

Yeah, I realized there was one aspect that does indeed make it special - in the case where all the author records are identical (no different fuzzy matches) it's not a dupe (unlike a book), so those authors would need to not be marked. Probably not an insurmountable problem though, not sure if a different GUI would be required

I'm not entirely sure if we are talking about the same thing here or if you just had a typo. Say I have in my library these two books:

1. The Girl With the Dragon Tattoo - Stieg Larsson
2. The Girl With the Dragon Tattoo - S. Larsson

Finding these books is a duplicate book search. None of the algorithms to date will find this scenario.

Now say instead I have these two books

1. The Girl With the Dragon Tattoo - Stieg Larsson
2. The Girl Who Played With Fire - S. Larsson

This is a duplicate author search. None of the algorithms, existing or proposed, would detect this. This is why I said "ignore title, fuzzy author" would be the only way to get them together as a group.

Now if you wanted something that gave you fewer false positives, you would probably also want an "ignore title, similar author" search to catch this situation (fuzzy author would also catch it, but it might "bury" it in loads of results):
1. The Lord of the Rings - J.R.R. Tolkien
2. The Hobbit - J. R. R. Tolkien
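
To make the difference between "similar author" and "fuzzy author" concrete, here is a minimal sketch. Both key functions are hypothetical illustrations, not the plugin's actual algorithms: I am assuming "similar" means ignoring case, punctuation and spacing, and "fuzzy" means reducing a name to first initial plus surname.

```python
import re

def similar_author_key(author):
    """Assumed 'similar' match: ignore case, punctuation and whitespace."""
    return re.sub(r"[\W_]+", "", author.lower())

def fuzzy_author_key(author):
    """Assumed 'fuzzy' match: first initial plus last name only."""
    tokens = re.findall(r"[^\W_]+", author.lower())
    if not tokens:
        return ""
    return tokens[0][0] + " " + tokens[-1]

# The Tolkien spellings already agree under the stricter "similar" key.
tolkien_similar = (similar_author_key("J.R.R. Tolkien")
                   == similar_author_key("J. R. R. Tolkien"))

# The Larsson spellings only agree once the key is loosened to "fuzzy".
larsson_similar = (similar_author_key("Stieg Larsson")
                   == similar_author_key("S. Larsson"))
larsson_fuzzy = (fuzzy_author_key("Stieg Larsson")
                 == fuzzy_author_key("S. Larsson"))
```

Under these assumptions the "similar" key is enough for the Tolkien pair, while the Larsson pair needs the "fuzzy" key, which is also the one that will sweep in unrelated authors sharing an initial and a surname.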

So we have yet another permutation (you can see why I am tempted to treat title and author as independently set algorithms?)
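
One way to picture title and author as independently set algorithms is sketched below. This is only an illustration under my own assumptions: the mode names, the lookup tables and `find_groups` are hypothetical, and the matching rules are simplified stand-ins for whatever the real algorithms end up doing.

```python
import re
from collections import defaultdict

def _squash(text):
    # Assumed "similar" rule: ignore case, punctuation and whitespace.
    return re.sub(r"[\W_]+", "", text.lower())

def _initial_and_surname(author):
    # Assumed "fuzzy" rule: first initial plus last name.
    tokens = re.findall(r"[^\W_]+", author.lower())
    return (tokens[0][0] + " " + tokens[-1]) if tokens else ""

# Each field gets its own table of key functions, chosen separately.
TITLE_ALGORITHMS = {
    "identical": lambda title: title,
    "similar": _squash,
    "ignore": lambda title: "",
}
AUTHOR_ALGORITHMS = {
    "identical": lambda author: author,
    "similar": _squash,
    "fuzzy": _initial_and_surname,
}

def find_groups(books, title_mode, author_mode):
    """Group (title, author) pairs whose combined key matches."""
    title_key = TITLE_ALGORITHMS[title_mode]
    author_key = AUTHOR_ALGORITHMS[author_mode]
    groups = defaultdict(list)
    for title, author in books:
        groups[(title_key(title), author_key(author))].append((title, author))
    # Only keys shared by more than one book form a group.
    return [group for group in groups.values() if len(group) > 1]

books = [
    ("The Girl With the Dragon Tattoo", "Stieg Larsson"),
    ("The Girl Who Played With Fire", "S. Larsson"),
    ("The Lord of the Rings", "J.R.R. Tolkien"),
    ("The Lord of the Rings", "J. R. R. Tolkien"),
]

similar_similar = find_groups(books, "similar", "similar")
ignore_similar = find_groups(books, "ignore", "similar")
ignore_fuzzy = find_groups(books, "ignore", "fuzzy")
```

With this layout, adding a new author algorithm does not require touching the title side, and every permutation comes from picking one entry from each table.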

Then you have to think about how you are going to resolve this scenario. For a start, you will want to rename all instances of that author. Before you make that decision, you will want to check that they are indeed the same author. There are of course many genuine situations where J. Smith and J.L. Smith are different authors. So you would want all the books under each of those author names on screen to compare. An "ignore title, similar author" search would give you that, though in the "ignore title, fuzzy author" case it may also give you a load of other authors.
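
As a rough sketch of what that comparison view needs underneath, the code below groups books by an author key and then lists the titles under each spelling. It skips groups where every author record is spelled identically, since those are not duplicate authors. The function names, the fuzzy rule and the placeholder titles ("Book A" and so on) are all made up for illustration.

```python
import re
from collections import defaultdict

def fuzzy_author_key(author):
    # Assumed "fuzzy" rule: first initial plus last name.
    tokens = re.findall(r"[^\W_]+", author.lower())
    return (tokens[0][0] + " " + tokens[-1]) if tokens else ""

def author_variant_groups(books, author_key=fuzzy_author_key):
    """Return one dict per suspect author: {spelling: [titles]}.

    Titles are ignored for matching. A key with only one distinct
    spelling is not reported, because there is nothing to rename.
    """
    by_key = defaultdict(lambda: defaultdict(list))
    for title, author in books:
        by_key[author_key(author)][author].append(title)
    return [
        {author: sorted(titles) for author, titles in variants.items()}
        for variants in by_key.values()
        if len(variants) > 1
    ]

books = [
    ("Book A", "J. Smith"),       # placeholder titles
    ("Book B", "J.L. Smith"),
    ("Book C", "J. Smith"),
    ("The Hobbit", "J. R. R. Tolkien"),
]

groups = author_variant_groups(books)
```

Here the two Smith spellings come back together with all of their books so you can judge whether they are really the same person, while the lone Tolkien record is left out.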

Another scenario
1. The Lord of the Rings - J.R.R. Tolkien
2. The Lord of the Rings - J. R. R. Tolkien

You would find this pairing with the "similar title, similar author" search we have in there currently. But on spotting it, you would again likely want to rename one of those author variations, and then perhaps run the whole duplicate search again.
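
The rename-then-search-again loop could look something like the sketch below. `similar_duplicates` and `rename_author` are hypothetical helpers operating on a plain list of (title, author) pairs, not on a real calibre library, and the "similar" rule is my simplified assumption.

```python
import re
from collections import defaultdict

def similar_key(text):
    # Assumed "similar" rule: ignore case, punctuation and whitespace.
    return re.sub(r"[\W_]+", "", text.lower())

def similar_duplicates(books):
    """'similar title, similar author' search over (title, author) pairs."""
    groups = defaultdict(list)
    for title, author in books:
        groups[(similar_key(title), similar_key(author))].append((title, author))
    return [group for group in groups.values() if len(group) > 1]

def rename_author(books, old_name, new_name):
    """Return a new list with every exact occurrence of old_name replaced."""
    return [(title, new_name if author == old_name else author)
            for title, author in books]

books = [
    ("The Lord of the Rings", "J.R.R. Tolkien"),
    ("The Lord of the Rings", "J. R. R. Tolkien"),
    ("The Hobbit", "J. R. R. Tolkien"),
]

before = similar_duplicates(books)   # spots the pair with two spellings
books = rename_author(books, "J. R. R. Tolkien", "J.R.R. Tolkien")
after = similar_duplicates(books)    # second pass on the cleaned-up names
```

After the rename every book sits under one spelling, and the second pass still reports the two copies of The Lord of the Rings, now as a plain duplicate book to resolve.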

The fuzzier that author match is, the more false positives you are going to get, but a fuzzy match is also the only way you will catch the genuine duplicates that come from variations in first name or initials.

This is rambling, I know, but perhaps it explains a little of the variations I think we (ultimately) need to cater for.