Old 04-20-2011, 10:26 AM   #117
chaley
I thought fuzzy wuzzy was a bear, not a duck.

Quote:
Originally Posted by kiwidude
What I found however is that I think it is a bit too fuzzy/aggressive for a "similar" author search. Specifically what it does that goes across my personal desire for "similar" is that it removes initials. So for example "J. Smith" becomes "Smith" and would match with "W. Smith" in a duplicate search.

Which brings the question of how fuzzy wuzzy does each algorithm go

So - my suggestion is that "similar authors" will use the same logic as get_author_tokens, but not strip initials. So that will be left with handling removing punctuation, different spacing and reversal of names like LN,FN to FN LN.
Sounds good, at least in theory. It is very conservative, which will be what is needed in many situations. Usage will tell, I suppose.
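
To make the conservative "similar" idea concrete, here is a minimal sketch. It is not the plugin's code and not get_author_tokens; similar_author_key is a hypothetical name, and sorting the words is just one simple way to treat "LN, FN" and "FN LN" as the same name.

```python
import re

def similar_author_key(name):
    """Hypothetical key: same name, ignoring punctuation, spacing and word order."""
    # Replace punctuation with spaces and lowercase, then split on any whitespace.
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    # Sorting means "Smith, Wayne" and "Wayne Smith" give the same key.
    # Initials are kept, so "J. Smith" and "W. Smith" stay distinct.
    return " ".join(sorted(words))

print(similar_author_key("Smith, W."))   # smith w
print(similar_author_key("W.  Smith"))   # smith w
print(similar_author_key("J. Smith"))    # j smith
```

Two authors would be "similar" when their keys are equal. Note that "Wayne Smith" and "W. Smith" do not match at this level, which is the conservative behaviour described above.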
Quote:
So W. Smith / Wayne Smith / Smith, W. would all match under either fuzzy proposal. However W. Smith / S. Smith would not return as a match under the second.

Or perhaps you have different ideas for "similar" and "fuzzy". What are your thoughts?
I am a bit confused about which is which. I think you are saying:
Similar: same name ignoring punctuation and word order
Fuzzy, alt 1: Strip initials (one letter words?). Match what is left.
Fuzzy, alt 2: At least one word matches (how long must the word be?). The first letters of the other words must match. Note that with this algorithm, I think Sam Wayne would match Wayne Smith. I don't see how you can avoid this, unless you start attaching great meaning to commas.
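
A rough sketch of the two fuzzy alternatives as I read them, to show where they differ. All names here (name_words, fuzzy_alt1_key, fuzzy_alt2_match) are made up for illustration, and min_len=2 is an assumed answer to the "how long must the word be?" question, not a recommendation. I also assume alt 2 requires the same number of leftover words on both sides.

```python
import re

def name_words(name):
    # Lowercase, treat punctuation as spaces, split into words.
    return re.sub(r"[^\w\s]", " ", name.lower()).split()

def fuzzy_alt1_key(name):
    # Alt 1: drop initials (one-letter words) and compare what is left.
    return " ".join(sorted(w for w in name_words(name) if len(w) > 1))

def fuzzy_alt2_match(a, b, min_len=2):
    # Alt 2: at least one full word is shared (min_len is an assumption);
    # the remaining words must agree on their first letters.
    wa, wb = name_words(a), name_words(b)
    for shared in set(wa) & set(wb):
        if len(shared) < min_len:
            continue  # a shared initial does not count as a matching word
        rest_a, rest_b = list(wa), list(wb)
        rest_a.remove(shared)
        rest_b.remove(shared)
        if sorted(w[0] for w in rest_a) == sorted(w[0] for w in rest_b):
            return True
    return False

print(fuzzy_alt1_key("J. Smith") == fuzzy_alt1_key("W. Smith"))  # True
print(fuzzy_alt2_match("W. Smith", "Wayne Smith"))               # True
print(fuzzy_alt2_match("W. Smith", "S. Smith"))                  # False
print(fuzzy_alt2_match("Sam Wayne", "Wayne Smith"))              # True
```

The last line is the Sam Wayne / Wayne Smith case: "wayne" is the shared word and the leftovers "sam" and "smith" both start with "s", so without giving meaning to the comma there is nothing to tell them apart.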

If I have this right, then I think I agree with you. Similar should be as described, which is very conservative.
Least fuzzy should be alt 2.
More fuzzy should be alt 1.

You might consider inserting soundex between least fuzzy and more fuzzy. It should work reasonably well, at least for names that are pronounced reasonably correctly in English.
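
For reference, a small sketch of classic American Soundex, written from the usual rules rather than taken from any library, so treat it as illustrative. How it would be applied to author names (surname only, or every word) is left open here.

```python
def soundex(word):
    """Sketch of American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    chars = [c for c in word.lower() if c.isalpha()]
    if not chars:
        return ""

    result = chars[0].upper()
    prev = codes.get(chars[0], "")  # the first letter's code also suppresses repeats
    for ch in chars[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = codes.get(ch, "")   # vowels (and anything uncoded) give ""
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]  # pad with zeros, truncate to four characters

print(soundex("Smith"), soundex("Smyth"))    # S530 S530
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Names with equal codes would count as a match at this level, which is why it only works well for names spelled roughly the way English pronounces them.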

Will try the plugin real-soon-now.