I thought fuzzy wuzzy was a bear, not a duck.
Quote:
Originally Posted by kiwidude
What I found however is that I think it is a bit too fuzzy/aggressive for a "similar" author search. Specifically what it does that goes across my personal desire for "similar" is that it removes initials. So for example "J. Smith" becomes "Smith" and would match with "W. Smith" in a duplicate search.
Which raises the question of how fuzzy wuzzy each algorithm gets.
So - my suggestion is that "similar authors" will use the same logic as get_author_tokens, but not strip initials. So that will be left with handling removing punctuation, different spacing and reversal of names like LN,FN to FN LN.
|
Sounds good, at least in theory. It is very conservative, which will be what is needed in many situations. Usage will tell, I suppose.
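To make sure I am reading the "similar" proposal correctly, here is a rough sketch of what I understand it to do. The function name similar_author_key is made up for illustration and this is not the plugin's or get_author_tokens' actual code; it only assumes the three steps described in the quote: drop punctuation, normalise spacing, and flip "LN, FN" to "FN LN", while keeping initials.

```python
import re


def similar_author_key(author):
    """Hypothetical 'similar' key: initials are kept, only punctuation,
    spacing and the LN,FN form are normalised."""
    # Assumption: a single comma means "LN, FN", so swap the two halves.
    parts = [p.strip() for p in author.split(",")]
    if len(parts) == 2:
        author = parts[1] + " " + parts[0]
    # Replace punctuation with spaces, lowercase, collapse whitespace.
    words = re.sub(r"[^\w\s]", " ", author.lower()).split()
    return " ".join(words)


print(similar_author_key("Smith, W."))   # w smith
print(similar_author_key("W.  Smith"))   # w smith
print(similar_author_key("J. Smith"))    # j smith
```

Under this reading "Smith, W." and "W. Smith" end up with the same key, while "J. Smith" stays distinct from "W. Smith", which is the conservative behaviour asked for.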
Quote:
So W. Smith / Wayne Smith / Smith, W. would all match under either fuzzy proposal. However W. Smith / S. Smith would not return as a match under the second.
Or perhaps you have different ideas for "similar" and "fuzzy". What are your thoughts?
|
I am a bit confused about which is which. I think you are saying:
Similar: same name ignoring punctuation and word order
Fuzzy, alt 1: Strip initials (one letter words?). Match what is left.
Fuzzy, alt 2: At least one word matches (how long must the word be?). The first letters of the other words must match. Note that using this algorithm, I think that Sam Wayne would match Wayne Smith. I don't see how you can avoid this, unless you start attaching great meaning to commas.
If I have this right, then I think I agree with you. Similar should be as described, which is very conservative.
Least fuzzy should be alt 2.
More fuzzy should be alt 1.
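For what it is worth, here is a minimal sketch of how I picture the two fuzzy alternatives. Everything in it is my own guess, not the plugin's code: the names fuzzy_alt1_key and fuzzy_alt2_match are hypothetical, the shared word in alt 2 is assumed to need more than one letter, and the leftover words are assumed to pair up by first letter regardless of order.

```python
import re


def _words(author):
    # Lowercase, turn punctuation into spaces, split on whitespace.
    return re.sub(r"[^\w\s]", " ", author.lower()).split()


def fuzzy_alt1_key(author):
    """Alt 1 (more fuzzy): drop one-letter words, compare what is left."""
    return tuple(sorted(w for w in _words(author) if len(w) > 1))


def fuzzy_alt2_match(a, b):
    """Alt 2 (least fuzzy): at least one full word in common, and the
    first letters of the remaining words must agree."""
    wa, wb = _words(a), _words(b)
    common = set(wa) & set(wb)
    # Assumption: the shared word must be longer than a bare initial.
    if not any(len(w) > 1 for w in common):
        return False
    rest_a = sorted(w[0] for w in wa if w not in common)
    rest_b = sorted(w[0] for w in wb if w not in common)
    return rest_a == rest_b


print(fuzzy_alt1_key("J. Smith") == fuzzy_alt1_key("W. Smith"))  # True
print(fuzzy_alt2_match("W. Smith", "Wayne Smith"))               # True
print(fuzzy_alt2_match("W. Smith", "S. Smith"))                  # False
print(fuzzy_alt2_match("Sam Wayne", "Wayne Smith"))              # True
```

The last line is the Sam Wayne / Wayne Smith case: because word order is ignored, "wayne" counts as the shared word and the leftover "sam" and "smith" agree on their first letter.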
You might consider inserting soundex between least fuzzy and more fuzzy. It should work reasonably well, at least for names that are pronounced reasonably correctly in English.
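In case it helps, this is roughly what I mean by soundex: a small sketch of the standard American Soundex coding, written from the usual rules rather than taken from calibre or the plugin. It only handles ASCII letters, and applying it to the surname alone would be a further assumption on my part.

```python
def soundex(name):
    """American Soundex: first letter plus three digits, zero padded."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    # Keep ASCII letters only; anything else is ignored in this sketch.
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


print(soundex("Smith"), soundex("Smyth"))  # S530 S530
print(soundex("Robert"))                   # R163
```

So Smith and Smyth would land in the same group, which is the sort of match that sits somewhere between the two fuzzy alternatives.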
Will try the plugin real-soon-now.