MobileRead Forums - View Single Post

kiwidude · 04-20-2011, 10:01 AM

Ok, so I have an implementation put together for supporting "author duplicate" (ignore title) searches. And it seems to actually work without having to completely start all over again, which is both surprising and gratifying.

My plan was to add the following algorithms:
- ignore title, similar author
- ignore title, fuzzy author

However, having implemented the first by reusing the same "similar author" logic that I am using for "similar title, similar author", I noticed some unexpected fuzziness.

Specifically, for my initial implementation of "similar author" in this plugin, just to get up and running, I decided to invoke Kovid's author simplifying algorithm used for metadata retrieval (the get_author_tokens() method of the Source class in ebooks/metadata/sources/base.py).

What I found, however, is that it is a bit too fuzzy/aggressive for a "similar" author search. What goes against my personal idea of "similar" is that it removes initials. So for example "J. Smith" becomes "Smith" and would match with "W. Smith" in a duplicate search.

Which raises the question of how fuzzy wuzzy each algorithm should go.

So my suggestion is that "similar authors" will use the same logic as get_author_tokens, but not strip initials. That leaves it handling the removal of punctuation, differences in spacing and the reversal of names like LN,FN to FN LN.
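To make the "similar" proposal concrete, here is a minimal sketch of what such a key function might look like. It is not calibre's get_author_tokens() and not the plugin's actual code; the name similar_author_key is hypothetical, and lowercasing the result is my own assumption on top of the three steps described above.

```python
import re

def similar_author_key(author):
    # Hypothetical helper for illustration only; this is NOT
    # calibre's get_author_tokens() and not the plugin's code.
    name = author.strip()
    # Reverse "LN, FN" to "FN LN" when there is exactly one comma.
    if name.count(',') == 1:
        last, first = (part.strip() for part in name.split(','))
        name = f'{first} {last}'
    # Replace punctuation with spaces so "J.R. Smith" splits into initials.
    name = re.sub(r'[^\w\s]', ' ', name)
    # Assumption: compare case-insensitively. Splitting and re-joining
    # collapses any spacing differences. Initials are kept as tokens.
    return ' '.join(name.lower().split())
```

Two authors would be treated as "similar" when their keys are equal, so "J. Smith" and "Smith, J." collapse together while "J. Smith" and "W. Smith" stay apart.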

Then the "fuzzy authors" algorithm would be left to be more aggressive. Either it could attempt to determine a "last name" and ignore everything else (and yes, I know there are lots of issues with determining the "last" name with Jr. etc., but we could attempt to cater for some common cases if wanted). Or, slightly more usefully, it could take the last name and prefix it with one initial, being either the first letter of the first name or the first initial, whatever is found.

So W. Smith / Wayne Smith / Smith, W. would all match under either fuzzy proposal. However W. Smith / S. Smith would not return as a match under the second.
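The two fuzzy proposals can be sketched the same way. Again this is only an illustration under my own assumptions: the function names are hypothetical, the "last name" is naively taken to be the final token (so suffixes like Jr. are not handled), and case is ignored.

```python
import re

def _author_tokens(author):
    # Hypothetical helper: reverse "LN, FN", replace punctuation
    # with spaces, lowercase, and split on whitespace.
    name = author.strip()
    if name.count(',') == 1:
        last, first = (part.strip() for part in name.split(','))
        name = f'{first} {last}'
    return re.sub(r'[^\w\s]', ' ', name).lower().split()

def fuzzy_last_name_key(author):
    # First proposal: keep only the (naively chosen) last name.
    tokens = _author_tokens(author)
    return tokens[-1] if tokens else ''

def fuzzy_initial_last_key(author):
    # Second proposal: one leading initial plus the last name.
    tokens = _author_tokens(author)
    if not tokens:
        return ''
    if len(tokens) == 1:
        return tokens[0]
    # First letter of the first token, whether that token is a
    # full first name or already an initial.
    return tokens[0][0] + ' ' + tokens[-1]
```

With the naive last-token rule, "John Smith Jr." would key on "jr", which is exactly the sort of common case that would need special handling.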

Or perhaps you have different ideas for "similar" and "fuzzy". What are your thoughts?

The attached plugin version has no changes to the "similar" logic so you can see for yourself. Other changes I made to support ignore title logic:

- A new menu item of Show author exemptions (so you can see author exemptions or book exemptions)
- Manage exemptions dialog displays any author exemptions for the selected book
- Choosing an author based search will display/expand the authors node in the tag browser (let me know if you like that and/or want it to actually highlight the author under consideration)
- Removing an exemption using the right-click menu removes all book or author exemptions found for that selection. It seemed simpler from a user perspective than two different menu items, and you can see what it is removing in the details window
- Various other internal tweaks to support all the changes

As per usual, I may have accidentally introduced some new quirks with this version, but I really wanted to get something out there for feedback, so your patience and understanding are appreciated.


#116 · v 0.5 Beta - How fuzzy wuzzy wuzza duck?

Last edited by kiwidude; 04-25-2011 at 02:16 PM. Reason: Removed attachment as later version in this thread