Duplicate detection plugin - Page 8

chaley · 04-19-2011, 07:19 AM

Quote:

Originally Posted by kiwidude

I would like to rip all my filth out and instead directly hook into the triggered signal of the clear search button action. You have any objections/thoughts on that? I should have pulled the pin on my current hacks and proposed this days ago, but I was playing whack-a-mole with the event triggering instead of a fresh perspective.

That seems fine to me.

You might also want to hook into gui.search_restriction.currentIndexChanged(int), so you will know if the user changes the restriction.

Quote:

So to make this plugin more complete/useful imho we *need* an ignore title based search.

I fully agree with this.

Quote:

I've only started last night thinking through all the implications and how it would fit. For instance when you are reviewing groups of authors, you are not going to want the "show all duplicates/highlight mode" option - instead it will be one group at a time and then the tag browser to filter within that group as you like or rename authors etc. So the Find duplicates dialog either needs a different dialog/menu option, or rearranging so that the options of how to view the results is either disabled or made a suboption of book based searches.

Well...

Doesn't this depend on my mental model at the time? If I am really looking for duplicate books and choose 'only author', then I might want to use groups and highlighting. Yes, this is a very fuzzy search, but that is what I asked for.

However, if my mental model is 'looking for author problems', then yes, I want to use group-at-a-time mode, and play with the tag browser etc. I also might want to use group-a-a-t mode if I am checking other metadata such as series or tagging issues.

What is wrong with adding an 'ignore title, fuzzy author' combination and leaving it up to me to choose how I want to see them? You might want to change the default to g-a-a-t, but I don't think you should prevent me from using highlighting.

Quote:

However I am convinced we do need ignore title searches, and if I have to rewrite the way I have done the code so far to support them then better to do that now and get it sorted while it is fresh in my mind than down the track imho.

Oh yes, do get it right before you have users.

kiwidude · 04-19-2011, 07:40 AM

Quote:

Originally Posted by chaley

You might also want to hook into gui.search_restriction.currentIndexChanged(int), so you will know if the user changes the restriction.

Yeah I had wondered about that. Won't that signal also get fired when I change the restrictions, so I will still need to unhook/rehook around when my plugin makes changes to those right? At least that is isolated and controllable rather than the clear signal I was attempting to use.

Quote:

What is wrong with adding an 'ignore title, fuzzy author' combination and leaving it up to me to choose how I want to see them? You might want to change the default to g-a-a-t, but I don't think you should prevent me from using highlighting...

I see your argument and once again a good point. That's what I get for thinking out loud rather than finishing my thoughts offline first. You are correct, if the user has the desire to see potentially "lots of books" on screen at once then why not. It may work a lot faster for them to review all of the authors that came back when looking at the tag browser this way and then focus in on their "most likely" suspects. Then when they are happy they can just do the "Mark all groups as exempt" and the author exemptions get created, job done. Or they can go through the highlighted groups one by one, they just won't be able to see that particular set of authors isolated in the tag browser. As you say, user's choice...

chaley · 04-19-2011, 08:03 AM

Quote:

Originally Posted by kiwidude

Yeah I had wondered about that. Won't that signal also get fired when I change the restrictions, so I will still need to unhook/rehook around when my plugin makes changes to those right? At least that is isolated and controllable rather than the clear signal I was attempting to use.

Yes, you will see the signal. I can't deactivate it, because that would affect all the other listeners.

After some thought, I think you would be better served to hook into activated instead of currentIndexChanged. If the user chooses index 1 (current search), activated will fire and the restriction will change to the current search. However, currentIndexChanged will not fire, because the index didn't change.

I don't know if you should unhook/rehook, or if you should have an 'ignore' flag. By the latter I mean using a double-signal arrangement, something like:

Code:

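from PyQt4.Qt import QObject, pyqtSignal

class RestrictionListener(QObject):  # class name is only illustrative
    # Private signal, emitted only when the change did not come from the plugin
    restriction_changed = pyqtSignal()

    def __init__(self, gui):
        QObject.__init__(self)
        self.ignoring_signals = False
        gui.search_restriction.activated[int].connect(self.do_search_restriction_activated)
        self.restriction_changed.connect(self.do_restriction_changed)

    def do_search_restriction_activated(self, idx):
        if not self.ignoring_signals:
            self.restriction_changed.emit()

    def do_restriction_changed(self):
        pass  # do what you need to do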

Then in your code, when you mess with restrictions you set self.ignoring_signals = True, do what you need to do, then set it to False. As your code uses the private signal, it will see it only when it should. You can connect multiple things to it, without worrying about adding the test to each connected method.

FWIW
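One way to make the "set the flag, do the work, clear the flag" step harder to get wrong is a context manager, so the flag is restored even if the work raises. This is only a minimal sketch with no Qt dependency: `RestrictionGuard`, `ignoring` and `on_activated` are hypothetical names, and the plain callback list stands in for the private signal.

```python
from contextlib import contextmanager


class RestrictionGuard(object):
    """Hypothetical stand-in for the plugin object; not calibre API."""

    def __init__(self):
        self.ignoring_signals = False
        self.listeners = []  # plays the role of the private restriction_changed signal

    @contextmanager
    def ignoring(self):
        # Suppress forwarding while the plugin itself changes the restriction.
        previous = self.ignoring_signals
        self.ignoring_signals = True
        try:
            yield
        finally:
            # Restored even if the body raises.
            self.ignoring_signals = previous

    def on_activated(self, idx):
        # In the plugin this would be connected to the dropdown's activated signal.
        if not self.ignoring_signals:
            for listener in self.listeners:
                listener()


guard = RestrictionGuard()
seen = []
guard.listeners.append(lambda: seen.append('changed'))

with guard.ignoring():
    guard.on_activated(1)  # plugin-driven change: swallowed
guard.on_activated(1)      # user-driven change: forwarded
```

Only the second call reaches the listener, and everything connected to the private signal is covered by the single test in `on_activated`.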

kiwidude · 04-19-2011, 08:20 AM

Quote:

Originally Posted by chaley

After some thought, I think you would be better served to hook into activated instead of currentIndexChanged. If the user chooses index 1 (current search), activated will fire and the restriction will change to the current search. However, currentIndexChanged will not fire, because the index didn't change.

Yeah I confess to finding that behaviour in search restrictions a bit quirky.

Say I do a search for "dragon", then choose "*Current search" in restriction dropdown.

In the dropdown my only options now are blank and "dragon" (plus any other saved searches obviously). The *Current search option has been replaced with dragon.

Now in the restriction dropdown if I choose "dragon", in actual fact that clears the restrictions and puts "*Current search" in its place.

An alternative suggestion (since it is only fair if I say something I found unexpected is to suggest an alternative, however crap it may be) is:
(a) *Current search always stayed there
(b) a new entry of *dragon appeared below *Current search

So if a user chooses *dragon again in the dropdown, nothing happens.
If a user has typed another search and they want to immediately apply it, they can choose *Current search. They don't have to first drop the restriction and then retype the search (as will happen if they forgot the restriction was on).

My final comment - could the tooltip for the restriction dropdown show the full text of the current search when you have one selected? It gets shortened to illegibility in the non-resizable dropdown.

Just my 2p. I don't really care that much, honest

Quote:

I don't know if you should unhook/rehook, or if you should have an 'ignore' flag. By the latter I mean using a double-signal arrangement, something like:

Yes, using a variable would work. Though I still need a disconnect method (I think) to invoke, so that the object can be garbage collected when I create a new instance after the user switches libraries. And obviously some connect code when the object gets created. So we'll see what I end up with.

chaley · 04-19-2011, 09:34 AM

Quote:

Originally Posted by kiwidude

An alternative suggestion (since it is only fair if I say something I found unexpected is to suggest an alternative, however crap it may be) is:
(a) *Current search always stayed there
(b) a new entry of *dragon appeared below *Current search

Done, with the addition that the current search (the new line) is remembered. You can select some other search, then come back and select the 'new entry' again.

I don't see a need to clear that search if I select *current search with an empty search box. Instead I select index 0, which clears the restriction.

Quote:

My final comment - could the tooltip for the restriction dropdown show the full text of the current search when you have one selected? It gets shortened to illegibility in the non-resizable dropdown.

Done

Quote:

Yes, using a variable would work. Though I still need a disconnect method (I think) to invoke, so that the object can be garbage collected when I create a new instance after the user switches libraries. And obviously some connect code when the object gets created. So we'll see what I end up with.

Yes, you certainly do need to disconnect if you abandon the instance. Alternatively, you can do the connection in the base plugin class that I am sure you don't abandon, then have it call whatever you want in the current instance.
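The alternative in that last sentence can be sketched without Qt. The long-lived object owns the one real connection and forwards to whichever per-library instance is current, so the abandoned instance never holds a connection that needs tearing down. All names here (`BasePlugin`, `LibraryView`, `library_changed`) are hypothetical, not the plugin's actual classes.

```python
class LibraryView(object):
    """Hypothetical per-library object that is recreated on a library switch."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def restriction_changed(self, idx):
        self.events.append(idx)


class BasePlugin(object):
    """Hypothetical long-lived object; it owns the single real connection."""

    def __init__(self):
        self.current = None

    def library_changed(self, name):
        # Swapping the reference is enough: the old instance was never
        # connected to the signal itself, so nothing keeps it alive.
        self.current = LibraryView(name)

    def on_restriction_activated(self, idx):
        # In the plugin this would be connected once to the dropdown's signal.
        if self.current is not None:
            self.current.restriction_changed(idx)


plugin = BasePlugin()
plugin.library_changed('fiction')
old = plugin.current
plugin.on_restriction_activated(2)
plugin.library_changed('reference')
plugin.on_restriction_activated(0)
```

After the switch, the old instance stops receiving events without any explicit disconnect call.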

Starson17 · 04-19-2011, 09:59 AM

Quote:

Originally Posted by kiwidude

Your example if I understand it correctly is as the result of a metadata data entry error, as the book has been given the wrong author.

My example was supposed to be where a father and son jointly authored a book, and I had two different formats of the same book, each listing only one of the two authors. So the authors were correct, but not all joint authors were listed. The book was the same, but the authors didn't match.

I have lots of similar cases. In my pre-Calibre days, when I relied on filename/folder indexing, I named my ebook files with only a single author, and I made multiple copies of each book, when I had multiple authors.
That way I could find the book under each author. When I imported those books into Calibre, each multiple-author book came in multiple times with single authors. I've been slowly merging them and fixing the authorship.

Quote:

...
So I think it is safer to not apply the author exclusion list to book searches and let the user make book based exemptions instead.

Agreed (It looks like you are way beyond this by now)

kiwidude · 04-19-2011, 10:08 AM

That all sounds great thanks. One more related question. Is there a robust way for me to know if the user has a custom versus a saved search selected at the time they find duplicates? As I have to call a different function to restore the value so I need to know which it is. My hack I use currently will break with your change.

Excellent point on the alternative place to hook the event from thanks.

chaley · 04-19-2011, 10:33 AM

Quote:

Originally Posted by kiwidude

That all sounds great thanks. One more related question. Is there a robust way for me to know if the user has a custom versus a saved search selected at the time they find duplicates? As I have to call a different function to restore the value so I need to know which it is. My hack I use currently will break with your change.

If the index is 2 and the first char is a *, it is a custom search. Be sure to remove the star before you restore the text.

Are you going to restore the search even if it wasn't active? My feeling is that you don't need to. In fact, after some reflection I wonder if you need to restore the restriction at all. The user can easily do that manually if desired.
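The rule above (index 2 plus a leading star) is small enough to write down directly. This is a sketch of that rule only, taking the dropdown's current index and text as plain arguments; the function names are made up for illustration and it assumes the dropdown layout described in this thread.

```python
def is_custom_search(index, text):
    # Custom search: index 2 and the text starts with '*'.
    return index == 2 and text.startswith('*')


def restorable_text(index, text):
    # Drop the star before restoring a custom search; saved searches are left alone.
    if is_custom_search(index, text):
        return text[1:]
    return text
```

For example, `restorable_text(2, '*dragon')` gives `'dragon'`, while a saved search at another index comes back unchanged.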

kiwidude · 04-19-2011, 10:42 AM

Typing on my phone, so apologies if my wording confused. I never restore a search, but I do restore the restriction, which might have been from a custom search. Your suggested code is what I would have done, but I wanted to make sure I wasn't missing something. It would be too weird for the clear search button to actually result in a search.

kiwidude · 04-19-2011, 12:44 PM

v0.4.2 Beta

This hopefully fixes two issues without introducing new ones:

- Clicking in the tag browser exits duplicates mode
- Seeing a 'None' message when the last result is merged

The first item is as per previous posts - you can either manually change a search restriction or click the clear button in the gui to exit search mode. I've made this change in such a way (temporarily) that it should work for people both running 0.7.56 and running from source with Charles's changes to the restrictions dropdown today.

kiwidude · 04-20-2011, 10:01 AM

v 0.5 Beta - How fuzzy wuzzy wuzza duck?

Ok, so I have an implementation put together for supporting "author duplicate" (ignore title) searches. And it seems to actually work without having to completely start all over again, which is both surprising and gratifying.

My plan was to add the following algorithms:
- ignore title, similar author
- ignore title, fuzzy author

However, having implemented the first to reuse the same "similar author" logic that I am using for "similar title, similar author", I noticed some unexpected fuzziness.

Specifically, for my initial implementation of "similar author" for this plugin to get up and running, I decided just to invoke Kovid's author simplifying algorithm used for metadata retrieval (get_author_tokens() in the Source class in ebooks/metadata/sources/base.py).

What I found, however, is that I think it is a bit too fuzzy/aggressive for a "similar" author search. Specifically, what it does that goes against my personal desire for "similar" is that it removes initials. So for example "J. Smith" becomes "Smith" and would match with "W. Smith" in a duplicate search.

Which brings the question of how fuzzy wuzzy does each algorithm go?

So - my suggestion is that "similar authors" will use the same logic as get_author_tokens, but not strip initials. So that will be left with handling removing punctuation, different spacing and reversal of names like LN,FN to FN LN.

Then the "fuzzy authors" algorithm, would be left to be more aggressive. Either it could attempt to determine a "last name" and ignore everything else (and yes I know there are lots of issues with determining the "last" name with Jr. etc but we could if wanted attempt to cater for some common cases). Or slightly more usefully it could take the last name and prefix it with one initial, being either the first letter of the first name or first initial, whatever is found.

So W. Smith / Wayne Smith / Smith, W. would all match under either fuzzy proposal. However W. Smith / S. Smith would not return as a match under the second.

Or perhaps you have different ideas for "similar" and "fuzzy". What are your thoughts?

The attached plugin version has no changes to the "similar" logic so you can see for yourself. Other changes I made to support ignore title logic:

- A new menu item of Show author exemptions (so you can see author exemptions or book exemptions)
- Manage exemptions dialog displays any author exemptions for the selected book
- Choosing an author based search will display/expand the authors node in the tag browser (let me know if you like that and/or want it to actually highlight the author under consideration)
- Removing an exemption using the right-click menu removes all book or author exemptions found for that selection. It seemed simpler from a user perspective than two different menu items, and you can see what it is removing in the details window
- Various other internal tweaks to support all the changes

As per usual I may have accidentally introduced some new quirks with this version, but I really wanted to get something out there for feedback so your patience and understanding is appreciated.

chaley · 04-20-2011, 10:26 AM

I thought fuzzy wuzzy was a bear, not a duck.

Quote:

Originally Posted by kiwidude

What I found, however, is that I think it is a bit too fuzzy/aggressive for a "similar" author search. Specifically, what it does that goes against my personal desire for "similar" is that it removes initials. So for example "J. Smith" becomes "Smith" and would match with "W. Smith" in a duplicate search.

Which brings the question of how fuzzy wuzzy does each algorithm go?

So - my suggestion is that "similar authors" will use the same logic as get_author_tokens, but not strip initials. So that will be left with handling removing punctuation, different spacing and reversal of names like LN,FN to FN LN.

Sounds good, at least in theory. It is very conservative, which will be what is needed in many situations. Usage will tell, I suppose.

Quote:

So W. Smith / Wayne Smith / Smith, W. would all match under either fuzzy proposal. However W. Smith / S. Smith would not return as a match under the second.

Or perhaps you have different ideas for "similar" and "fuzzy". What are your thoughts?

I am a bit confused about which is which. I think you are saying:
Similar: same name ignoring punctuation and word order
Fuzzy, alt 1: Strip initials (one letter words?). Match what is left.
Fuzzy, alt 2: At least one word matches (how long must the word be?) The first letters of other words must match. Note that using this algorithm, I think that Sam Wayne would match Wayne Smith. I don't see how you can avoid this, unless you start attaching great meaning to commas.

If I have this right, then I think I agree with you. Similar should be as described, which is very conservative.
Least fuzzy should be alt 2.
More fuzzy should be alt 1.

You might consider inserting soundex between least fuzzy and more fuzzy. It should work reasonably well, at least for names that are pronounced reasonably correctly in English.

Will try the plugin real-soon-now.
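For reference, the soundex idea can be sketched as below. This is a plain implementation of classic American Soundex written for illustration, not code taken from calibre or the plugin; it would presumably be applied to the word chosen as the last name.

```python
def soundex(name):
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for group, digit in (('BFPV', '1'), ('CGJKQSXZ', '2'), ('DT', '3'),
                         ('L', '4'), ('MN', '5'), ('R', '6')):
        for ch in group:
            codes[ch] = digit
    letters = [c for c in name.upper() if 'A' <= c <= 'Z']
    if not letters:
        return ''
    result = letters[0]
    last = codes.get(letters[0], '')
    for ch in letters[1:]:
        if ch in 'HW':
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, '')
        if code and code != last:
            result += code
        last = code  # a vowel resets this, so a repeated code after a vowel counts again
    return (result + '000')[:4]
```

With this, "Smith" and "Smyth" produce the same key, which is the kind of match that sits between the two fuzzy alternatives above.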

kiwidude · 04-20-2011, 10:38 AM

When you go this fuzzy, bear == duck.

Sorry for the confusion, I'm just throwing ideas out there. Yes, I am saying that Similar should be the conservative approach of only removing punctuation and looking at word order.

As to how many and what form the "fuzzy" algorithms take I welcome all input. Even better, write me and post a function for any proposal. One that takes an author name (well actually a list but we only consider the first author), and returns a string result representing the fuzzied result.

My brain hurts a bit right now from twisting it through the permutations of author and book searches over the last few days (and goodreads metadata before that), so undoubtedly others will have better coding suggestions than I can conjure up in my current state.
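To make the proposals concrete, here is one possible reading of them in the requested shape: functions that take the author list, look only at the first author, and return a string key. It is a rough sketch rather than the plugin's code. `similar_author` keeps initials and only normalises punctuation, spacing, case and "LN, FN" order; `fuzzy_author` is the "one initial plus last name" variant. Both function names are made up, and suffixes such as "Jr." are deliberately not handled.

```python
import re


def _first_author_words(authors):
    # Only the first author is considered, as described in the post.
    name = authors[0] if authors else ''
    if ',' in name:
        # 'LN, FN' -> 'FN LN'
        last, _, first = name.partition(',')
        name = first + ' ' + last
    # Replace punctuation with spaces, lowercase, split on whitespace.
    return re.sub(r'[^\w\s]', ' ', name.lower()).split()


def similar_author(authors):
    # Conservative: every word, including initials, stays in the key.
    return ' '.join(_first_author_words(authors))


def fuzzy_author(authors):
    # More aggressive: first letter of the first word plus the last word.
    words = _first_author_words(authors)
    if not words:
        return ''
    if len(words) == 1:
        return words[0]
    return words[0][0] + ' ' + words[-1]
```

Under this sketch W. Smith, Wayne Smith and Smith, W. share a fuzzy key, W. Smith and S. Smith do not, and the similar key still tells J. Smith from W. Smith.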

[Ahmed] · 04-23-2011, 03:25 AM

The current version still fails at detecting books like this:

Title
Title: Subtitle

Example:
Brian Greene - The Hidden Reality
Brian Greene - The Hidden Reality: Parallel Universes and the Deep Laws of the Cosmos

I've used every option and still find manual duplicates.

ldolse · 04-23-2011, 04:14 AM

Quote:

Originally Posted by [Ahmed]

The current version still fails at detecting books like this:

Title
Title: Subtitle

Example:
Brian Greene - The Hidden Reality
Brian Greene - The Hidden Reality: Parallel Universes and the Deep Laws of the Cosmos

I've used every option and still find manual duplicates.

Not sure if the plugin is using the tokenize author/title functions in metadata.sources.base, but if it is, I added an option to the tokenize title function to strip subtitles. It basically strips everything after a colon/slash, or inside of various kinds of brackets/parentheses. It does do a sanity check to make sure a title will still exist before removing the extra info.
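The stripping described here could look roughly like the sketch below. It is an illustration of the described behaviour (drop bracketed text, drop everything after a colon or slash, keep the original if nothing would remain), not calibre's actual tokenize function; `strip_subtitle` is a made-up name.

```python
import re


def strip_subtitle(title):
    # Drop text inside (), [] or {}, then anything after the first ':' or '/'.
    stripped = re.sub(r'\([^)]*\)|\[[^\]]*\]|\{[^}]*\}', ' ', title)
    stripped = re.split(r'[:/]', stripped, maxsplit=1)[0]
    stripped = ' '.join(stripped.split())
    # Sanity check: fall back to the original if nothing is left.
    return stripped if stripped else title.strip()
```

Applied to the example above, both "The Hidden Reality" and "The Hidden Reality: Parallel Universes and the Deep Laws of the Cosmos" reduce to the same string, so a title comparison on the stripped form would pair them.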

