#166
Wizard
Playing around with this now - the ignore title rocks! Haven't played with the tag browser enough to comment on that. I agree with Chaley that this is more or less ready to release as is.
Soundex was really helpful too - I'm not sure if letting users tweak the fuzziness would help much, unless you're talking about making it less fuzzy. While it's quite useful for finding issues the other algorithms miss, it does have a higher number of false positives.

I noticed you mentioned an issue with non-ASCII in Soundex earlier - there is already a function in Calibre to convert a non-ASCII character to its ASCII equivalent - are you using this already? I noticed Soundex caught China Miéville vs. China Mieville while the other algorithms missed this. Thinking out loud, doing this ASCII downgrade for comparison purposes any time you detect non-ASCII could be useful.

This might be an advanced/too-specialized option, but I keep multiple versions of book records, while generally only one record is 'published' to OPDS/externally accessible library instances, etc. I do this by adding a 'Nopub' tag to the ones I don't want published. I'd rather do this than merge book records and risk having a faulty version overwrite a good version during conversions/merges, etc. The faulty versions I keep around for conversion testing, or just because I haven't gotten around to fully comparing the editions. Anyway, the thought behind the request is to automatically exempt sets of dupes where all but one in the set have some specific/configurable tag.

Other feedback:
edit: the non-ASCII to ASCII equivalent function is get_udc().decode() from calibre.utils.localization.

Last edited by ldolse; 04-26-2011 at 02:18 PM.
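As a rough illustration of the ASCII-downgrade idea - a minimal sketch assuming Calibre's get_udc() helper; the names_match() comparison hook is purely illustrative, not the plugin's actual code:

```python
from calibre.utils.localization import get_udc

def ascii_downgrade(text):
    # Convert non-ASCII characters to their closest ASCII equivalents,
    # so 'China Miéville' and 'China Mieville' normalise to the same string.
    if not text:
        return text
    try:
        return get_udc().decode(text)
    except Exception:
        # If the decoder fails for any reason, fall back to the original text
        return text

def names_match(a, b):
    # Hypothetical comparison hook: downgrade both sides before whatever
    # fuzzy/soundex comparison is run on them.
    return ascii_downgrade(a).lower() == ascii_downgrade(b).lower()
```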
#167
Calibre Plugins Developer
Hi ldolse,

Thanks for all the feedback.
I decided, as per the screenshot, I may as well offer "tweaking" of the soundex fuzziness. As I said, the values I picked were purely arbitrary, so others can, if they want, do multiple passes with smaller or larger values.

Thanks for the heads-up on the decode function. Actually, right now I don't do any conversion of those values - I just ignore them! So it was purely by coincidence (i and e have the same soundex value and were following each other) that it worked for you in this case. However, your suggestion to use this for all comparisons is, I think, an inspired idea - not on the "identical" matches, but on every other kind of match it makes sense. Obviously I don't know the downside, if any, to that function - is there any reason not to use it everywhere other than performance? Which isn't really much of a concern imho versus the improved matching.

In terms of exempting your Nopub tags - you can do this already by setting a restriction at the time you do your duplicate search. Duplicate search respects any restrictions applied, so just put "not tags:Nopub" as a restriction and you should be good to go.

For the keyboard shortcut - you can configure one yourself in the plugin configuration screen for that function. As promised earlier in this thread, I am going to remove the default I currently have in there for next/previous match. If someone has a brilliant suggestion for next/previous that doesn't conflict with existing calibre shortcuts, I am all ears. The best I have seen suggested was Ctrl+N (with maybe Ctrl+Shift+N for previous). If people want that then I will put it in, otherwise you can just configure your own.

The "last set" wonkiness sounds very odd and I haven't been able to replicate it. Was that showing one group at a time or all at once? You said finishing "all/most" - which was it? When did it go wonky - how many sets were remaining?
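Coming back to the soundex length tweak: a simplified sketch of how a configurable code length acts as the fuzziness knob. This is the textbook algorithm, not necessarily the plugin's exact implementation - a longer code is stricter, a shorter one fuzzier.

```python
def soundex(name, length=4):
    # Textbook Soundex: keep the first letter, encode the rest as digits,
    # then truncate/pad to 'length'. Longer codes mean stricter matching.
    codes = {}
    for letters, digit in (('BFPV', '1'), ('CGJKQSXZ', '2'), ('DT', '3'),
                           ('L', '4'), ('MN', '5'), ('R', '6')):
        for letter in letters:
            codes[letter] = digit
    name = ''.join(c for c in name.upper() if c.isalpha())
    if not name:
        return ''
    result = name[0]
    last = codes.get(name[0], '')
    for c in name[1:]:
        digit = codes.get(c, '')
        if digit and digit != last:
            result += digit
        if c not in 'HW':   # H and W do not separate repeated codes
            last = digit
    return result[:length].ljust(length, '0')

# Both 'Miéville' and 'Mieville' happen to reduce to 'M140' here, because the
# uncoded accented character is simply skipped - the coincidence described above.
print(soundex('Miéville'), soundex('Mieville'))
```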
#168
Wizard
Thanks for the other tips - glad that's all already trivial to do.
#169
Calibre Plugins Developer
It is quite a hit relatively speaking, but in real-world terms not enough to stop me going with it for the 1.0 release in the plugins forum. Scanning a 40,000-book library went from 0.7 seconds to just on 1 second.

If you are able to figure out some repro steps for the other issue, that would be great. No one else has reported anything like it, and I can't visualise the possible code issue from your description, so anything more concrete would help.
#170
Addict
This is great.
1.0 runs in less than a second, while up until 0.5 it took 15 seconds (leaving out my mass-dups). Thanks!!!
#171
Wizard
@kiwidude

For your consideration: your reply to the guy who wanted book content matching made me think of another guy who wanted to find binary duplicate book records.

My personal opinion is that there aren't many cases where binary duplicates won't have matching metadata that can be found with this plugin. Either this plugin will find the matching metadata, or the books won't be binary dupes due to internal metadata changes. Binary duplicate finding programs are also available to do this outside Calibre. I mention it only because (unlike content matching) it could be implemented with a reasonably fast routine, and there are some situations where it might prove useful. The case I recall was the guy who wanted this after adding lots of books that all ended up as individual book records for each format, all with Unknown author and Unknown title.
#172
Calibre Plugins Developer
@Starson - yes that's a good point we haven't discussed for a while.
Of course the downside, for people who haven't done this before they add books to their library, is how to actually resolve the problem if they then run an external tool - if they do a direct delete out of the calibre folders they end up with orphaned Calibre database records. I agree with you in that there should be no reason under "normal" circumstances it would be necessary. That "unknown author, unknown title" situation however sounds rather horrific!

I wonder what the best way to integrate that into the dialog would be? Perhaps there should be a new set of radio buttons across the top: "Title/Author", "ISBN", "Binary compare"? Would an appropriate implementation be to use the hashlib sha256 stuff, or something else?
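If the hashlib route were taken, a minimal sketch might look like the following. The file paths would presumably come from the library database; the function names here are illustrative only, not the plugin's code:

```python
import os
import hashlib
from collections import defaultdict

def format_hash(path, chunk_size=1024 * 1024):
    # Hash a format file in chunks so large books never sit fully in memory.
    sha = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            sha.update(chunk)
    return sha.hexdigest()

def group_binary_duplicates(paths):
    # Cheap pre-filter on file size: only files with an identical size can
    # possibly be binary duplicates, so only those get hashed.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) > 1:
            for path in same_size:
                by_hash[format_hash(path)].append(path)
    return [group for group in by_hash.values() if len(group) > 1]
```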
#173
Wizard
It seems to be a pretty rare problem, but it might be useful.
#174
Calibre Plugins Developer
Right, so there is a bit of a flaw it seems in the approach to storing the exemptions. It needs some kind of optimisation (or else a threshold limit above which users are not allowed to exempt a group).
At the moment, if you exempt a group it creates a dictionary entry for every book id in the exemption group, listing every other book in that group as its exemption pair matches. So if you exempt a group of (1,2,3), you end up with:

{ 1: (2,3), 2: (1,3), 3: (1,2) }

Now that works really well when the groups are small - it is simple and flexible to maintain. However, what do you do when someone does a fuzzy search, comes back with hundreds if not thousands of titles in the group, and then does mark as exempt on it? I can tell you what Calibre does - memory climbs very rapidly and then it crashes.

So - any suggestions as to what we do? The easy way out would be to put an exemption group size limit. After all, if you have hundreds or more books coming back in a group it should be a trivial thing to skip past that group. The alternative - I really hate to think about how much more complicated it will make the code, as the user is allowed to remove any one or more of those individual pairings.

Any thoughts?
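For reference, a minimal sketch of the per-pair exemption map as described above (illustrative only), which makes the quadratic growth easy to see:

```python
def build_exemption_map(group):
    # Every book id in the exempted group maps to every other id in the group.
    return {book_id: set(group) - {book_id} for book_id in group}

print(build_exemption_map([1, 2, 3]))
# {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
# A group of N books stores N * (N - 1) ids: 6 for a 3-book group, but
# roughly 4,000,000 for a 2,000-book group - hence the memory blow-up.
```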
#175
Grand Sorcerer
The size problem would still arise when building the exemption map. If the user selects all the books in a 40,000-book library and then says that they are exempt, things get dicey. The maximum size for a 40,000-book library would be 40,000*40,000, or 1,600,000,000 integers, which is too large to be practical.

You can avoid this sort of silliness by limiting the size of any exemption group to something like 1,000 books. Even that number is silly-large; perhaps 100 is sufficient. Another possibility is to allow exemption processing only when using the more conservative algorithms - the groups would by definition be smaller.

You might consider storing the exemption groups themselves instead of pairs. In this case, removing exemptions would be done pair-wise. If I have (10,15), I would remove 15 from 10 and 10 from 15. If the user asks to remove exemptions for (10, 11, 12), then it would be 10-(11, 12), 11-(10, 12), 12-(10, 11).

Random thoughts.
#176
Calibre Plugins Developer
Hey Charles,
I've been avoiding spending much time thinking through how this would work, but I'm afraid some of your random thoughts have me confused. And as your thoughts frequently lead to, or are based on, very good ideas, I would like to unconfuse myself.

My initial thought was also that you would have to store each exemption group as a list of its members, which I assume you mean the same way. So your persistence of groups would be a list of lists like:

[(1,2,3),(10,11,12,13,14,15),(1,5,6),...]

You say that this is "compatible" with the partitioning algorithm. I don't see how, without some surgery on it, unless you turn the above into an exemption map first? If that is the case, then I think we get back to the existing issue of the permutations, which only a threshold is going to resolve. Or were you thinking there is some way of doing this without an exemption map?

The last paragraph you lost me on again. Are you saying we would store both the exemptions as groups and some pairs indicating any removals from the groups? Or that you partition the exemption groups with a removal? Or something else?
#177
Grand Sorcerer
I am not convinced that the non_dup map memory is a problem in normal usage, but it does get problematic when people do silly things. The space problem you mention arises because the current non_dup map is a cartesian product of all the exemption groups, which has excellent performance properties but perhaps less than ideal space properties.

In our case the problem might be manageable. An exemption group of 200 elements will generate 200 sets of 200 elements, or 40,000 items. 1,000 elements will generate 1,000 sets, or 1,000,000 elements. Neither of these is outrageously large. To verify, I measured memory use of my test code: an exemption group of 1,000 elements used 21MB of memory. Unfortunately, if someone builds an exemption group of 5,000 elements, things start getting dicey. That number requires 700MB of memory. 8,000 elements requires 1.1GB. What this shows is that for a reasonable number of exemptions, the memory usage isn't out of line. A 'reasonable number' seems to be around 2,000 unique elements in the union of the exemption groups.

If you are uncomfortable with the limit and the non-graceful degradation, then another implementation suggests itself. This is the surgery you mentioned. Instead of initially building a cartesian product, we build a list of sets. When we need a non_dup map entry, we do the union of the sets. This alternate implementation trades off performance for memory stability. The amount of memory required is only that needed to store the exemption groups. The performance penalty is doing the union when required instead of once in the beginning.

Considering the performance penalty: we are again faced with the 'normal usage' issue. This implementation would do the equivalent of building the exemption map per duplicate group instead of once. If there is one duplicate group, then there is no penalty. If there are 1,000 duplicate groups, then we pay the cost of building the non_dup map 1,000 times. Thus the question: will we in fact encounter situations where we have both thousands of duplicate groups and exemption lists with thousands of elements? I hope not. In any event, this algorithm degrades more gracefully than the other one. One reason for this is that per duplicate group, the union is done only N times, where N is the number of elements in the duplicate group. If there are thousands of duplicate groups, they should be rather small, mitigating the performance problem.

Sample implementation under the spoiler. There aren't many changes. The not_duplicate_of_map is now a map to a list of sets. The union of these sets is computed when required, once per item in a duplicate group. You will see that the test has one exemption group of 100,000 elements.

Spoiler:
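(The spoiler's sample code isn't reproduced in this capture; as a rough sketch of the approach just described - a map to a list of sets, with the union computed only when a duplicate group needs it - something like the following, with illustrative names:)

```python
from collections import defaultdict

def build_non_dup_map(exemption_groups):
    # Each book id maps to the list of exemption groups (sets) containing it.
    # No cartesian product is expanded up front, so memory stays proportional
    # to the size of the stored exemption groups.
    non_dup_map = defaultdict(list)
    for group in exemption_groups:
        group = frozenset(group)
        for book_id in group:
            non_dup_map[book_id].append(group)
    return non_dup_map

def non_duplicates_of(non_dup_map, book_id):
    # Union the sets lazily, once per item in a duplicate group being checked.
    result = set()
    for group in non_dup_map.get(book_id, ()):
        result |= group
    result.discard(book_id)
    return result
```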
I think the last question is how to handle removal of exemptions. Not storing pairs complicates things a bit. I think that the following works. Assume that the user selects 'manage exemptions for this book'.

First problem: construct the exemption list for the book. This is done in the same way as duplicate partitioning: construct the non_dup map, then do the union of the sets for the master book.

Second problem: the user checks a book or two (or 1,000), asking that the non_dup exemption be removed for those books. What the user is really asking for is that the master book and the selected books never appear together in any exemption group. To satisfy this wish, you can:
- iterate over all the existing exemption groups, removing the master book. Exemption groups with length <= 1 are removed.
- create a new exemption group consisting of the master book and all unchecked books in the list.

This process works because what is initially displayed is the union of all the exemption groups containing the master book. After deleting the master book from all the exemption groups, that union is empty. Adding the new exemption group makes the union correct again.

One possible issue is that over time the number of exemption groups could grow. This is certainly true if the user selects a zillion books and marks them as exempt, then goes over that list over time to unmark them. Performance will degrade in this case, but I am not sure by how much.
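A sketch of that removal procedure in the same illustrative terms - exemption_groups is the stored list of sets, and unexempt_ids are the books the user ticked for removal:

```python
def remove_exemptions(exemption_groups, master, unexempt_ids):
    # Union of every group containing the master book, i.e. what the
    # 'manage exemptions' dialog would initially display.
    currently_exempt = set()
    for group in exemption_groups:
        if master in group:
            currently_exempt |= set(group)
    currently_exempt.discard(master)

    # 1. Drop the master book from every existing exemption group; groups
    #    reduced to a single member carry no information and are discarded.
    new_groups = [set(g) - {master} for g in exemption_groups]
    new_groups = [g for g in new_groups if len(g) > 1]

    # 2. Re-add one group pairing the master with the books left unchecked,
    #    i.e. the ones that should remain exempt.
    still_exempt = currently_exempt - set(unexempt_ids)
    if still_exempt:
        new_groups.append(still_exempt | {master})
    return new_groups
```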
#178
Calibre Plugins Developer
Great stuff as usual, Charles. All your comments around the partitioning mirrored my thoughts, it seems. My confusion over the compatibility issue was purely around the need for it to have a map, which we agree on.
I guess the question is that memory vs performance tradeoff, and whether to just go ahead and build dup group maps on the fly. That is not the only place the exemption map is used, of course - I have, for instance, menu items enabled based on whether a book has exemptions, along with the show all exemptions stuff. So these would need to change to use the build-on-the-fly stuff too.

Your answer to the issue of removing from an exemption group is a clever one. I was thinking of partitioning the exemption groups, but yours is interesting and simpler. I am honestly sceptical about how frequently users would be removing subsets of a group - we gave them the ability, but in reality I suspect few would use it, and sparingly at best. What it is saying is that you screwed up and have decided that some authors or books are actually duplicates of each other. Well, if they are duplicates, you are going to end up renaming the author or merging the books, so they will disappear from the exemption lists when my house cleaning kicks in anyway. So in other words I am not too concerned with ongoing fragmentation of the exemption sets from that particular action.

Sounds like the beginnings of a plan. I will try some testing to see what the performance impact is if we build the map on the fly. I think 1,000 members of a group is a reasonable limit anyway? To get that many members you must have either a really crappy set of book metadata, like all the titles unknown, or be storing something like magazines where the majority of the name is identical. It is trivial for a user to exclude magazines or whatever by using a search restriction if they don't want them to appear in the results of a soundex or fuzzy search.
#179
Calibre Plugins Developer
@Starson - there will be one issue with doing the binary compare in calibre - how to tell the user which formats were identical if the two books have multiple overlapping formats. Is this really a concern? In general we are just showing the user that they have two duplicate book entries which they need to resolve. They just won't know, if say both books have an ePub and a Mobi, which formats are different and which are identical and don't need examining.

One option that comes to mind is an extra menu item of something like "Binary compare selected books", which would attempt to tell you which formats of which book are duplicates of each other. I'm trying to continue to avoid custom columns. Or we could just not worry about it and leave that functionality for a future super-duper merge plugin, which would do the binary compare again as part of its "smart" merge recommendation...
#180
Grand Sorcerer
Don't add the limit until we have an idea what the performance might be. I think it will be acceptable, even with large exemption sets.