#166
Wizard
Playing around with this now - the ignore title rocks! Haven't played with the tag browser enough to comment on that. I agree with Chaley that this is more or less ready to release as is.
Soundex was really helpful too - I'm not sure if letting users tweak the fuzziness would help much, unless you're talking about making it less fuzzy. While it's quite useful for finding issues the other algorithms miss, it does have a higher number of false positives.

I noticed you mentioned an issue with non-ASCII in Soundex earlier - there is already a function in Calibre to convert a non-ASCII character to its ASCII equivalent - are you using this already? I noticed Soundex caught China Miéville vs. China Mieville while the other algorithms missed this. Thinking out loud, doing this ASCII downgrade for comparison purposes any time you detect non-ASCII could be useful.

This might be an advanced/too-specialized option, but I keep multiple versions of book records, while generally only one record is 'published' to OPDS/externally accessible library instances, etc. I do this by adding a 'Nopub' tag to the ones I don't want published. I'd rather do this than merge book records and risk having a faulty version overwrite a good version during conversions/merges, etc. The faulty versions I keep around for conversion testing, or just because I haven't gotten around to fully comparing the editions. Anyway, the thought behind the request is to automatically exempt sets of dupes where all but one in the set have some specific/configurable tag.

Other feedback:
edit: the non-ASCII to ASCII equivalent function is get_udc().decode() from calibre.utils.localization.

Last edited by ldolse; 04-26-2011 at 02:18 PM.
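As a rough illustration of the ASCII-downgrade idea - a minimal sketch assuming Calibre's get_udc() helper; the names_match() comparison hook is purely illustrative, not the plugin's actual code:

```python
from calibre.utils.localization import get_udc

def ascii_downgrade(text):
    # Convert non-ASCII characters to their closest ASCII equivalents,
    # so 'China Miéville' and 'China Mieville' normalise to the same string.
    if not text:
        return text
    try:
        return get_udc().decode(text)
    except Exception:
        # If the decoder fails for any reason, fall back to the original text
        return text

def names_match(a, b):
    # Hypothetical comparison hook: downgrade both sides before whatever
    # fuzzy/soundex comparison is run on them.
    return ascii_downgrade(a).lower() == ascii_downgrade(b).lower()
```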
#167
Calibre Plugins Developer
Hi ldolse,

Thanks for all the feedback.
I decided, as per the screenshot, I may as well offer "tweaking" of the soundex fuzziness. As I said, the values I picked were purely arbitrary, so others can, if they want, do multiple passes with smaller or larger values.

Thanks for the heads-up on the decode function. Actually, right now I don't do any conversion of those values - I just ignore them! So it was purely by coincidence (i and e have the same soundex value and were following each other) that it worked for you in this case. However, your suggestion to use this for all comparisons is, I think, an inspired idea - not on the "identical" matches, but on every other kind of match it makes sense. Obviously I don't know the downside, if any, to that function - is there any reason not to use it everywhere other than performance? Which isn't really much of a concern imho versus the improved matching.

In terms of exempting your Nopub tags - you can do this already by setting a restriction at the time you do your duplicate search. Duplicate search respects any restrictions applied, so just put "not tags:Nopub" as a restriction and you should be good to go.

For the keyboard shortcut - you can configure one yourself in the plugin configuration screen for that function. As promised earlier in this thread, I am going to remove the default I currently have in there for next/previous match. If someone has a brilliant suggestion for next/previous that doesn't conflict with existing calibre shortcuts, I am all ears. The best I have seen suggested was Ctrl+N (with maybe Ctrl+Shift+N for previous). If people want that then I will put it in, otherwise you can just configure your own.

The "last set" wonkiness sounds very odd and I haven't been able to replicate it. Was that showing one group at a time or all at once? You said finishing "all/most" - which was it? When did it go wonky - how many sets were remaining?
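Coming back to the soundex length tweak: a simplified sketch of how a configurable code length acts as the fuzziness knob. This is the textbook algorithm, not necessarily the plugin's exact implementation - a longer code is stricter, a shorter one fuzzier.

```python
def soundex(name, length=4):
    # Textbook Soundex: keep the first letter, encode the rest as digits,
    # then truncate/pad to 'length'. Longer codes mean stricter matching.
    codes = {}
    for letters, digit in (('BFPV', '1'), ('CGJKQSXZ', '2'), ('DT', '3'),
                           ('L', '4'), ('MN', '5'), ('R', '6')):
        for letter in letters:
            codes[letter] = digit
    name = ''.join(c for c in name.upper() if c.isalpha())
    if not name:
        return ''
    result = name[0]
    last = codes.get(name[0], '')
    for c in name[1:]:
        digit = codes.get(c, '')
        if digit and digit != last:
            result += digit
        if c not in 'HW':   # H and W do not separate repeated codes
            last = digit
    return result[:length].ljust(length, '0')

# Both 'Miéville' and 'Mieville' happen to reduce to 'M140' here, because the
# uncoded accented character is simply skipped - the coincidence described above.
print(soundex('Miéville'), soundex('Mieville'))
```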
#168
Wizard
Thanks for the other tips - glad that's all already trivial to do.
#169
Calibre Plugins Developer
It is quite a hit relatively speaking, but in real-world terms not enough to stop me going with it for the 1.0 release in the plugins forum. Scanning a 40,000-book library went from 0.7 seconds to just on 1 second.

If you are able to figure out some repro steps for the other issue, that would be great. No one else has reported anything like it, and I can't visualise the possible code issue from your description, so anything more concrete would help.
#170
Addict
This is great.
1.0 runs in less than a second, while up until 0.5 it took 15 seconds (leaving out my mass-dups). Thanks!!!
#171
Wizard
@kiwidude

For your consideration: your reply to the guy who wanted book content matching made me think of another guy who wanted to find binary duplicate book records.

My personal opinion is that there aren't many cases where binary duplicates won't have matching metadata that can be found with this plugin. Either this plugin will find the matching metadata, or the books won't be binary dupes due to internal metadata changes. Binary duplicate finding programs are also available to do this outside Calibre. I mention it only because (unlike content matching) it could be implemented with a reasonably fast routine, and there are some situations where it might prove useful. The case I recall was the guy who wanted this after adding lots of books that all ended up as individual book records for each format, all with Unknown author and Unknown title.
#172
Calibre Plugins Developer
@Starson - yes that's a good point we haven't discussed for a while.
Of course the downside, for people who haven't done this before they add books to their library, is how to actually resolve the problem if they then run an external tool - if they do a direct delete out of the calibre folders they end up with orphaned Calibre database records. I agree with you in that there should be no reason under "normal" circumstances it would be necessary. That "unknown author, unknown title" situation however sounds rather horrific!

I wonder what the best way to integrate that into the dialog would be? Perhaps there should be a new set of radio buttons across the top: "Title/Author", "ISBN", "Binary compare"? Would an appropriate implementation be to use the hashlib sha256 stuff, or something else?
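If the hashlib route were taken, a minimal sketch might look like the following. The file paths would presumably come from the library database; the function names here are illustrative only, not the plugin's code:

```python
import os
import hashlib
from collections import defaultdict

def format_hash(path, chunk_size=1024 * 1024):
    # Hash a format file in chunks so large books never sit fully in memory.
    sha = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            sha.update(chunk)
    return sha.hexdigest()

def group_binary_duplicates(paths):
    # Cheap pre-filter on file size: only files with an identical size can
    # possibly be binary duplicates, so only those get hashed.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) > 1:
            for path in same_size:
                by_hash[format_hash(path)].append(path)
    return [group for group in by_hash.values() if len(group) > 1]
```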
#173
Wizard
It seems to be a pretty rare problem, but it might be useful.
#174
Calibre Plugins Developer
Right, so there is a bit of a flaw it seems in the approach to storing the exemptions. It needs some kind of optimisation (or else a threshold limit above which users are not allowed to exempt a group).
At the moment, if you exempt a group it creates a dictionary entry for every book id in the exemption group, listing every other book in that group as its exemption pair matches. So if you exempt a group of (1,2,3), you end up with:

{ 1: (2,3), 2: (1,3), 3: (1,2) }

Now that works really well when the groups are small - it is simple and flexible to maintain. However, what do you do when someone does a fuzzy search, comes back with hundreds if not thousands of titles in the group, and then does mark as exempt on it? I can tell you what Calibre does - memory climbs very rapidly and then it crashes.

So - any suggestions as to what we do? The easy way out would be to put an exemption group size limit. After all, if you have hundreds or more books coming back in a group it should be a trivial thing to skip past that group. The alternative - I really hate to think about how much more complicated it will make the code, as the user is allowed to remove any one or more of those individual pairings.

Any thoughts?
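For reference, a minimal sketch of the per-pair exemption map as described above (illustrative only), which makes the quadratic growth easy to see:

```python
def build_exemption_map(group):
    # Every book id in the exempted group maps to every other id in the group.
    return {book_id: set(group) - {book_id} for book_id in group}

print(build_exemption_map([1, 2, 3]))
# {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
# A group of N books stores N * (N - 1) ids: 6 for a 3-book group, but
# roughly 4,000,000 for a 2,000-book group - hence the memory blow-up.
```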
#175
Grand Sorcerer
The size problem would still arise when building the exemption map. If the user selects all the books in a 40,000-book library and then says that they are exempt, things get dicey. The maximum size for a 40,000-book library would be 40,000*40,000, or 1,600,000,000 integers, which is too large to be practical.

You can avoid this sort of silliness by limiting the size of any exemption group to something like 1,000 books. Even that number is silly-large; perhaps 100 is sufficient. Another possibility is to allow exemption processing only when using the more conservative algorithms - the groups would by definition be smaller.

You might consider storing the exemption groups themselves instead of pairs. In this case, removing exemptions would be done pair-wise. If I have (10,15), I would remove 15 from 10 and 10 from 15. If the user asks to remove exemptions for (10, 11, 12), then it would be 10-(11, 12), 11-(10, 12), 12-(10, 11).

Random thoughts.
#176
Calibre Plugins Developer
Hey Charles,
I've been avoiding spending much time thinking through how this would work, but I'm afraid some of your random thoughts have me confused. And as your thoughts frequently lead to, or are based on, very good ideas, I would like to unconfuse myself.

My initial thought was also that you would have to store each exemption group as a list of its members, which I assume you mean the same way. So your persistence of groups would be a list of lists like:

[(1,2,3),(10,11,12,13,14,15),(1,5,6),...]

You say that this is "compatible" with the partitioning algorithm. I don't see how, without some surgery on it, unless you turn the above into an exemption map first? If that is the case, then I think we get back to the existing issue of the permutations, which only a threshold is going to resolve. Or were you thinking there is some way of doing this without an exemption map?

The last paragraph you lost me on again. Are you saying we would store both the exemptions as groups and some pairs indicating any removals from the groups? Or that you partition the exemption groups with a removal? Or something else?
#177
Grand Sorcerer
I am not convinced that the non_dup map memory is a problem in normal usage, but it does get problematic when people do silly things. The space problem you mention arises because the current non_dup map is a cartesian product of all the exemption groups, which has excellent performance properties but perhaps less than ideal space properties.

In our case the problem might be manageable. An exemption group of 200 elements will generate 200 sets of 200 elements, or 40,000 items. 1,000 elements will generate 1,000 sets, or 1,000,000 elements. Neither of these is outrageously large. To verify, I measured memory use of my test code: an exemption group of 1,000 elements used 21MB of memory. Unfortunately, if someone builds an exemption group of 5,000 elements, things start getting dicey. That number requires 700MB of memory. 8,000 elements requires 1.1GB. What this shows is that for a reasonable number of exemptions, the memory usage isn't out of line. A 'reasonable number' seems to be around 2,000 unique elements in the union of the exemption groups.

If you are uncomfortable with the limit and the non-graceful degradation, then another implementation suggests itself. This is the surgery you mentioned. Instead of initially building a cartesian product, we build a list of sets. When we need a non_dup map entry, we do the union of the sets. This alternate implementation trades off performance for memory stability. The amount of memory required is only that needed to store the exemption groups. The performance penalty is doing the union when required instead of once in the beginning.

Considering the performance penalty: we are again faced with the 'normal usage' issue. This implementation would do the equivalent of building the exemption map per duplicate group instead of once. If there is one duplicate group, then there is no penalty. If there are 1,000 duplicate groups, then we pay the cost of building the non_dup map 1,000 times. Thus the question: will we in fact encounter situations where we have both thousands of duplicate groups and exemption lists with thousands of elements? I hope not. In any event, this algorithm degrades more gracefully than the other one. One reason for this is that per duplicate group, the union is done only N times, where N is the number of elements in the duplicate group. If there are thousands of duplicate groups, they should be rather small, mitigating the performance problem.

Sample implementation under the spoiler. There aren't many changes. The not_duplicate_of_map is now a map to a list of sets. The union of these sets is computed when required, once per item in a duplicate group. You will see that the test has one exemption group of 100,000 elements.

Spoiler:
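(The spoiler's sample code isn't reproduced in this capture; as a rough sketch of the approach just described - a map to a list of sets, with the union computed only when a duplicate group needs it - something like the following, with illustrative names:)

```python
from collections import defaultdict

def build_non_dup_map(exemption_groups):
    # Each book id maps to the list of exemption groups (sets) containing it.
    # No cartesian product is expanded up front, so memory stays proportional
    # to the size of the stored exemption groups.
    non_dup_map = defaultdict(list)
    for group in exemption_groups:
        group = frozenset(group)
        for book_id in group:
            non_dup_map[book_id].append(group)
    return non_dup_map

def non_duplicates_of(non_dup_map, book_id):
    # Union the sets lazily, once per item in a duplicate group being checked.
    result = set()
    for group in non_dup_map.get(book_id, ()):
        result |= group
    result.discard(book_id)
    return result
```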
I think the last question is how to handle removal of exemptions. Not storing pairs complicates things a bit. I think that the following works. Assume that the user selects 'manage exemptions for this book'.

First problem: construct the exemption list for the book. This is done in the same way as duplicate partitioning: construct the non_dup map, then do the union of the sets for the master book.

Second problem: the user checks a book or two (or 1,000), asking that the non_dup exemption be removed for those books. What the user is really asking for is that the master book and the selected books never appear together in any exemption group. To satisfy this wish, you can:
- iterate over all the existing exemption groups, removing the master book. Exemption groups with length <= 1 are removed.
- create a new exemption group consisting of the master book and all unchecked books in the list.

This process works because what is initially displayed is the union of all the exemption groups containing the master book. After deleting the master book from all the exemption groups, that union is empty. Adding the new exemption group makes the union correct again.

One possible issue is that over time the number of exemption groups could grow. This is certainly true if the user selects a zillion books and marks them as exempt, then goes over that list over time to unmark them. Performance will degrade in this case, but I am not sure by how much.
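A sketch of that removal procedure in the same illustrative terms - exemption_groups is the stored list of sets, and unexempt_ids are the books the user ticked for removal:

```python
def remove_exemptions(exemption_groups, master, unexempt_ids):
    # Union of every group containing the master book, i.e. what the
    # 'manage exemptions' dialog would initially display.
    currently_exempt = set()
    for group in exemption_groups:
        if master in group:
            currently_exempt |= set(group)
    currently_exempt.discard(master)

    # 1. Drop the master book from every existing exemption group; groups
    #    reduced to a single member carry no information and are discarded.
    new_groups = [set(g) - {master} for g in exemption_groups]
    new_groups = [g for g in new_groups if len(g) > 1]

    # 2. Re-add one group pairing the master with the books left unchecked,
    #    i.e. the ones that should remain exempt.
    still_exempt = currently_exempt - set(unexempt_ids)
    if still_exempt:
        new_groups.append(still_exempt | {master})
    return new_groups
```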
#178
Calibre Plugins Developer
Great stuff as usual, Charles. All your comments around the partitioning mirrored my thoughts, it seems. My confusion over the compatibility issue was purely around the need for it to have a map, which we agree on.
I guess the question is that memory vs performance tradeoff, and whether to just go ahead and build dup group maps on the fly. That is not the only place the exemption map is used, of course - I have, for instance, menu items enabled based on whether a book has exemptions, along with the show all exemptions stuff. So these would need to change to use the build-on-the-fly stuff too.

Your answer to the issue of removing from an exemption group is a clever one. I was thinking of partitioning the exemption groups, but yours is interesting and simpler. I am honestly sceptical about how frequently users would be removing subsets of a group - we gave them the ability, but in reality I suspect few would use it, and sparingly at best. What it is saying is that you screwed up and have decided that some authors or books are actually duplicates of each other. Well, if they are duplicates, you are going to end up renaming the author or merging the books, so they will disappear from the exemption lists when my house cleaning kicks in anyway. So in other words I am not too concerned with ongoing fragmentation of the exemption sets from that particular action.

Sounds like the beginnings of a plan. I will try some testing to see what the performance impact is if we build the map on the fly. I think 1,000 members of a group is a reasonable limit anyway? To get that many members you must have either a really crappy set of book metadata, like all the titles unknown, or be storing something like magazines where the majority of the name is identical. It is trivial for a user to exclude magazines or whatever by using a search restriction if they don't want them to appear in the results of a soundex or fuzzy search.
#179
Calibre Plugins Developer
@Starson - there will be one issue with doing the binary compare in calibre - how to tell the user which formats were identical if the two books have multiple overlapping formats. Is this really a concern? In general we are just showing the user that they have two duplicate book entries which they need to resolve. They just won't know, if say both books have an ePub and a Mobi, which formats are different and which are identical and don't need examining.

One option that comes to mind is an extra menu item of something like "Binary compare selected books", which would attempt to tell you which formats of which book are duplicates of each other. I'm trying to continue to avoid custom columns. Or we could just not worry about it and leave that functionality for a future super-duper merge plugin, which would do the binary compare again as part of its "smart" merge recommendation...
#180
Grand Sorcerer
Don't add the limit until we have an idea what the performance might be. I think it will be acceptable, even with large exemption sets.