04-30-2011, 06:50 PM | #226 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
I have updated to 1.0.3 on the post above to ensure search_getting_ids is passed the current search restriction value from self.gui.tags_view.search_restriction
No, this wasn't a new bug. It was an old one that was always there for ISBNs, and copy/paste of the approach made it also apply to binary comparisons. I don't know what you mean by results tomorrow; you don't mean running overnight, do you? The binary search of 8000 books should complete pretty quickly. On my system it did 75,000 formats from 40,000 books in 4 minutes (about 65 seconds on a re-run). |
04-30-2011, 06:54 PM | #227 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
Well, that is what I meant.
I'm going to sleep now. It's still running, but my NAS is overloaded and I'm watching a show on the Internet, so it's going very slowly. If it takes too long (I will know by tomorrow) I will report it. Have a good night. |
05-01-2011, 04:26 AM | #228 | |
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Two thoughts:

- I suggested that you use delete_first = True when calling add_multiple_custom_book_data. After reading the posts about using a restriction on binary matching, I am no longer convinced I was right. If I run the check over my entire library, all values are cached. If I then run it over a restricted set, all values are thrown away except the small set. Next time I run it over the full library, all must be recomputed. This seems suboptimal. The thought in using the delete first was to clean the cache, but I think I just demonstrated that we really don't want to do that.

- The author match algorithm seems to operate only on first authors. For fun I duplicated a book then swapped the author names, and it isn't found as a dup. Should it be, or should the algorithm compare author-by-author? Continuing in the same vein, if I am doing an author-only dup check, shouldn't I be looking only at authors, and shouldn't each author be taken into account?

For example, assume I have two books titled Xyzzy, one by "Blogs, Joe & Angstrom, Alice" and one by "Angstrom, Alice & Blogs, Joe". In one school of thought, these books have identical titles and identical authors, yet no title+author search option will find them. This could be fixed by creating an entry in the candidates map for each author, not just the first author. Note that if this is done, the algorithm would decide that Xyzzy by "A, B & C, D" matches Xyzzy by "C, D", which I think is the right answer.

Another example: again I have two books, one by "Blogs, Joe & Angstrom, Alice" and another by "Anngstrom, Alice". I do a search by author, ignoring title. I would expect a soundex search to put these two books into a group because they share an author that matches by soundex but is spelled differently. This is a different case from the one above, because I am explicitly looking for authors; order should be irrelevant. Like above, this can be fixed by creating an entry in the candidate map for each author, not just the first author. |
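As a rough illustration of the "one entry per author" idea, here is a minimal sketch. It assumes the candidates map is a plain dict keyed on (title, author); the function names, the `normalize` helper and the book tuples are made up for the example and are not the plugin's actual code.

```python
from collections import defaultdict


def normalize(text):
    """Hypothetical normaliser: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def build_candidates(books):
    """Map (title, author) keys to sets of book ids, one key per author.

    `books` is an iterable of (book_id, title, authors) tuples, where
    `authors` is a list of author strings.
    """
    candidates = defaultdict(set)
    for book_id, title, authors in books:
        # One entry for every author, not only the first one.
        for author in authors:
            candidates[(normalize(title), normalize(author))].add(book_id)
    return candidates


def duplicate_groups(candidates):
    """Keep only the buckets that hold more than one book."""
    return [ids for ids in candidates.values() if len(ids) > 1]


books = [
    (1, "Xyzzy", ["Blogs, Joe", "Angstrom, Alice"]),
    (2, "Xyzzy", ["Angstrom, Alice", "Blogs, Joe"]),
    (3, "Xyzzy", ["Angstrom, Alice"]),
]
groups = duplicate_groups(build_candidates(books))
```

With these three books the "Blogs, Joe" bucket holds books 1 and 2, and the "Angstrom, Alice" bucket holds all three, so the swapped-order pair is found and the single-author book matches as well. A soundex variant would only change what `normalize` returns for the author part of the key.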
|
05-01-2011, 05:26 AM | #229 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
Another function that would be useful:
Author/Title swap. By binary comparison I found a duplicate book I would not have found with the other functions:
Book 1: Anna Karenina by Leo Tolstoy
Book 2: Leo Tolstoy by Anna Karenina
I also found 2 other books with this problem (both were books with a name as the title, so it was not that obvious at a quick look). |
05-01-2011, 05:27 AM | #230 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
When I did a rerun of the binary search with other network- and CPU-consuming programs closed, it worked fast.
So your plugin has no speed issues (I did not even think it had, but to be sure...). |
05-01-2011, 05:38 AM | #231 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Hi Charles,
I actually wrote you an email asking you about the delete of the cache parameter. I didn't send it because I figured there must be something I was missing again and thought I had met my dumb question quota for the day. I guess I should have looked closer, haha.

The authors point is an interesting one. One issue with having multiple author entries, which I didn't see you mention, is replication of groups. So if I have a duplicate title with these variations of authors...

A, B & C, D
C, D & A, B

I will have two buckets. Both will have the same titles in them. There could be variations of this "issue" depending on how many other A, B matches or C, D matches there are, so the buckets might be unbalanced.

I guess the question is whether this is a problem or not. If the user resolves their duplicates in order, the second group, if identical, would disappear automatically. If they skip through them with highlighting it may jump around a bit but still be valid. If they added exemptions using mark all groups it would create some duplication, I think, but not a major drama. |
05-01-2011, 05:47 AM | #232 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@drMerry, glad to hear you are happy with the performance. Just don't run the binary search with a restriction until I push a new version which doesn't delete your cache values.
In terms of title/author swap I will give the same answer I gave previously. I don't see it as a find duplicates thing to address, for the same reasons that having series info in the title shouldn't be matched. What if you only have one book, regardless of which way it is flipped? As I think I said previously, I see this as something a plugin like quality check could try to check for, though without other matches for the author name it is hard to identify. |
05-01-2011, 05:56 AM | #233 | |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
Quote:
Only one book ... would never be a duplicate (or always, since book A == book A).
The purpose of this plugin is to check for books that are in your library twice. The purpose of the quality check is to check individual books (in my opinion). So the previous case could not be found with quality check. You would not want to check a single book for a possible author swap. (What would be your reason to tell me Leo Tolstoy by Anna Karenina is swapped?) If there are 2 books with the same (or similar) name and title but swapped, you have a good indication of a mistake. So this option is only possible by testing several books against each other, which is something the duplicate search does but the quality check shouldn't.
All my opinion, of course. |
|
05-01-2011, 05:57 AM | #234 | |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
I see I used your last part to state my 'truth'.
Quote:
|
|
05-01-2011, 06:09 AM | #235 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
The point of mine you are missing is that books with title/author swapped that are also duplicates of each other are only a subset (which could be of size zero) of the books in your library that have title/author swapped.
So properly solving this problem requires a more comprehensive, multi-pronged approach than just looking for a duplicate title. |
05-01-2011, 06:12 AM | #236 | |||
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Quote:
Subsets can be removed rather easily, with performance that should be acceptable if there aren't thousands of groups. Something like this: Spoiler:
Quote:
|
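For illustration only, subset pruning along these lines could be sketched as follows. This is not the code from the post; it assumes each duplicate group is a set of book ids, and `prune_subsets` is a made-up name.

```python
def prune_subsets(groups):
    """Drop any group that duplicates, or is a subset of, another group."""
    # Collapse identical groups first, then visit the largest groups first
    # so each group only has to be compared against groups already kept.
    unique = sorted({frozenset(g) for g in groups}, key=len, reverse=True)
    kept = []
    for group in unique:
        if not any(group <= other for other in kept):
            kept.append(group)
    return [set(g) for g in kept]


pruned = prune_subsets([{1, 2}, {1, 2, 3}, {2, 1}, {4, 5}])
```

Here `{1, 2}` (seen twice) is dropped because it is contained in `{1, 2, 3}`, while `{4, 5}` survives. Each group is compared against every kept group, so the cost grows roughly with the square of the number of groups, which is why it stays acceptable only while there aren't thousands of them.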
|||
05-01-2011, 06:14 AM | #237 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
Well, that's right, but there is (in my opinion) no way to check all books for this.
If you have one book (say a little-known comic, all pictures) with swapped author/title and no ISBN, you cannot find metadata for it. So you will never know whether the info is swapped unless you read it yourself. The only way to solve some of the problems in your library is to find books that have duplicates in it with swapped author/title. It cannot solve all your problems (none of your plugins can, and they're still great) but it will solve some. Since it is a multi-book operation and an operation to find possible duplicates, it should be in this plugin, I think. Properly (= completely) solving this problem is not an option. |
05-01-2011, 06:24 AM | #238 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Charles, glad to hear I wasn't imagining things. Totally agree that pruning the groups is the right thing to do; I was just thinking out loud about the impact of not handling it.
|
05-01-2011, 06:29 AM | #239 | |
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Edit: it is worse than I thought. The author names must be added in both FN LN and LN, FN order, because we don't know the order of any 'names' in the title. That means 3 entries per book, not two.

Yes, one can say "don't do that", but we have hard experience that people will do 'unreasonable' things because they seem perfectly reasonable to them. Then kiwidude must deal with an unhappy user. Up to him if he wants to take on that burden.

Last edited by chaley; 05-01-2011 at 08:35 AM. |
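To make the "3 entries per book" count concrete, here is a minimal sketch. It assumes swap detection would put the title and both forms of the author name into one lookup, for single-author books; all names here (`author_forms`, `swap_candidates`, the book tuples) are hypothetical and not taken from the plugin.

```python
from collections import defaultdict


def normalize(text):
    """Hypothetical normaliser: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def author_forms(author):
    """Return the author name in both 'FN LN' and 'LN, FN' order."""
    if "," in author:
        last, first = (part.strip() for part in author.split(",", 1))
    else:
        first, _, last = author.strip().rpartition(" ")
    if not first or not last:
        # Single-word names have only one form.
        return {normalize(author)}
    return {normalize(f"{first} {last}"), normalize(f"{last}, {first}")}


def swap_candidates(books):
    """Pair up books where one book's title equals another book's author."""
    by_key = defaultdict(lambda: {"title": set(), "author": set()})
    for book_id, title, author in books:
        by_key[normalize(title)]["title"].add(book_id)  # entry 1
        for form in author_forms(author):  # entries 2 and 3
            by_key[form]["author"].add(book_id)
    pairs = set()
    for roles in by_key.values():
        for title_id in roles["title"]:
            for author_id in roles["author"]:
                if title_id != author_id:
                    pairs.add(frozenset((title_id, author_id)))
    return pairs


books = [
    (1, "Anna Karenina", "Leo Tolstoy"),
    (2, "Leo Tolstoy", "Anna Karenina"),
    (3, "War and Peace", "Tolstoy, Leo"),
]
pairs = swap_candidates(books)
```

Book 2's title matches the author of book 1 directly, but it matches book 3 only because "Tolstoy, Leo" was also stored in FN LN order. A pair here is only a hint that something may be swapped; it does not say which book is wrong or that the two are duplicates.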
|
05-01-2011, 06:31 AM | #240 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
drMerry - I agree there is no 100% guaranteed way. However, I utterly disagree that looking for duplicates is the only or best way, for the reasons I keep repeating: there will be many books with title/author swapped which could be found very easily by non-duplicate approaches. And this will solve a far higher percentage of the problems than trying to bodge it into find duplicates will do. When I find the time it will appear in quality check, which IMHO is a more appropriate place for it.
Will there still be edge cases that it cannot find? Absolutely. But I can guarantee that it will find many more cases than putting it in this plugin would. |
|