05-01-2011, 06:41 AM | #241 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
@kiwidude
Alright, case closed, thanks. @chaley: this is not a 'must be'. For example, you could use one hash set for authors (so every author is inserted just once) and one for titles. If you find a match in both sets, you look up the books having the matched title (A). Then you look up the authors (B) of this book and check whether these authors (B) match any titles written by author (A). If so, it is a match. No big memory issue, and not even a big CPU issue, I think. |
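A rough sketch of one way to read the two-index idea in the post above. It is not the plugin's code: the `books` layout (book id mapped to a title and a list of authors) and the function names are assumptions made up for illustration, and matching is plain case-insensitive equality.

```python
from collections import defaultdict

def build_indexes(books):
    """books: hypothetical dict of book_id -> (title, [authors])."""
    by_title = defaultdict(set)   # lower-cased title  -> book ids
    by_author = defaultdict(set)  # lower-cased author -> book ids
    for book_id, (title, authors) in books.items():
        by_title[title.lower()].add(book_id)
        for author in authors:
            by_author[author.lower()].add(book_id)
    return by_title, by_author

def find_duplicates(books):
    """Return groups of books sharing a title and at least one author."""
    by_title, by_author = build_indexes(books)
    groups = set()
    for ids in by_title.values():
        if len(ids) < 2:
            continue  # a title held by one book cannot be a duplicate
        for book_id in ids:
            _, authors = books[book_id]
            for author in authors:
                # Books with this title that also list this author.
                shared = ids & by_author[author.lower()]
                if len(shared) > 1:
                    groups.add(frozenset(shared))
    return groups
```

Each book is inserted once per title and once per author, so memory grows with the number of title/author entries. The cost is one set intersection per author of every book whose title is shared.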
05-01-2011, 06:53 AM | #242 | |
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
In addition, I don't see how it would work at reasonable performance. By definition you would have a book's title and authors in the main sets. It seems that you would be doing multiple set intersections on a book-by-book basis, especially when factoring in exemption groups. But this is neither here nor there, as I could easily be wrong. |
|
05-01-2011, 07:35 AM | #243 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
I did not look into the code, so I do not know about differences in implementation.
But since this is a different function from the current duplicate checks, it might have to be implemented differently. And since this option is not implemented, I think I'm 'spamming' this topic by sharing my ideas. But if you want to know: Spoiler:
Last edited by drMerry; 05-01-2011 at 07:40 AM. Reason: updated some 'sample code' |
05-01-2011, 01:11 PM | #244 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
drMerry - one thing I don't think you have commented on as yet is the memory stability of this version. You had issues with using 1.0 on an old laptop with a lot of exemptions - have you tried repeating the scenario with the new version and is that problem resolved?
|
05-01-2011, 02:22 PM | #245 | |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
Quote:
no problems any more!! |
|
05-01-2011, 09:23 PM | #246 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
v1.0.4 Beta
Changes in this release:
As always any feedback appreciated. Once again a number of core areas were affected by adding support for multiple author handling so there could be some gremlins lurking that my quick testing has not yet found. Last edited by kiwidude; 05-02-2011 at 09:03 AM. Reason: Removed attachment as later version in thread |
05-02-2011, 04:00 AM | #247 |
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Works a treat.
The new multiple authors stuff found a previously undetected problem. I had a book X1 with authors "B, A & D, C", and another book in the series X2 with authors "B, A and C D". The authors-only test zeroed right in on it. |
05-02-2011, 05:10 AM | #248 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Cool Charles, only fair that since you suggested the improvement you get some benefit out of it
There is one more scenario that this plugin will not catch and I don't know whether to try to cater for it. That is where the authors have the names in a different order but not swapped with a comma. So A B and B, A will match, but A B and B A will not, nor will A, B with B, A.
You could find duplicate books using an ignore author search, so that is fine if you do have a duplicate title. However a user might argue that an author duplicate based search should find this.
How often does it happen? Probably more than it should. I think the option of swapping names when adding authors is partly at fault, as if you have that selected it can give unintended results. It is once again that old chestnut of no setting for commas in a display name. So if I have a file with the name A B for author, and swap names checked, then I get the author B A rather than B, A. I think it could have its logic tweaked to say if no comma when swapping then add one in and vice versa. However that might upset people who for some reason had the names stored without a comma the wrong way around and do not want commas in the display name. How prevalent that is I do not know.
Now if the user downloads metadata and has overwrite author ticked, then the name gets fixed so the problem should go away. However there is still the issue for a lot of legacy books or where the user decides not to download metadata.
So... is it something I should try to cater for? I think it should just be in the author duplicate (ignore title) searches if we do. There is the minor thing of it creating more false positives, such as two authors whose names when flipped happen to match, but hopefully that is relatively rare and easily exempted.
EDIT: I'm going to dispute my own suggestion here (I do talk to myself a lot). I think that differentiating between title vs author searches is wrong, they should both have this check. The question should be just whether identical author searches should have it or not, to which I think the correct answer is no. It means more code twiddling on my part but hopefully a more consistent result so you can get A B / B,A / B A and A,B all in the same group. Last edited by kiwidude; 05-02-2011 at 05:56 AM. Reason: Added extra thought |
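To make the four variants concrete, here is a minimal sketch of an order-insensitive author key. It is an assumption for illustration, not what the plugin does: `name_key` and `group_by_name_key` are made-up names, and the key simply drops commas, lower-cases, and sorts the name parts.

```python
from collections import defaultdict

def name_key(author):
    """Order-insensitive key: 'A B', 'B, A', 'B A' and 'A,B' all agree."""
    tokens = author.replace(",", " ").lower().split()
    return tuple(sorted(tokens))

def group_by_name_key(authors):
    """Group author strings whose name parts are the same in any order."""
    groups = defaultdict(list)
    for author in authors:
        groups[name_key(author)].append(author)
    # Only keys shared by two or more spellings are worth showing.
    return [names for names in groups.values() if len(names) > 1]
```

The trade-off is the one raised in the post: two different authors whose names are each other's flip end up in the same group and have to be exempted by hand.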
05-02-2011, 07:20 AM | #249 | ||
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Quote:
It might be that the importance of this issue is sufficient to embed a small quality check into the dups code, so that installing the quality check plugin isn't required. The problem I see is that the check has nothing to do with duplicates, so the UI and code aren't quite right. I suppose that you could construct a single dup group containing books with authors that don't conform to the desired format. |
||
05-02-2011, 07:43 AM | #250 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
An interesting point - given that I have been dishing out "this is not a duplicate issue, it should be in Quality Check" answers for others, it is only fair that a similar suggestion be made to me.
I guess what I saw as being slightly different in this situation is that "Author Duplicate" searches are all about finding similar variants of the same name. We currently find A B and B, A (as well as A C. B etc etc). So my thought on this was that if you are showing A B and B,A then why not also show A,B and B A. That's where I would suggest there is an argument to say the check "could" have something to do with duplicates. However as with my own post edits you can see how such thinking can lead to it pervading the title based searches as well and the line becomes very grey.
However I completely agree that putting it as a Quality Check function is a consistent alternative approach. After all if we are saying that QC should try to detect titles having series info in them, and (one day) titles and authors being the wrong way around, why not also have a check for author names reversed? Particularly as it already has the slight variants such as checks for authors with/without commas. The downside is that it is "another thing" you must run from time to time.
From an implementation perspective, if it was in Find Duplicates I was just thinking I could compute two hashes for author names - one as per now, one as reversed. Store the alternate hash in a separate dictionary, then at the end do a pass through that dictionary testing to see if the reversed hash value exists in the candidate groups and if so merge the results together. Something like that anyways, I have not given it masses of thought. |
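The two-hash idea at the end of the post above could look roughly like this. It is only a sketch under assumptions: `author_hash` is a stand-in for the plugin's existing hash (here it just lower-cases and turns "B, A" into "a b"), "reversed" is taken to mean the name parts in reverse order, and all function names are invented.

```python
from collections import defaultdict

def author_hash(name):
    """Stand-in for the existing hash: 'B, A' is treated as 'A B'."""
    name = name.strip().lower()
    if "," in name:
        last, _, first = name.partition(",")
        name = f"{first.strip()} {last.strip()}"
    return " ".join(name.split())

def reversed_hash(name):
    """Alternate hash: the same name parts in reverse order (assumption)."""
    return " ".join(reversed(author_hash(name).split()))

def group_authors(names):
    groups = defaultdict(set)  # primary hash  -> author spellings
    alternates = {}            # reversed hash -> primary hash
    for name in names:
        primary = author_hash(name)
        groups[primary].add(name)
        alternate = reversed_hash(name)
        if alternate != primary:
            alternates[alternate] = primary
    # Final pass: a reversed hash that is also a primary hash means two
    # candidate groups hold the same author, so merge them.
    for alternate, primary in alternates.items():
        if alternate in groups and primary in groups:
            groups[primary] |= groups.pop(alternate)
    return [g for g in groups.values() if len(g) > 1]
```

With the input `["A B", "B, A", "B A", "A, B"]` the first two names hash to "a b" and the last two to "b a"; the final pass finds "b a" among the candidate groups and folds it into "a b", giving one group of four.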
05-02-2011, 07:54 AM | #251 | ||
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Quote:
Quote:
If we are doing author checks, then the same thing still works (I think). Adding all permutations of an author will generate two duplicate groups for every author that appears in both forms. Again, set pruning will take care of this. I don't think there is a memory issue here, because the number of additional candidate sets == the number of authors.
Edit: The number of additional candidate sets is equal to the number of title/author pairs, not the number of authors. We will see if this is too big. Last edited by chaley; 05-02-2011 at 08:53 AM.
||
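A small sketch of the "add all permutations, then prune" approach described in the post above. The `books` layout (book id mapped to one author string) and both function names are assumptions for illustration; the real candidate-set code is not shown in this thread.

```python
from collections import defaultdict
from itertools import permutations

def candidate_groups(books):
    """books: hypothetical dict of book_id -> author string."""
    groups = defaultdict(set)
    for book_id, author in books.items():
        tokens = author.replace(",", " ").lower().split()
        # File the book under every ordering of its name parts.
        # The count grows factorially, so long names would need a cap.
        for perm in set(permutations(tokens)):
            groups[perm].add(book_id)
    return [g for g in groups.values() if len(g) > 1]

def prune(groups):
    """Drop groups that duplicate, or are subsets of, a kept group."""
    unique = sorted({frozenset(g) for g in groups}, key=len, reverse=True)
    kept = []
    for group in unique:
        if not any(group <= other for other in kept):
            kept.append(group)
    return kept
```

For books by "A B" and "B A" the first function yields the same pair twice, once under each ordering, which is the doubled group mentioned in the post; `prune` reduces that to a single group.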
05-02-2011, 08:14 AM | #252 | |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Quote:
I think I will put it in this plugin. Unlike some of the other suggestions this is one example where you can *only* detect it by having multiple *variants* of the author. So any occurrences not found by the find duplicates approach could not be found by any other means (except for comparing with some sort of external authors database). It is fairly easy and "cheap" to add here, and as you have said above it could occur in people's databases rather more often than they might like if they have been a bit undisciplined in their approach to adding books. |
|
05-02-2011, 09:03 AM | #253 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
v1.0.5 Beta
Changes in this release:
Barring anything raised here this is the code I will release as v1.1 later today. Last edited by kiwidude; 05-02-2011 at 01:18 PM. Reason: Removed attachment as later version in thread |
05-02-2011, 11:41 AM | #254 |
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
One small problem: I have a book with two different authors with the same soundex value. When doing an authors-only check, I get a group with one book in it. Was this intended?
Edit: It also would be nice to have "Don't show me again" checkboxes, at least on the add exemption warning dialogs. Last edited by chaley; 05-02-2011 at 11:43 AM. |
05-02-2011, 11:43 AM | #255 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Hmmm... a new situation pops up from the co-author changes. No doubt you could also get the same scenario if the co-authors were "similar" etc. Thanks for flagging this up. I guess the question is - is this actually invalid?
|