05-01-2011, 06:41 AM   #241
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
@kiwidude
Alright, case closed, thanks

@chaley
This is not a 'must be'.
For example, you could use one hash set of authors (so every author is inserted just once) and one of titles.
If you find a match in both sets, you look up the books having the matched title (A). Then you look up the authors (B) of this book and check whether these authors (B) match any titles written by author (A).
Then it is a match. No big memory issue, and not even a big CPU issue, I think.
05-01-2011, 06:53 AM   #242
chaley
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Originally Posted by drMerry View Post
@chaley
This is not a 'must be'.
For example, you could use one hash set of authors (so every author is inserted just once) and one of titles.
If you find a match in both sets, you look up the books having the matched title (A). Then you look up the authors (B) of this book and check whether these authors (B) match any titles written by author (A).
Then it is a match. No big memory issue, and not even a big CPU issue, I think.
The problem is that what you propose is a completely different implementation from what is there now. My 'only way' assumed that kiwidude wasn't going to rewrite the plugin from the ground up.

In addition, I don't see how it would work at reasonable performance. By definition you would have a book's title and authors in the main sets. It seems that you would be doing multiple set intersections on a book-by-book basis, especially when factoring in exemption groups. But this is neither here nor there, as I could easily be wrong.
05-01-2011, 07:35 AM   #243
drMerry
I did not look into the code, so I do not know about the differences in implementation.
But since this is a different function from the current duplicate checks, it may have to be implemented differently.
But since this option is not implemented, I think I'm 'spamming' this topic by sharing my ideas.
But if you want to know:
Spoiler:
If the implementation is completely different from the current one, that does not mean you have to rewrite old code. You just have to add a new function. Because it is a duplicate scan, I would add it to this plugin, not to a new one.

For performance, think of this:
Code:
Get a list of unique authors - authors[Author, isAuthFlag = true]
Get a list of unique titles - titles[Title, isAuthFlag = false]
^^^^My hash sets^^^^
Merge both lists - BasicList[Text, isAuthFlag, matchFlag = false]
Sort BasicList alphabetically (on text)
Iterate BasicList 
{
   if current isAuthFlag == next isAuthFlag 
   {
      functionRemove(id)
   }
   else if current Text matches next Text
   {
      set matchFlag true
      functionRemove(id+1)
      proceed with next
   }
   else
   {
      functionRemove(id)
   }
   if no next and no match
   {
      functionRemove(id)
   }
}
if BasicList.size > 0
{
   iterate BasicList
   {
      lookup text in titles
      Lookup authors of this book
      If match (using one of the already implemented functions: identical, sim, fuz, sound)
      {
         add book to matchlist
         remove id from BasicList
      }
      else
      {
         remove id from BasicList
      }
   }
}

functionRemove(id)
{
   if id.matchFlag == false
   remove id from list
}
EDIT: While typing I changed some of the code in my head; I think that within a few minutes the matchFlag became obsolete.
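For readers who prefer something runnable, here is a minimal Python sketch of the two-set idea described above. It is an illustration only, not plugin code: the `(book_id, title, authors)` tuple shape and the names `normalize` and `find_swapped_pairs` are assumptions made for the example, and it uses exact matching where the post suggests reusing the existing identical/similar/fuzzy/soundex functions.

```python
from collections import defaultdict


def normalize(text):
    """Lower-case and collapse whitespace so trivially different strings compare equal."""
    return " ".join(text.lower().split())


def find_swapped_pairs(books):
    """Return pairs of book ids whose title and author fields look swapped.

    `books` is an iterable of (book_id, title, authors) tuples. This shape is
    an assumption for the sketch, not the plugin's data model.
    """
    titles = defaultdict(set)   # normalized title  -> ids of books with that title
    authors = defaultdict(set)  # normalized author -> ids of books by that author
    info = {}                   # book id -> (title key, set of author keys)
    for book_id, title, book_authors in books:
        title_key = normalize(title)
        author_keys = {normalize(a) for a in book_authors}
        info[book_id] = (title_key, author_keys)
        titles[title_key].add(book_id)
        for key in author_keys:
            authors[key].add(book_id)

    pairs = set()
    # Only strings that occur both as a title and as an author are candidates.
    for key in titles.keys() & authors.keys():
        for titled in titles[key]:
            for authored in authors[key]:
                if titled == authored:
                    continue
                # Confirm the reverse direction: the title of the second book
                # must be one of the authors of the first book.
                if info[authored][0] in info[titled][1]:
                    pairs.add(tuple(sorted((titled, authored))))
    return pairs
```

The set intersection does the cheap filtering; the nested loops only run for strings that appear in both roles, which should be a small fraction of a typical library.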

Last edited by drMerry; 05-01-2011 at 07:40 AM. Reason: updated some 'sample code'
05-01-2011, 01:11 PM   #244
kiwidude
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
drMerry - one thing I don't think you have commented on as yet is the memory stability of this version. You had issues with using 1.0 on an old laptop with a lot of exemptions - have you tried repeating the scenario with the new version and is that problem resolved?
05-01-2011, 02:22 PM   #245
drMerry
Quote:
Originally Posted by kiwidude View Post
drMerry - one thing I don't think you have commented on as yet is the memory stability of this version. You had issues with using 1.0 on an old laptop with a lot of exemptions - have you tried repeating the scenario with the new version and is that problem resolved?
whoops sorry
no problems any more!!
05-01-2011, 09:23 PM   #246
kiwidude
v1.0.4 Beta

Changes in this release:
  • Compare co-authors in duplicate considerations rather than only the first author as was the case previously. So books by "A B" and "C D & A B" can now be matched together.
  • No longer delete cached hash values from previous binary comparison runs
  • Modified the Manage exemptions dialog to cater for multiple author exemptions for a book, showing each coauthor in its own tab.
  • Also modified that dialog to only show the relevant book/author exemptions section based on which type of exemptions your selection has.
  • A few bug-fixes for author exemptions not picked up from the last betas.
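
As a rough illustration of the co-author change in the first bullet, the sketch below indexes each book under every one of its authors, so "A B" and "C D & A B" end up sharing a candidate key. The function names and the `(book_id, title, authors_field)` shape are hypothetical and not taken from the plugin source; the only thing borrowed from the post is the "&" separator between co-authors.

```python
from collections import defaultdict


def split_authors(authors_field):
    """Split an 'A B & C D' style string into individual author names."""
    return [a.strip() for a in authors_field.split("&") if a.strip()]


def duplicate_groups(books):
    """Group book ids that share a title and at least one co-author.

    `books` is an iterable of (book_id, title, authors_field) tuples
    (an assumed shape for this sketch).
    """
    candidates = defaultdict(set)
    for book_id, title, authors_field in books:
        # One candidate key per co-author, not just the first author.
        for author in split_authors(authors_field):
            candidates[(title.lower(), author.lower())].add(book_id)
    # A group is only interesting when it holds more than one book.
    return [ids for ids in candidates.values() if len(ids) > 1]
```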

As always any feedback appreciated. Once again a number of core areas were affected by adding support for multiple author handling so there could be some gremlins lurking that my quick testing has not yet found.
Attached thumbnail: Screenshot_4_ManageExemptions.png (27.1 KB)

Last edited by kiwidude; 05-02-2011 at 09:03 AM. Reason: Removed attachment as later version in thread
05-02-2011, 04:00 AM   #247
chaley
Works a treat.

The new multiple authors stuff found a previously undetected problem. I had a book X1 with authors "B, A & D, C", and another book in the series X2 with authors "B, A and C D". The authors-only test zeroed right in on it.
05-02-2011, 05:10 AM   #248
kiwidude
Cool Charles, only fair that since you suggested the improvement you get some benefit out of it

There is one more scenario that this plugin will not catch and I don't know whether to try to cater for it. That is where the author names are in a different order but not swapped with a comma. So A B and B, A will match, but A B and B A will not, nor will A, B with B, A.

You could find duplicate books using an ignore author search so that is fine if you do have a duplicate title. However a user might argue that an author duplicate based search should find this.

How often does it happen? Probably more than it should. I think the option of swapping names when adding authors is partly at fault, as having it selected can give unintended results. It is once again that old chestnut of there being no setting for commas in a display name. So if I have a file with the name A B for author, and swap names checked, then I get the author B A rather than B, A. I think its logic could be tweaked to say: if there is no comma when swapping then add one in, and vice versa. However that might upset people who for some reason had the names stored without a comma the wrong way around and do not want commas in the display name. How prevalent that is I do not know.
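
The tweak described above ("if no comma when swapping then add one in and vice versa") could look something like the following. This is only a sketch of the proposed rule, not calibre's actual swap-names behaviour, and `swap_author_name` is a made-up name.

```python
def swap_author_name(name):
    """Swap first/last name order, toggling the comma as it goes.

    'A B'  -> 'B, A'   (no comma before, so one is added)
    'B, A' -> 'A B'    (comma before, so it is removed)
    """
    if "," in name:
        last, _, first = name.partition(",")
        return f"{first.strip()} {last.strip()}"
    parts = name.split()
    if len(parts) < 2:
        # A single-word name has nothing to swap.
        return name.strip()
    # Treat the final word as the surname (a simplifying assumption).
    return f"{parts[-1]}, {' '.join(parts[:-1])}"
```

Applying the function twice returns the original two-word name, which is what makes the comma toggle attractive.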

Now if the user downloads metadata and has overwrite author ticked, then the name gets fixed so the problem should go away. However there is still the issue for a lot of legacy books or where the user decides not to download metadata.

So... Is it something I should try to cater for? I think it should just be in the author duplicate (ignore title) searches if we do. There is the minor thing of it creating more false positives, such as two authors whose names when flipped happen to match but hopefully that is relatively rare and easily exempted.

EDIT: I'm going to dispute my own suggestion here (I do talk to myself a lot). I think that differentiating between title vs author searches is wrong; they should both have this check. The question should just be whether identical author searches should have it or not, and I think the correct answer is no. It means more code twiddling on my part but hopefully a more consistent result, so you can get A B / B,A / B A and A,B all in the same group.

Last edited by kiwidude; 05-02-2011 at 05:56 AM. Reason: Added extra thought
05-02-2011, 07:20 AM   #249
chaley
Quote:
Originally Posted by kiwidude View Post
How often does it happen? Probably more than it should. I think the option of swapping names when adding authors is partly at fault, as having it selected can give unintended results. It is once again that old chestnut of there being no setting for commas in a display name. So if I have a file with the name A B for author, and swap names checked, then I get the author B A rather than B, A. I think its logic could be tweaked to say: if there is no comma when swapping then add one in, and vice versa. However that might upset people who for some reason had the names stored without a comma the wrong way around and do not want commas in the display name. How prevalent that is I do not know.
Given that one of the author_sort tweak options asked for supports B A, I think it might happen more than one would want.
Quote:
EDIT: I'm going to dispute my own suggestion here (I do talk to myself a lot). I think that differentiating between title vs author searches is wrong; they should both have this check. The question should just be whether identical author searches should have it or not, and I think the correct answer is no. It means more code twiddling on my part but hopefully a more consistent result, so you can get A B / B,A / B A and A,B all in the same group.
The problem you are considering stems from the fact that the author name order might not be consistent. Yes, this happens all the time. However, isn't checking for this a problem for Quality Check? If I first verify using Quality Check that my names are the way I want and fix the ones that are wrong, then doesn't the problem you are addressing here go away?

It might be that the importance of this issue is sufficient to embed a small quality check into the dups code, so that installing the quality check plugin isn't required. The problem I see is that the check has nothing to do with duplicates, so the UI and code isn't quite right. I suppose that you could construct a single dup group containing books with authors that don't conform to the desired format.
05-02-2011, 07:43 AM   #250
kiwidude
An interesting point. Given that I have been dishing out "this is not a duplicate issue, it should be in Quality Check" answers to others, it is only fair that the same suggestion be made to me.

I guess what I saw as being slightly different in this situation is that "Author Duplicate" searches are all about finding similar variants of the same name. We currently find A B and B, A (as well as A C. B etc etc). So I guess my thought on this was that if you are showing A B and B,A then why not also show A,B and B A. That's where I would suggest there is an argument to say the check "could" have something to do with duplicates. However as with my own post edits you can see how such thinking can lead to it pervading the title based searches as well and the line becomes very grey.

However I completely agree that putting it as a Quality Check function is a consistent alternate approach. After all if we are saying that QC should try to detect titles having series info in them, and (one day) titles and authors being the wrong way around, why not also have a check for author names reversed. Particularly as it already has the slight variants such as checks for authors with/without commas. The downside is that it is "another thing" you must run from time to time.

From an implementation perspective, if it was in Find Duplicates I was thinking I could compute two hashes for author names: one as per now, one reversed. Store the alternate hash in a separate dictionary, then at the end do a pass through that dictionary testing whether the reversed hash value exists in the candidate groups, and if so merge the results together. Something like that anyway; I have not given it masses of thought.
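
A minimal sketch of that two-hash idea, under some loud assumptions: plain normalized strings stand in for the plugin's hashes, "reversed" is taken to mean moving the last word to the front, and the names `author_key`, `reversed_key` and `group_by_author` are invented for the example.

```python
from collections import defaultdict


def author_key(name):
    """Current-style key (assumed): 'B, A' and 'A B' both become 'a b'."""
    if "," in name:
        last, _, first = name.partition(",")
        name = f"{first} {last}"
    return " ".join(name.lower().split())


def reversed_key(key):
    """Alternate key: the last word moved to the front ('a b' -> 'b a')."""
    tokens = key.split()
    return " ".join(tokens[-1:] + tokens[:-1])


def group_by_author(books):
    """`books` is an iterable of (book_id, author) pairs (assumed shape)."""
    groups = defaultdict(set)   # primary key -> book ids
    reversed_to_primary = {}    # reversed key -> primary key it came from
    for book_id, author in books:
        key = author_key(author)
        groups[key].add(book_id)
        reversed_to_primary[reversed_key(key)] = key

    # Final pass: if a reversed key is itself a candidate group, merge the two.
    for rev, primary in reversed_to_primary.items():
        if rev != primary and rev in groups and primary in groups:
            groups[primary] |= groups.pop(rev)

    return {k: ids for k, ids in groups.items() if len(ids) > 1}
```

With this, "Isaac Asimov", "Asimov, Isaac" and "Asimov Isaac" land in one group, while the extra memory is one dictionary entry per distinct author key.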
05-02-2011, 07:54 AM   #251
chaley
Quote:
Originally Posted by kiwidude View Post
The downside is that it is "another thing" you must run from time to time.
That was my concern that led to the 'perhaps it should be here'.
Quote:
From an implementation perspective, if it was in Find Duplicates I was thinking I could compute two hashes for author names: one as per now, one reversed. Store the alternate hash in a separate dictionary, then at the end do a pass through that dictionary testing whether the reversed hash value exists in the candidate groups, and if so merge the results together. Something like that anyway; I have not given it masses of thought.
If we are trying to find duplicate books where the author names are swapped around, then simply adding all forms of the author to the candidate map should work. If there is a book X with author "A B", and another book X with author "B, A", then adding both permutations for X will generate two identical duplicate groups. These should be pruned, assuming you have added the set pruning stuff.

If we are doing author checks, then the same thing still works (I think). Adding all permutations of an author will generate two duplicate groups for every author that appears in both forms. Again, set pruning will take care of this.

I don't think there is a memory issue here, because the number of additional candidate sets == the number of authors.

Edit: The number of additional candidate sets is equal to number of title/author pairs, not the number of authors. We will see if this is too big.
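
To make the "add all forms, then prune" approach concrete, here is a small Python sketch. It assumes a candidate map keyed on (title, author form) and a simple subset-based pruning step; the plugin's real candidate map and set pruning may well differ, and `author_forms`, `prune` and `find_groups` are names made up for the illustration.

```python
from collections import defaultdict


def author_forms(name):
    """Return both orderings of a name, e.g. 'B, A' -> {'b a', 'a b'}."""
    tokens = name.replace(",", " ").lower().split()
    forward = " ".join(tokens)
    rotated = " ".join(tokens[-1:] + tokens[:-1])
    return {forward, rotated}


def prune(groups):
    """Drop any group that equals, or is a subset of, a group already kept."""
    kept = []
    for group in sorted(groups, key=len, reverse=True):
        if not any(group <= other for other in kept):
            kept.append(group)
    return kept


def find_groups(books):
    """`books` is an iterable of (book_id, title, author) tuples (assumed shape)."""
    candidates = defaultdict(set)
    for book_id, title, author in books:
        # Every form of the author gets its own entry in the candidate map,
        # so the map grows by one set per title/author pair.
        for form in author_forms(author):
            candidates[(title.lower(), form)].add(book_id)
    groups = [ids for ids in candidates.values() if len(ids) > 1]
    return prune(groups)
```

For book X with author "A B" and another book X with author "B, A", both forms produce the same set of two ids, and pruning collapses the two identical groups into one.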

Last edited by chaley; 05-02-2011 at 08:53 AM.
05-02-2011, 08:14 AM   #252
kiwidude
Quote:
Originally Posted by chaley View Post
If we are trying to find duplicate books where the author names are swapped around, then simply adding all forms of the author to the candidate map should work. If there is a book X with author "A B", and another book X with author "B, A", then adding both permutations for X will generate two identical duplicate groups. These should be pruned, assuming you have added the set pruning stuff.
True, I had forgotten about the set pruning (which is there). Like I said, I hadn't given it a huge amount of thought outside of whether we want it and where

I think I will put it in this plugin. Unlike some of the other suggestions this is one example where you can *only* detect it by having multiple *variants* of the author. So any occurrences not found by the find duplicates approach could not be found by any other means (except for comparing with some sort of external authors database). It is fairly easy and "cheap" to add here, and as you have said above it could occur in people's databases rather more often than they might like if they have been a bit undisciplined in their approach to adding books.
05-02-2011, 09:03 AM   #253
kiwidude
v1.0.5 Beta

Changes in this release:
  • Include swapping author name order (to cater for invalid metadata) in all but identical author checks. So A B / B A or A,B / B,A will match.
  • Prevent the user from choosing the Ignore title, Identical Author combination (as this will never produce duplicates)

Barring anything raised here this is the code I will release as v1.1 later today.

Last edited by kiwidude; 05-02-2011 at 01:18 PM. Reason: Removed attachment as later version in thread
05-02-2011, 11:41 AM   #254
chaley
One small problem: I have a book with two different authors with the same soundex value. When doing an authors-only check, I get a group with one book in it. Was this intended?


Edit: It also would be nice to have "Don't show me again" checkboxes, at least on the add exemption warning dialogs.

Last edited by chaley; 05-02-2011 at 11:43 AM.
05-02-2011, 11:43 AM   #255
kiwidude
Hmmm... a new situation pops up from the co-author changes. No doubt you could also get the same scenario if the co-authors were "similar" etc. Thanks for flagging this up. I guess the question is - is this actually invalid?
All times are GMT -4.