Old 04-30-2011, 06:50 PM   #226
kiwidude
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
I have updated to 1.0.3 on the post above to ensure search_getting_ids is passed the current search restriction value from self.gui.tags_view.search_restriction

No, this wasn't a new bug. It was an old one that had always been there for ISBNs, and copying and pasting that approach made it apply to binary comparisons as well.

I don't know what you mean by results tomorrow. You don't mean running it overnight, do you? The binary search of 8000 books should complete pretty quickly. On my system it did 75,000 formats from 40,000 books in 4 minutes (about 65 seconds on a re-run).
Old 04-30-2011, 06:54 PM   #227
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
Well, that is what I meant.

I'm going to sleep now.
It's still running, but my NAS is overloaded and I'm watching a show on the Internet, so it's going very slowly.

If it takes too long (I will know by tomorrow) I will report it.
Have a good night.
Old 05-01-2011, 04:26 AM   #228
chaley
Grand Sorcerer
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Originally Posted by kiwidude View Post
This is a preview before an official 1.1 release. I would greatly appreciate it if a few people could give it a sanity check before I release it in the plugins forum thread. There have been a significant number of internal changes so I would rather get any problems found here first
Looks good. Certainly haven't been able to break it yet.

Two thoughts:
- I suggested that you use delete_first = True when calling add_multiple_custom_book_data. After reading the posts about using a restriction on binary matching, I am no longer convinced I was right. If I run the check over my entire library, all values are cached. If I then run it over a restricted set, all values are thrown away except the small set. Next time I run it over the full library, all must be recomputed. This seems suboptimal. The thought in using the delete first was to clean the cache, but I think I just demonstrated that we really don't want to do that.

- The author match algorithm seems to operate only on first authors. For fun I duplicated a book then swapped the author names, and it isn't found as a dup. Should it, or should it compare author-by-author? Continuing in the same vein, if I am doing an author-only dup check, shouldn't I be looking only at authors and shouldn't each author be taken into account?

For example, assume I have two books titled Xyzzy, one by "Blogs, Joe & Angstrom, Alice" and one by "Angstrom, Alice & Blogs, Joe". In one school of thought, these books have identical titles and identical authors, yet no title+author search option will find them. This could be fixed by creating an entry in the candidates map for each author, not just the first author. Note that if this is done, the algorithm would decide that Xyzzy by "A, B & C, D" matches Xyzzy by "C, D", which I think is the right answer.
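The per-author idea above can be sketched roughly as follows. This is only an illustration, not the plugin's code: the `(book_id, title, authors)` tuples, `build_candidates` and `duplicate_groups` are hypothetical names, and plain lower-casing stands in for whatever title/author matching the plugin really applies.

```python
from collections import defaultdict


def build_candidates(books):
    """Map (title, author) keys to the set of book ids sharing that key.

    Hypothetical input: each book is (book_id, title, list_of_authors).
    One entry is created per author, not only for the first author.
    """
    candidates = defaultdict(set)
    for book_id, title, authors in books:
        for author in authors:
            # Lower-casing is a stand-in for the real matching rules.
            key = (title.strip().lower(), author.strip().lower())
            candidates[key].add(book_id)
    return candidates


def duplicate_groups(candidates):
    """Keep only the buckets that hold more than one book."""
    return {key: ids for key, ids in candidates.items() if len(ids) > 1}


books = [
    (1, "Xyzzy", ["Blogs, Joe", "Angstrom, Alice"]),
    (2, "Xyzzy", ["Angstrom, Alice", "Blogs, Joe"]),
    (3, "Xyzzy", ["Angstrom, Alice"]),
]
groups = duplicate_groups(build_candidates(books))
for key, ids in sorted(groups.items()):
    print(key, sorted(ids))
```

With these three made-up books, the two books with reordered authors land in the same buckets, and the single-author book joins the bucket for the author it shares.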

Another example: again I have two books, one by "Blogs, Joe & Angstrom, Alice" and another by "Anngstrom, Alice". I do a search by author, ignoring title. I would expect a soundex search to put these two books into a group because they share an author that matches soundex but is spelled differently. This is a different case than above, because I am explicitly looking for authors; order should be irrelevant. Like above, this can be fixed by creating an entry in the candidate map for each author, not just the first author.
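For the author-only case, a rough sketch could look like the one below. It assumes a textbook American Soundex applied to the whole author string, which is very likely not the exact soundex variant the plugin uses; `soundex`, `author_groups` and the `(book_id, authors)` tuples are hypothetical.

```python
from collections import defaultdict

# Textbook Soundex digit table (an assumption, not the plugin's table).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name, length=4):
    """Return a Soundex code such as 'A523' for the letters in name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        # 'h' and 'w' do not separate letters that share a code.
        if ch not in "hw":
            prev = code
    return (result + "000")[:length]


def author_groups(books):
    """Bucket book ids by the Soundex code of every author, not just the first."""
    buckets = defaultdict(set)
    for book_id, authors in books:
        for author in authors:
            buckets[soundex(author)].add(book_id)
    return {code: ids for code, ids in buckets.items() if len(ids) > 1}


books = [
    (1, ["Blogs, Joe", "Angstrom, Alice"]),
    (2, ["Anngstrom, Alice"]),
]
groups = author_groups(books)
print(groups)
```

Because the second author of book 1 gets its own bucket entry, it meets the misspelled author of book 2 under the same code, regardless of author order.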
Old 05-01-2011, 05:26 AM   #229
drMerry
Another function that would be useful:

Author/Title - swap
By binary comparison I found a duplicate book I would not have found with the other functions:
Book 1:
Anna Karenina by Leo Tolstoy
Book 2:
Leo Tolstoy by Anna Karenina

I also found 2 other books with this problem (both were books with a name as the title, so it was not that obvious at a quick look).
Old 05-01-2011, 05:27 AM   #230
drMerry
When I reran the binary search with other network- and CPU-consuming programs closed, it ran fast.
So your plugin has no speed issues (I did not think it had, but just to be sure...).
Old 05-01-2011, 05:38 AM   #231
kiwidude
Hi Charles,

I actually wrote you an email asking you about the delete of the cache parameter. I didn't send it because I figured there must be something I was missing again and thought I had met my dumb question quota for the day. I guess I should have looked closer, haha.

The authors point is an interesting one. One issue with having multiple author cache entries, which I didn't see you mention, is replication of groups. So if I have a duplicate title with these variations of authors...

A, B & C, D
C, D & A, B

I will have two buckets. Both will have the same titles in them. There could be variations of this "issue" depending on how many other A, B matches or C, D matches there are so the buckets might be unbalanced.

I guess the question is whether this is a problem or not. If the user resolves their duplicates in order, the second group, if identical, would disappear automatically. If they skip through them with highlighting, it may jump around a bit but still be valid. If they added exemptions using mark all groups, it would create some duplication, I think, but not a major drama.
Old 05-01-2011, 05:47 AM   #232
kiwidude
@drMerry, glad to hear you are happy with the performance. Just don't run the binary search with a restriction until I push a new version which doesn't delete your cache values.
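The cache problem with a restriction can be illustrated with a toy model. This is not calibre's API: `store_hashes` is a hypothetical stand-in for a call like add_multiple_custom_book_data with its delete_first flag, and a plain dict plays the role of the stored per-book values.

```python
def store_hashes(cache, new_values, delete_first):
    """Toy stand-in for a per-book data store (not calibre's real API)."""
    if delete_first:
        # Everything cached earlier is thrown away before the new write.
        cache.clear()
    cache.update(new_values)


# Made-up hash values for a five-book library and a two-book restriction.
full_library = {book_id: "hash-%d" % book_id for book_id in range(1, 6)}
restricted = {2: "hash-2", 3: "hash-3"}

# Full run followed by a restricted run, deleting first each time.
cache = {}
store_hashes(cache, full_library, delete_first=True)
store_hashes(cache, restricted, delete_first=True)
lost = sorted(set(full_library) - set(cache))
print("must be recomputed:", lost)

# Same two runs without deleting first: the earlier values survive.
kept = {}
store_hashes(kept, full_library, delete_first=False)
store_hashes(kept, restricted, delete_first=False)
print("still cached:", sorted(kept))
```

In the first pair of runs only the restricted books remain cached, so the next full-library run has to recompute the rest.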

In terms of title author swap I will give the same answer I gave previously. I don't see it as a find duplicates thing to address, for the same reasons that having series info in the title shouldn't be matched. What if you only have one book, regardless of which way it is flipped?

As I think I said previously I see this as something a plugin like quality check could try to check for, though without other matches for the author name it is hard to identify.
Old 05-01-2011, 05:56 AM   #233
drMerry
Quote:
Originally Posted by kiwidude View Post
In terms of title author swap I will give the same answer I gave previously. I don't see it as a find duplicates thing to address, for the same reasons that having series info in the title shouldn't be matched. What if you only have one book, regardless of which way it is flipped?
I do not get your point here.
Only one book ... would never be a duplicate (or always, since book A == book A).

The purpose of this plugin is to check for books that are in your library twice.
The purpose of the quality check is to check individual books (in my opinion...).

So the previous point could not be found with quality check.

You would not want to check a single book for a possible author swap. (What would be your reason to tell me Leo Tolstoy by Anna Karenina is swapped?)

If there are 2 books with the same (or similar) name and title but swapped, you have a good indication of a mistake.
So this option is only possible by testing several books against each other, which is something dupli-search does, but quali shouldn't.

All my opinion, of course.
Old 05-01-2011, 05:57 AM   #234
drMerry
I see I used your last part to state my 'truth'.

Quote:
As I think I said previously I see this as something a plugin like quality check could try to check for, though without other matches for the author name it is hard to identify.
Old 05-01-2011, 06:09 AM   #235
kiwidude
The point of mine you are missing is that books whose title and author are swapped and which are also duplicates of each other are only a subset (possibly of size zero) of the books in your library that have title and author swapped.

So properly solving this problem requires a more comprehensive, multi-pronged approach than just looking for a duplicate title.
Old 05-01-2011, 06:12 AM   #236
chaley
Quote:
Originally Posted by kiwidude View Post
The authors point is an interesting one. One issue which I didn't see you mention of having multiple author cache entries is replication of groups.
You are right. Hmmm...
Quote:
I guess the question is whether this is a problem or not. If the user resolves their duplicates in order, the second group if identical would disappear automatically. If they skip through them with highlighting it may jump around a bit but still be valid.
Thinking on paper and (I think) agreeing with you: what you are saying is that adding a book to multiple buckets can create a situation where one group is a (possibly improper) subset of another. It seems to me that there isn't much point in showing both groups, at least in author mode. For example, why show a group containing (1,2,3) and another containing (2,3)?

Subsets can be removed rather easily, with performance that should be acceptable if there aren't thousands of groups. Something like this:
Spoiler:
Code:
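def clean_dup_groups(dups):
    # Work on sets so that subset tests are order-independent.
    res = [set(d) for d in dups]
    # Smallest groups first: any group containing a sorts at or after it.
    res.sort(key=len)
    ans = []
    for i, a in enumerate(res):
        # Look for a later (equal-sized or larger) group that contains a.
        for b in res[i+1:]:
            if a.issubset(b):
                break
        else:
            # No containing group was found, so a is kept.
            ans.append(a)
    return ans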


dups = [(1,2,3),(4,5)]
print dups
print clean_dup_groups(dups)

print '========================'
dups = [(1,2,3,4,5), (1,6,7)]
print dups
print clean_dup_groups(dups)

print '========================'
dups = [(1,2,3,4,5), (1,6,7), (1,6,7)]
print dups
print clean_dup_groups(dups)

print '========================'
dups = [(1,2,3,4,5), (1,6,7), (3,4), (6,7)]
print dups
print clean_dup_groups(dups)


with output:
[(1, 2, 3), (4, 5)]
[set([4, 5]), set([1, 2, 3])]
========================
[(1, 2, 3, 4, 5), (1, 6, 7)]
[set([1, 6, 7]), set([1, 2, 3, 4, 5])]
========================
[(1, 2, 3, 4, 5), (1, 6, 7), (1, 6, 7)]
[set([1, 6, 7]), set([1, 2, 3, 4, 5])]
========================
[(1, 2, 3, 4, 5), (1, 6, 7), (3, 4), (6, 7)]
[set([1, 6, 7]), set([1, 2, 3, 4, 5])]

Quote:
If they added exemptions using mark all groups it would create some duplication I think but not a major drama.
Again, I don't see a reason to keep exemption groups that are subsets of another group. The same set cleanup would fix this, eliminating the subsets.
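For anyone trying the same cleanup on a current interpreter, here is a hedged Python 3 rendering of the subset-removal idea described above. It is a sketch and not the code that went into the plugin; the name `remove_subset_groups` is made up for illustration.

```python
def remove_subset_groups(groups):
    """Drop every group that is a subset of (or equal to) a later group.

    Groups are sorted by size first, so a group only needs to be compared
    against the groups that follow it in the sorted list.
    """
    ordered = sorted((set(g) for g in groups), key=len)
    kept = []
    for i, group in enumerate(ordered):
        # 'group <= other' is the subset-or-equal test on sets.
        if not any(group <= other for other in ordered[i + 1:]):
            kept.append(group)
    return kept


print(remove_subset_groups([(1, 2, 3, 4, 5), (1, 6, 7), (3, 4), (6, 7)]))
print(remove_subset_groups([(1, 2, 3, 4, 5), (1, 6, 7), (1, 6, 7)]))
```

Identical groups collapse to one copy because the earlier copy counts as a subset of the later one. The comparison is quadratic in the number of groups, which matches the caveat about thousands of groups.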
Old 05-01-2011, 06:14 AM   #237
drMerry
Well, that's right, but there is (in my opinion) no way to check all books for this.
If you have one book (say a little-known comic, all pictures) with swapped author/title and no ISBN, you cannot find metadata for it.
So you will never know whether the info is swapped unless you read it yourself.

The only way to solve some of the problems in your library is to find books that have duplicates in it with swapped author/title.

It cannot solve all your problems (none of your plugins can; they're still great) but it will solve some.
Since it is a multi-book operation and an operation to find possible duplicates, it should be in this plugin, I think.

Properly (= completely) solving this problem is not an option.
Old 05-01-2011, 06:24 AM   #238
kiwidude
Charles, glad to hear I wasn't imagining things. Totally agree that pruning the groups is the right thing to do, I was just thinking out loud of the impact of not handling it.
Old 05-01-2011, 06:29 AM   #239
chaley
Quote:
Originally Posted by drMerry View Post
Another function that would be useful:

Author/Title - swap
By binary comparison I found a duplicate book I would not have found with the other functions:
Book 1:
Anna Karenina by Leo Tolstoy
Book 2:
Leo Tolstoy by Anna Karenina

I also found 2 other books with this problem (both were books with a name as the title, so it was not that obvious at a quick look).
Putting aside the function's desirability (I wouldn't use it), computationally the only way I see to do this is to create a temporary 'book' (a candidate in the plugin's terminology) for each book in the search set by swapping the authors & title, immediately doubling the size of the candidates set. If multiple authors are taken into account, then the size is increased even more. Assuming no multiple authors, then a 40,000-book library becomes 80,000. That number is getting scary-large, perhaps crossing the performance threshold (the tables go to VM and start page thrashing) or even faulting because the plugin runs out of memory.

Edit: it is worse than I thought. The author names must be added in both FN LN and LN, FN order, because we don't know the order of any 'names' in the title. That means 3 entries per book, not two.

Yes, one can say "don't do that", but we have hard experience that people will do 'unreasonable' things because they seem perfectly reasonable to them. Then kiwidude must deal with an unhappy user. Up to him if he wants to take on that burden.
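To make the three-entries-per-book cost concrete, here is a rough sketch of the swap approach for single-author books. Everything in it is hypothetical (`name_orders`, `candidate_keys`, `find_swapped`, and the `(book_id, title, author)` tuples), and lower-casing again stands in for the plugin's real matching.

```python
from collections import defaultdict


def name_orders(author):
    """Return the author as both 'FN LN' and 'LN, FN' (hypothetical helper)."""
    author = author.strip()
    if "," in author:
        last, first = [part.strip() for part in author.split(",", 1)]
    elif " " in author:
        first, last = [part.strip() for part in author.rsplit(" ", 1)]
    else:
        # Single-word names have only one ordering.
        return [author]
    return ["%s %s" % (first, last), "%s, %s" % (last, first)]


def candidate_keys(title, author):
    """Keys for one book: as catalogued, plus swapped in both name orders."""
    keys = [(title, author)]
    for name in name_orders(author):
        # Swapped entry: the author name sits in the title slot.
        keys.append((name, title))
    return [(t.strip().lower(), a.strip().lower()) for t, a in keys]


def find_swapped(books):
    """Group book ids whose keys collide; identical groups are merged."""
    buckets = defaultdict(set)
    for book_id, title, author in books:
        for key in candidate_keys(title, author):
            buckets[key].add(book_id)
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}


books = [
    (1, "Anna Karenina", "Leo Tolstoy"),
    (2, "Leo Tolstoy", "Anna Karenina"),
    (3, "Xyzzy", "Joe Blogs"),
]
print(len(candidate_keys("Anna Karenina", "Leo Tolstoy")))
print(find_swapped(books))
```

Each ordinary book contributes three keys here, so the candidate map grows to roughly three times the library size before any multi-author handling is added.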

Last edited by chaley; 05-01-2011 at 08:35 AM.
Old 05-01-2011, 06:31 AM   #240
kiwidude
drMerry - I agree there is no 100% guaranteed way. However, I utterly disagree that looking for duplicates is the only or best way, for the reasons I keep repeating: there will be many books with title/author swapped which could be found very easily by non-duplicate approaches. And this will solve a far higher percentage of the problems than trying to bodge it into find duplicates would. When I find the time it will appear in quality check, which IMHO is a more appropriate place for it.

Will there still be edge cases that it cannot find? Absolutely. But I can guarantee that it will find many more cases than putting it in this plugin would.