Old 03-22-2020, 09:57 AM   #1
nurbles62
Junior Member
Posts: 9
Karma: 10
Join Date: Jun 2018
Device: Kindle for Samsung Tablets
Duplicate Detection for "Add Books" is too weak

calibre's duplicate detection when adding books is far, far too weak to ever allow it to automatically ignore or merge things it flags as duplicates. It appears to me that only the titles are considered, but I'm not convinced it is performing an exact match on the titles, either.

Regardless, far, far too many short titles are used by different authors for different books, and "Add Books" always marks them INCORRECTLY as duplicates.

I wish we could have at least a couple of simple options to control this (so that folks who like it as it is [if there are any] can keep it): an "include author match" option would be the primary new feature and would resolve most issues.

To be fancier, ways to handle multiple authors, title patterns, and different actions for exact matches and "near"/"possible" matches would all be nice.

Or maybe there is already a plug-in that repairs this functionality? Is that even possible? If the developer(s) are reading this, please, PLEASE consider improving this feature!
Old 03-22-2020, 10:04 AM   #2
kovidgoyal
creator of calibre
Posts: 45,595
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Every time you add a book, the entire library has to be scanned for duplicates. That is waaaaay too slow for large libraries if the algorithm is made more flexible. Duplicate detection is not going to be made stronger. Simply add the duplicates and use the duplicate finder plugin if you need better algorithms.
Old 03-22-2020, 10:11 AM   #3
theducks
Well trained by Cats
Posts: 31,240
Karma: 61360164
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Things like this are why many of us use an INTAKE Library.
1) We beat the metadata into shape. Refine the tags.
2) Do any other cleaning tasks.
3) Use the Find Duplicates plugin and run the Find Library Duplicates option against the destination Library.
Old 03-22-2020, 01:35 PM   #4
DNSB
Bibliophagist
Posts: 47,921
Karma: 174315098
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
Quote:
Originally Posted by theducks View Post
Things like this are why many of us use an INTAKE Library.
1) We beat the metadata into shape. Refine the tags.
2) Do any other cleaning tasks.
3) Use the Find Duplicates plugin and run the Find Library Duplicates option against the destination Library.
↑ ↑ ↑

What theducks said. I suspect we've all had a new ebook where the embedded information from the publisher is something like "Star Paths: A novel of space colonization" by Herbert A Patrick, and after you clean up the metadata and run Find Duplicates, you find you already have a duplicate called "Star Paths" by Herbert A. Patrick. You could also use the Find Duplicates plugin's various search options to find the book without cleaning up the metadata.
Old 03-22-2020, 04:20 PM   #5
BetterRed
null operator (he/him)
Posts: 22,001
Karma: 30277294
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Quote:
Originally Posted by theducks View Post
Things like this are why many of us use an INTAKE Library.
1) We beat the metadata into shape. Refine the tags.
2) Do any other cleaning tasks.
3) Use the Find Duplicates plugin and run the Find Library Duplicates option against the destination Library.
↑ ↑ ↑ ✔

When a duplicate is found in step 3, you may want to see the book that exists in the destination library, i.e. its cover, metadata and format files. You can do that via the calibre-spy plugin which provides read-only access to calibre libraries.

BR
Old 03-27-2020, 12:48 PM   #6
nurbles62
Quote:
Originally Posted by kovidgoyal View Post
Every time you add a book, the entire library has to be scanned for duplicates. That is waaaaay too slow for large libraries if the algorithm is made more flexible. Duplicate detection is not going to be made stronger. Simply add the duplicates and use the duplicate finder plugin if you need better algorithms.
May I ask why it is so slow? All of the strings that need to be checked appear to already be in memory (the title and authors are in the list of books in the library, after all) and a binary search for the two fields would not seem to be particularly slow, since uniques should fall out quickly. Would adding an option to include the author(s) in the compare really make it much slower -- after all, you would only even GET to the author compare AFTER the title was found to be a duplicate.

For reference, I've been a programmer for a little over 40 years and I need to do something similar [I think] fairly often, and as long as everything's in memory it can be done pretty quickly. In fact, by sorting the test cases, too, quite a lot of the initial searching can also be minimized.
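
For illustration only, here is a minimal sketch of the kind of check described in this post: keep (title, authors) pairs sorted in memory, binary-search for the title, and compare authors only once a title matches. It is not calibre's code. The `library` data, the `is_duplicate` name and the lower-casing are all assumptions invented for the example.

```python
from bisect import bisect_left

# Hypothetical in-memory library: (title, authors) pairs, lower-cased and sorted.
library = sorted([
    ("star paths", ("herbert a. patrick",)),
    ("the gift", ("author one",)),
    ("the gift", ("author two",)),
])


def is_duplicate(title, authors, books=library):
    """Exact-match check: binary search on the title, then compare authors."""
    key = title.lower()
    wanted = tuple(a.lower() for a in authors)
    # (key,) sorts before every (key, authors) tuple, so bisect_left lands
    # on the first entry with this title, if there is one.
    i = bisect_left(books, (key,))
    while i < len(books) and books[i][0] == key:
        if books[i][1] == wanted:
            return True
        i += 1
    return False
```

Note that this only catches exact matches after lower-casing. A title that carries a subtitle, or an author spelled with different initials or punctuation, falls straight through.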
Old 03-27-2020, 03:55 PM   #7
DNSB
Quote:
Originally Posted by nurbles62 View Post
Would adding an option to include the author(s) in the compare really make it much slower -- after all, you would only even GET to the author compare AFTER the title was found to be a duplicate.
Sadly, using the author in the comparison when adding books would not be all that useful, given how often authors' names are gibbled. As an example, how would you handle an author whose name is Henry Beam Piper, best known as H. Beam Piper, when ebook creators have used H Beam Piper, H. Beam Piper, Henry B Piper, Henry B. Piper, H B Piper, HB Piper, H. B. Piper and H.B. Piper? This is one of the reasons that I import books into an intake library, where I edit the metadata, check for duplicates and so forth before moving them to my main library.
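
To make the problem concrete, here is a small sketch that runs the spellings listed above through a naive normalizer (drop periods, lower-case, collapse spaces). The `naive_key` function is a made-up example, not something calibre or the Find Duplicates plugin is known to use.

```python
# The eight spellings from the post above.
variants = [
    "H Beam Piper", "H. Beam Piper", "Henry B Piper", "Henry B. Piper",
    "H B Piper", "HB Piper", "H. B. Piper", "H.B. Piper",
]


def naive_key(name):
    """Hypothetical normalizer: periods become spaces, lower-case, collapse spaces."""
    return " ".join(name.replace(".", " ").lower().split())


# Group the spellings by their normalized key.
groups = {}
for name in variants:
    groups.setdefault(naive_key(name), []).append(name)

for key, names in sorted(groups.items()):
    print(f"{key!r}: {names}")
```

Even after normalization, the eight spellings still land under four different keys, so an exact lookup on the normalized author would miss most of them. Collapsing them further means expanding or comparing initials, which is fuzzy matching and no longer a simple key lookup.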
Old 03-27-2020, 06:45 PM   #8
BetterRed
Quote:
Originally Posted by nurbles62 View Post
May I ask why it is so slow? All of the strings that need to be checked appear to already be in memory (the title and authors are in the list of books in the library, after all) and a binary search for the two fields would not seem to be particularly slow, since uniques should fall out quickly. Would adding an option to include the author(s) in the compare really make it much slower -- after all, you would only even GET to the author compare AFTER the title was found to be a duplicate.

For reference, I've been a programmer for a little over 40 years and I need to do something similar [I think] fairly often, and as long as everything's in memory it can be done pretty quickly. In fact, by sorting the test cases, too, quite a lot of the initial searching can also be minimized.
calibre is open source; see the section "Setting up a calibre development environment" in the calibre user manual.

Kovid will always consider patches. Alternatively, you could write a File Type plugin specific to your needs; they get used when a book is added.

BR
Old 03-27-2020, 07:20 PM   #9
BetterRed
Quote:
Originally Posted by BetterRed View Post
<snip>

... or you could write a File Type plugin specific to your needs, they get used when a book is added.
On reflection, maybe a File Type plugin wouldn't be suitable; I don't think they have user interaction.

BR
Old 03-27-2020, 10:20 PM   #10
kovidgoyal
Quote:
Originally Posted by nurbles62 View Post
May I ask why it is so slow? All of the strings that need to be checked appear to already be in memory (the title and authors are in the list of books in the library, after all) and a binary search for the two fields would not seem to be particularly slow, since uniques should fall out quickly. Would adding an option to include the author(s) in the compare really make it much slower -- after all, you would only even GET to the author compare AFTER the title was found to be a duplicate.

For reference, I've been a programmer for a little over 40 years and I need to do something similar [I think] fairly often, and as long as everything's in memory it can be done pretty quickly. In fact, by sorting the test cases, too, quite a lot of the initial searching can also be minimized.
It's an O(n^2) algorithm vs. an O(1) algorithm, the latter via a hashmap of normalized title values. As a programmer of 40 years' standing, you should understand the consequences. And author names have too much variation to be able to perform a useful normalized O(1) check on them.
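
A rough sketch of the trade-off described in this post, under stated assumptions: `TitleIndex`, `normalize_title` and the 0.9 similarity threshold are hypothetical names and values chosen for illustration, and the normalization shown is not claimed to be what calibre actually does. The exact check is a single dictionary lookup. A more flexible (fuzzy) check has to compare the new title against every stored title, so adding n books to a library of similar size grows roughly quadratically.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher


def normalize_title(title):
    """Hypothetical normalization: lower-case, drop punctuation, collapse spaces."""
    return " ".join(re.sub(r"[^\w\s]", " ", title.lower()).split())


class TitleIndex:
    """Maps each normalized title to the set of book ids that carry it."""

    def __init__(self):
        self._by_title = defaultdict(set)

    def add(self, book_id, title):
        self._by_title[normalize_title(title)].add(book_id)

    def exact_candidates(self, title):
        # One hash lookup, independent of library size.
        return set(self._by_title.get(normalize_title(title), ()))

    def fuzzy_candidates(self, title, threshold=0.9):
        # Flexible matching: every stored title has to be compared.
        wanted = normalize_title(title)
        hits = set()
        for known, ids in self._by_title.items():
            if SequenceMatcher(None, wanted, known).ratio() >= threshold:
                hits |= ids
        return hits
```

For example, after `idx.add(1, "Star Paths")`, the call `idx.exact_candidates("star paths")` finds book 1 with one lookup, while `idx.exact_candidates("Star Paths: A novel of space colonization")` finds nothing because the normalized keys differ.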
Tags
add books, duplicates
