Register Guidelines E-Books Today's Posts Search

Go Back   MobileRead Forums > E-Book Software > Calibre > Development

Notices

Reply
 
Thread Tools Search this Thread
Old 04-28-2011, 12:55 PM   #181
kiwidude
Calibre Plugins Developer
kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Ok, today's pop quiz question - who can offer me an efficient file comparison algorithm?

I've tried a first pass of finding books with the same size, and then a second pass using the sha256 hash. However this has two problems - (a) it is still pretty darn slow for large libraries (around 4.5 minutes to scan a 40,000 book library with a fair few formats), and (b) after all that it still isn't "accurate" enough, returning a bunch of duplicates which really aren't, they just "hash" together.

Suggestions on a postcard please
kiwidude is offline   Reply With Quote
Old 04-28-2011, 01:00 PM   #182
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.
 
kovidgoyal's Avatar
 
Posts: 45,374
Karma: 27230406
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Use a composite hash (filesize, sha512) only calculate the latter when filesizes differ.
kovidgoyal is online now   Reply With Quote
Old 04-28-2011, 01:04 PM   #183
kiwidude
Calibre Plugins Developer
kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
That is what I am doing currently (well except with sha256). I just tried it with sha512 and that made no difference on my smaller test library anyways in terms of the false positives.

I ran a third party product (Duplicate Cleaner) to see what it's performance was like. Maybe 4.5 minutes isn't so bad after all.

Perhaps I now just need to add a third pass to compare file streams - is there an efficient way of doing that?
kiwidude is offline   Reply With Quote
Old 04-28-2011, 01:07 PM   #184
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.
 
Posts: 12,447
Karma: 8012886
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Are you saying that you have several cases where for two files a & b, a != b AND size(a) == size(b) AND SHA256(a) == SHA256(b)? This is very surprising.

You might try using multiple hashes when you have a size/SHA256 collision. It should be very rare that SHA256 and MD5 both collide on the same 2 files.
chaley is offline   Reply With Quote
Old 04-28-2011, 01:10 PM   #185
kiwidude
Calibre Plugins Developer
kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Yup, that is exactly what I am saying 64 cases of it in fact in the large library.

I'll try multiple hashes. I wonder what is faster - reading a larger number of files once and computing two hashes, or having the third pass work on the smaller subset but having to read the files again...
kiwidude is offline   Reply With Quote
Old 04-28-2011, 01:17 PM   #186
mehetabelo
e-Bibliophile
mehetabelo began at the beginning.
 
mehetabelo's Avatar
 
Posts: 60
Karma: 10
Join Date: Jun 2009
Location: California
Device: Paperwhite 1-3, Kobo AuraHD, Boox Afterglow2
One of the things that may be useful to people adding large libraries, is a scan which checks the 'author' as a 'title' and vice-versa. When adding thousands of ebooks from a library verifying if the author and title match up correctly is not always an easy task.

If your duplicate detector checks this, it can help clean up what may be a duplicate with flipped author/title metadata, allowing for easier cleanup later.

Personally, to help avoid duplication, I generally now work with multiple databases, using one as a 'cleanup' database, which basically means it has been dumped in without review. I've added extra columns for quality, origination and a 'notes' column if I need to make remarks that are not intended for the general book comments. This makes sorting easier. Once basic cleanup has been done and there's no duplicates that I can find during this process, they're moved into the cleaned database.

Using the quality column (basically a ratings column for the actual publishing quality of the book) makes the duplicate detector function even easier to use. Personally if I know a book has a 5 star rating (retail format, with full ToC, title page, good cover...) then filtering through duplicates goes much quicker.

Anyway, I've gone far beyond the original thoughts/suggestions.

On a side note (possible bug), I didn't fully read every comment in this and the newer thread in the plugins area, so I'm not sure if this has been noted. I have noticed that I cannot get the shortcut key to work to for the find duplicates. I assigned a key, several different ones just to verify it wasn't the key combination, using both alt and ctrl as a primary key, but to get the 'Find Duplicates' menu to pop up does not work. This is with the older .6(?) version and the newer 1.0. The shortcut keys to move forward and backward do work as intended, so I simply have to use the button instead of a shortcut key to bring up the primary menu.

Thank you for the plugin. It's worked wonders for helping me clean up my databases.
mehetabelo is offline   Reply With Quote
Old 04-28-2011, 01:26 PM   #187
kiwidude
Calibre Plugins Developer
kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
@mehetabelo:
I hadn't given any thought to title/author swapping but it is an interesting issue you raise. I guess the question is whether that is a function of a duplicate checking plugin or whether some other plugin should try to detect this (like Quality Check).

Certainly using a "staging database" towards a clean database is an approach I've heard a number of people taking, your approach sounds very diligent

Re the keyboard shortcuts - I probably forgot to wire them in . I will take a look.
kiwidude is offline   Reply With Quote
Old 04-28-2011, 01:40 PM   #188
mehetabelo
e-Bibliophile
mehetabelo began at the beginning.
 
mehetabelo's Avatar
 
Posts: 60
Karma: 10
Join Date: Jun 2009
Location: California
Device: Paperwhite 1-3, Kobo AuraHD, Boox Afterglow2
The quality check, which I use until one of the recent updates (the right-click menu was driving me nuts and I couldn't figure out how to turn it off, plus 'disable' doesn't work even with a clean calibre install) so I uninstalled it temporarily while I'm working with a larger database, doing some general large dumps of ebooks. While the QC plugin is nice for this, the database has 30k ebooks and thousands upon thousands have multiple problems.

The reason I believe it is easier to use the duplicate detector first is because, while QC is nice, if you discover that the book you just 'fixed' is a duplicate, you spent a few minutes working on editing a books metadata that was unnecessary, and even in the case of finding books which QC would find, if you have two books, and one has good metadata including isbn, comments, series... and a second book with just a wrong author/title (because they're switched) and nothing else, but the second book is a better format for one reason or another, it's easier to remove all formats from the first instance, with correct metadata and drop the second book into the first with merge. In the long run, this makes it much quicker to clean up a database by skipping over steps (like metadata cleanup) when they are already unnecessary for particular books.

I'm not demanding it be done, I've done a little programming myself years ago, and discovered that while I could do it, it wasn't to be my area of expertise. From that experience though, I realize that while some people enjoy programming, it's not necessarily fun (it can be in some cases when actually getting something to work that's driven you nuts) or quick. I just wanted to note that it might make cleanup an easier job if that was an option.

edit: I just reread your post and realized you meant possibly having QC check if the author/title were reversed, and though my above statement still hold true (in my opinion) it is an option to use QC to do this. Still, I would think based on how the duplicate detector works, it would be better to do it withing this plugin and not QC, it would bring together any possible duplicates and make it simpler to decide which one to delete (and which has the better metadata already)

Last edited by mehetabelo; 04-28-2011 at 01:46 PM.
mehetabelo is offline   Reply With Quote
Old 04-28-2011, 01:43 PM   #189
kiwidude
Calibre Plugins Developer
kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Re Quality Check - any issues you have you should post in it's thread, it just so happens I might know the guy who wrote it. It sounds like though you just wanted it out of your right-click menu (I just keep it on the toolbar myself). You can do this in Preferences->Toolbar and place it wherever you like. And no you cannot "disable" a UI plugin - you either remove it from your menu/toolbar or uninstall it as you did.
kiwidude is offline   Reply With Quote
Old 04-28-2011, 01:52 PM   #190
kiwidude
Calibre Plugins Developer
kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Right, so what am I doing wrong - computing sha256+md5 gives me exactly the same number of results. I've printed the hashes and paths out to make sure I am not doing anything stupid like the "wrong" file etc. I've opened the docs up in the duplicate group and they have exactly the same file size - but definitely not the same books.

74950 formats scanned resulted in 3195 size collisions (of ??? formats)
Second pass resulted in 64 sha256+md5+size collisions of 134 formats

Here is the hashing code that takes place in the second pass (on the set of books with size collisions)...
Code:
    def find_candidate_by_hash(self, book_id, path, size, candidates_map):
        try:
            content = file(path).read()
            sha = hashlib.sha256()
            sha.update(content)
            md5 = hashlib.md5()
            md5.update(content)
            hash = '%s%s%d' % (sha.hexdigest(), md5.hexdigest(), size)
            candidates_map[hash].add(book_id)
        except:
            traceback.print_exc()
            pass

Last edited by kiwidude; 04-28-2011 at 01:59 PM. Reason: Fix my numbers
kiwidude is offline   Reply With Quote
Old 04-28-2011, 02:05 PM   #191
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.Starson17 can program the VCR without an owner's manual.Starson17 can program the VCR without an owner's manual.Starson17 can program the VCR without an owner's manual.Starson17 can program the VCR without an owner's manual.Starson17 can program the VCR without an owner's manual.Starson17 can program the VCR without an owner's manual.Starson17 can program the VCR without an owner's manual.Starson17 can program the VCR without an owner's manual.Starson17 can program the VCR without an owner's manual.Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kiwidude View Post
Yup, that is exactly what I am saying 64 cases of it in fact in the large library.
Something is horribly wrong here.

This is the birthday collision problem, and assuming uniform distribution of the hash over the 0..2^256 number space (10^77 unique hashes), you'd need about 10^58 books in your library to have a 50:50 chance of one SHA256 hash collision. With 10^29 books in your library, the chance of one collision is down to one in 10^18.
Starson17 is offline   Reply With Quote
Old 04-28-2011, 02:13 PM   #192
kiwidude
Calibre Plugins Developer
kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,730
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
I guess I have a lot of birthdays...

I think I may have an idea. It seems that every single collision group comes with .doc files.

Perhaps it is a problem with the way I read the file content? Or maybe those algorithms get fooled by .doc content?
kiwidude is offline   Reply With Quote
Old 04-28-2011, 02:20 PM   #193
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.
 
kovidgoyal's Avatar
 
Posts: 45,374
Karma: 27230406
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
kiwidude: You can use a tuple directly as a key for a dict, you dont have to convert to a string.

Try verifying that the hashes are actually the same for your different files with a different hashing tool, just to ensure there isn't a bug with the hashlib library (although I find that rather unlikely).

If they are indeed hash collisions, then you can have a final check that compares reported duplicates byte-to-byte. Since there are very few of these, it shouldn't be slow.
kovidgoyal is online now   Reply With Quote
Old 04-28-2011, 02:20 PM   #194
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.chaley ought to be getting tired of karma fortunes by now.
 
Posts: 12,447
Karma: 8012886
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
The problem is the file(xxx).read(). That is opening the file in text mode, not binary mode. The read is finding a ^Z, which is end of file. The difference between binary and text mode is one of the silliest things that MS ever did.

Use something like
Code:
handle = open(filename, 'rb')
content = handle.read()
handle.close()
chaley is offline   Reply With Quote
Old 04-28-2011, 02:23 PM   #195
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.
 
kovidgoyal's Avatar
 
Posts: 45,374
Karma: 27230406
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
chaley: Good catch. Better code is

Code:
with open(path, 'rb') as f:
    content = f.read()
kovidgoyal is online now   Reply With Quote
Reply


Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Duplicate Detection Philosopher Library Management 114 09-08-2022 07:03 PM
[GUI Plugin] Plugin Updater **Deprecated** kiwidude Plugins 159 06-19-2011 12:27 PM
Duplicate Detection albill Calibre 2 10-26-2010 02:21 PM
New Plugin Type Idea: Library Plugin cgranade Plugins 3 09-15-2010 12:11 PM
Help with Chapter detection ubergeeksov Calibre 0 09-02-2010 04:56 AM


All times are GMT -4. The time now is 11:18 AM.


MobileRead.com is a privately owned, operated and funded community.