04-28-2011, 01:52 PM   #190
kiwidude
Calibre Plugins Developer
 
Right, so what am I doing wrong? Computing sha256+md5 gives me exactly the same number of results. I've printed the hashes and paths out to make sure I'm not doing anything stupid like hashing the "wrong" file. I've opened up the docs in the duplicate group and they have exactly the same file size - but they are definitely not the same books.

The first pass scanned 74950 formats and found 3195 size collisions (of ??? formats).
The second pass resulted in 64 sha256+md5+size collisions across 134 formats.

Here is the hashing code that runs in the second pass (over the set of books with size collisions)...
Code:
    # hashlib and traceback need to be imported at module level
    def find_candidate_by_hash(self, book_id, path, size, candidates_map):
        try:
            # Read the format file in binary mode so the digests cover the raw bytes
            with open(path, 'rb') as f:
                content = f.read()
            sha = hashlib.sha256(content)
            md5 = hashlib.md5(content)
            # Key on both digests plus the file size; books sharing a key are duplicate candidates
            book_hash = '%s%s%d' % (sha.hexdigest(), md5.hexdigest(), size)
            candidates_map[book_hash].add(book_id)
        except Exception:
            traceback.print_exc()
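
For reference, here is a rough sketch of how the two passes could hang together. The find_duplicate_groups wrapper, the defaultdict-based maps and the (book_id, path, size) tuples are purely illustrative, not the plugin's actual code:
Code:
import hashlib
from collections import defaultdict

def find_duplicate_groups(formats):
    # formats is an iterable of (book_id, path, size) tuples
    # First pass: bucket formats by file size - cheap, no file I/O needed
    size_map = defaultdict(set)
    for book_id, path, size in formats:
        size_map[size].add((book_id, path))

    # Second pass: hash only the formats whose size collided
    candidates_map = defaultdict(set)
    for size, entries in size_map.items():
        if len(entries) < 2:
            continue
        for book_id, path in entries:
            with open(path, 'rb') as f:
                content = f.read()
            digest = hashlib.sha256(content).hexdigest()
            candidates_map[(digest, size)].add(book_id)

    # Any hash+size key shared by more than one book id is a duplicate group
    return [ids for ids in candidates_map.values() if len(ids) > 1]

Calling find_duplicate_groups(all_formats) would then return one set of book ids per group of byte-identical files.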
