Right, so what am I doing wrong? Computing sha256+md5 gives me exactly the same number of results as before. I've printed the hashes and paths out to make sure I'm not doing anything stupid like hashing the "wrong" file. I've opened the docs in a duplicate group and they have exactly the same file size, but they are definitely not the same books.
74950 formats scanned resulted in 3195 size collisions (of ??? formats)
Second pass resulted in 64 sha256+md5+size collisions of 134 formats
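For reference, this is roughly what the first pass does: group formats by file size and keep only the sizes shared by more than one book. This is a sketch, not the actual code; format_paths and the book_id -> path mapping are made up for illustration.
Code:
import os
from collections import defaultdict

def find_size_collisions(format_paths):
    # format_paths: {book_id: path} -- made-up structure for illustration
    by_size = defaultdict(set)
    for book_id, path in format_paths.items():
        by_size[os.path.getsize(path)].add(book_id)
    # Only sizes shared by more than one book can possibly be duplicates
    return dict((size, ids) for size, ids in by_size.items() if len(ids) > 1)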
Here is the hashing code that runs in the second pass (over the set of books with size collisions)...
Code:
import hashlib
import traceback

def find_candidate_by_hash(self, book_id, path, size, candidates_map):
    try:
        # Read the format file as raw bytes so both digests cover the full content
        with open(path, 'rb') as f:
            content = f.read()
        sha = hashlib.sha256(content)
        md5 = hashlib.md5(content)
        # Composite key: two books collide only if both digests and the size match
        key = '%s%s%d' % (sha.hexdigest(), md5.hexdigest(), size)
        candidates_map[key].add(book_id)  # candidates_map is a defaultdict(set)
    except Exception:
        traceback.print_exc()
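For context, candidates_map has to be a defaultdict(set) for the .add() call above to work, and any hash key that ends up with more than one book_id is a duplicate group. A rough sketch of the driving loop (not my actual code; format_path is a made-up lookup helper):
Code:
from collections import defaultdict

def hash_size_collisions(self, size_collisions):
    # size_collisions: {size: set(book_ids)} from the first pass
    candidates_map = defaultdict(set)
    for size, book_ids in size_collisions.items():
        for book_id in book_ids:
            path = self.format_path(book_id)  # made-up path lookup
            self.find_candidate_by_hash(book_id, path, size, candidates_map)
    # Any hash key shared by more than one book_id is a duplicate group
    return [ids for ids in candidates_map.values() if len(ids) > 1]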