Right, so what am I doing wrong? Computing sha256+md5 gives me exactly the same number of results as before. I've printed the hashes and paths out to make sure I'm not doing anything stupid like hashing the "wrong" file. I've opened the docs in a duplicate group and they have exactly the same file size, but they are definitely not the same books.
74950 formats scanned resulted in 3195 size collisions (of ??? formats)
Second pass resulted in 64 sha256+md5+size collisions of 134 formats
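For reference, this is roughly what the first pass does: group formats by file size and keep only the sizes shared by more than one book. This is a sketch, not the actual code; format_paths and the book_id -> path mapping are made up for illustration.
Code:
import os
from collections import defaultdict

def find_size_collisions(format_paths):
    # format_paths: {book_id: path} -- made-up structure for illustration
    by_size = defaultdict(set)
    for book_id, path in format_paths.items():
        by_size[os.path.getsize(path)].add(book_id)
    # Only sizes shared by more than one book can possibly be duplicates
    return dict((size, ids) for size, ids in by_size.items() if len(ids) > 1)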
Here is the hashing code that runs in the second pass (over the set of books with size collisions)...
Code:
import hashlib
import traceback

def find_candidate_by_hash(self, book_id, path, size, candidates_map):
    try:
        # Read the format file as raw bytes so both digests cover the full content
        with open(path, 'rb') as f:
            content = f.read()
        sha = hashlib.sha256(content)
        md5 = hashlib.md5(content)
        # Composite key: two books collide only if both digests and the size match
        key = '%s%s%d' % (sha.hexdigest(), md5.hexdigest(), size)
        candidates_map[key].add(book_id)  # candidates_map is a defaultdict(set)
    except Exception:
        traceback.print_exc()
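For context, candidates_map has to be a defaultdict(set) for the .add() call above to work, and any hash key that ends up with more than one book_id is a duplicate group. A rough sketch of the driving loop (not my actual code; format_path is a made-up lookup helper):
Code:
from collections import defaultdict

def hash_size_collisions(self, size_collisions):
    # size_collisions: {size: set(book_ids)} from the first pass
    candidates_map = defaultdict(set)
    for size, book_ids in size_collisions.items():
        for book_id in book_ids:
            path = self.format_path(book_id)  # made-up path lookup
            self.find_candidate_by_hash(book_id, path, size, candidates_map)
    # Any hash key shared by more than one book_id is a duplicate group
    return [ids for ids in candidates_map.values() if len(ids) > 1]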