01-01-2013, 09:29 AM   #359
sethcohn
Junior Member
Posts: 6
Karma: 10
Join Date: Jun 2005
Redirected here from https://www.mobileread.com/forums/sho...d.php?t=201256

Reading this thread, it seems folks have asked for a 'binary close' match before, and that request has been rejected. To be clear, I'm asking for a binary-identical match EXCEPT for certain files in the book (i.e. the metadata-related ones: UUID, Calibre-specific data, etc.).

Looking over the code in Find Duplicates, this seems nontrivial to me, but Kovid thinks otherwise. You can't hash the entire file; you have to hash the file minus the metadata and any other excluded items. I'm not a Python or Calibre wiz, so I'm not sure how much work this would take.
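To make the idea concrete, here is a rough sketch of what "hash the file minus the excluded parts" could look like. It is not taken from Calibre or from the Find Duplicates plugin. It assumes the book is a ZIP-based format such as EPUB, and the exclusion rules (`EXCLUDED_SUFFIXES`, `EXCLUDED_NAMES`) and the function names are my own guesses for illustration. Dropping the whole `.opf` is deliberately crude, since that file also holds the manifest and spine, and other files may embed the identifier too.

```python
import hashlib
import zipfile

# Hypothetical exclusion rules, chosen for illustration only.
EXCLUDED_SUFFIXES = (".opf",)
EXCLUDED_NAMES = {"META-INF/calibre_bookmarks.txt"}


def is_excluded(name):
    """Return True for members that should not count toward the hash."""
    return name in EXCLUDED_NAMES or name.lower().endswith(EXCLUDED_SUFFIXES)


def content_hash(book):
    """Hash the non-excluded members of a ZIP-based book.

    `book` is a filesystem path or a binary file-like object.
    """
    digest = hashlib.sha256()
    with zipfile.ZipFile(book) as zf:
        # Sort names so the order of entries in the archive does not matter.
        for name in sorted(zf.namelist()):
            if name.endswith("/") or is_excluded(name):
                continue
            # zf.read() returns the uncompressed bytes, so ZIP timestamps
            # and compression settings do not affect the result.
            data = zf.read(name)
            encoded = name.encode("utf-8")
            # Length-prefix name and data so member boundaries are unambiguous.
            digest.update(len(encoded).to_bytes(8, "big"))
            digest.update(encoded)
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()
```

Two books would then count as duplicates when `content_hash(a) == content_hash(b)`. Because the hash is computed over the uncompressed member contents rather than the container bytes, only differences inside the files that are kept can change it.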

An example might help here. Two files, both converted from the same source material with identical conversion settings, but at different times and perhaps with different versions of Calibre, will come out _close_ to identical yet fail a binary dupe check because of the UUID, the timestamps, the Calibre version, maybe a Calibre bookmark file, and so on. A function that identifies _these_ as duplicates _would_ be useful.

If the files were converted using different settings (say, one has split HTML inside and the other doesn't), they're not identical and should be looked at manually; I agree with the past discussions on that. But in this case (and I've got a lot of these), the files are identical in every way that matters, yet fail the binary test due to factors I can't control. Even rebuilding them into new books won't help, because the UUIDs and timestamps will still differ. Even if I build the same book twice in a row as two different books, they should be flaggable as identical but aren't: the timestamps in the metadata differ, so the hashes differ, even when the UUID is the same.
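For anyone who wants to check whether two of their own near-identical files really differ only in the metadata, a quick way is to compare the archives member by member. This is only a sketch under the same assumption as before (a ZIP-based format such as EPUB); `differing_members` is a made-up helper name, and it relies on the CRC-32 values stored in the ZIP directory, which is a fast check rather than a collision-proof one.

```python
import zipfile


def differing_members(book_a, book_b):
    """Return the sorted names of members that are missing from one
    archive or whose stored CRC-32 differs between the two."""

    def crc_map(book):
        with zipfile.ZipFile(book) as zf:
            # ZipInfo.CRC is the CRC-32 of the uncompressed member data.
            return {
                info.filename: info.CRC
                for info in zf.infolist()
                if not info.is_dir()
            }

    a = crc_map(book_a)
    b = crc_map(book_b)
    return sorted(
        name for name in a.keys() | b.keys() if a.get(name) != b.get(name)
    )
```

If the list that comes back contains only metadata-type files, the two books are the kind of pair I'd want flagged as duplicates; if it also lists content files, they need a manual look.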