05-07-2010, 03:19 PM   #15
Giuseppe Chillem
Quote:
Originally Posted by kovidgoyal View Post
The problem isn't that files with the same hash will be different, the problem is that files with different hashes may be the same. For example they may have slightly different metadata or have stored annotations, and still be logically the same book.
You are right. Nice shot!

However, if you are in the early stage of importing books (for example, merging book collections) and you have not changed any metadata, duplicate copies are still byte-identical, so CRC32 + size gives you a practically certain hit.
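The CRC32 + size fingerprint could be computed along these lines (a minimal sketch in Python; `fingerprint` is a hypothetical helper, not part of Calibre's code):

```python
import os
import zlib

def fingerprint(path, chunk_size=1 << 20):
    """Return a (CRC32, size) pair for a file, reading it in chunks.

    Files with different fingerprints are definitely different;
    byte-identical copies always share the same fingerprint."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)  # incremental CRC32 update
    return (crc & 0xFFFFFFFF, os.path.getsize(path))
```

Note that CRC32 is not collision-proof, which is why pairing it with the file size (and only trusting it for byte-identical copies) matters.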

Thanks to your point of view, I wish to change my request a little.

Here is the target scenario:

Calibre crashes during import, and part of the files have been imported. Some of these files have metadata identical to others (I have found some CHM files all carrying the same "Generated by Unregistered Version" title). If you discard duplicates based on metadata alone, you discard false duplicates too. This actually happens; I encountered it the very first time I used Calibre.

Here is the proposal:

A two-round check: the first round is CRC32 + size, the second is the current mechanism. This would give you three lists: 1) physical duplicates, 2) physical and metadata duplicates, 3) metadata-only duplicates.

Then you ask the user: DUPLICATES FOUND, what do you want to delete? "Same physical files; same physical files + metadata; only metadata; none"
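The two-round classification could be sketched like this (a minimal sketch; `classify_duplicates` and `metadata_key` are hypothetical names standing in for Calibre's actual metadata comparison, not its real API):

```python
import os
import zlib
from collections import defaultdict

def _crc32_size(path, chunk_size=1 << 20):
    # Round 1 fingerprint: CRC32 of the bytes plus the file size.
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return (crc & 0xFFFFFFFF, os.path.getsize(path))

def classify_duplicates(paths, metadata_key):
    """Two-round duplicate check (sketch, not Calibre's actual code).

    Round 1 groups files by CRC32 + size (physical duplicates);
    round 2 groups them by a caller-supplied metadata_key function.
    Returns three sets of paths: physical-only duplicates,
    physical + metadata duplicates, and metadata-only duplicates."""
    by_bytes = defaultdict(list)
    by_meta = defaultdict(list)
    for p in paths:
        by_bytes[_crc32_size(p)].append(p)
        by_meta[metadata_key(p)].append(p)

    phys = {p for group in by_bytes.values() if len(group) > 1 for p in group}
    meta = {p for group in by_meta.values() if len(group) > 1 for p in group}
    return phys - meta, phys & meta, meta - phys
```

Files that land only in the metadata list (same title, different bytes) are exactly the "false duplicates" from the CHM example above, and the user prompt would let them survive.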

What do you think about this proposal?

Giuseppe Chillemi