05-07-2010, 03:19 PM   #15
Giuseppe Chillem
Quote:
Originally Posted by kovidgoyal View Post
The problem isn't that files with the same hash will be different, the problem is that files with different hashes may be the same. For example they may have slightly different metadata or have stored annotations, and still be logically the same book.
You are right. Nice shot!

However, if you are in the early stage of importing books (for example, merging book collections) and you have not changed any metadata, duplicate copies are still byte-identical, so CRC32 + size gives you a practically certain hit.
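The CRC32 + size fingerprint could be computed along these lines (a minimal sketch in Python; `fingerprint` is a hypothetical helper, not part of Calibre's code):

```python
import os
import zlib

def fingerprint(path, chunk_size=1 << 20):
    """Return a (CRC32, size) pair for a file, reading it in chunks.

    Files with different fingerprints are definitely different;
    byte-identical copies always share the same fingerprint."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)  # incremental CRC32 update
    return (crc & 0xFFFFFFFF, os.path.getsize(path))
```

Note that CRC32 is not collision-proof, which is why pairing it with the file size (and only trusting it for byte-identical copies) matters.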

Thanks to your point of view, I wish to change my request a little.

Here is the target scenario:

Calibre crashes during import, and part of the files have been imported. Some of these files have metadata identical to others (I have found some CHM files all carrying the same "Generated by Unregistered Version" title). If you discard duplicates based on metadata alone, you discard false duplicates too. This actually happens; I encountered it the very first time I used Calibre.

Here is the proposal:

A two-round check: the first round is CRC32 + size, the second is the current mechanism. This would give you three lists: 1) physical duplicates, 2) physical and metadata duplicates, 3) metadata-only duplicates.

Then you ask the user: DUPLICATES FOUND, what do you want to delete? "Same physical files; same physical files + metadata; only metadata; none"
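The two-round classification could be sketched like this (a minimal sketch; `classify_duplicates` and `metadata_key` are hypothetical names standing in for Calibre's actual metadata comparison, not its real API):

```python
import os
import zlib
from collections import defaultdict

def _crc32_size(path, chunk_size=1 << 20):
    # Round 1 fingerprint: CRC32 of the bytes plus the file size.
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return (crc & 0xFFFFFFFF, os.path.getsize(path))

def classify_duplicates(paths, metadata_key):
    """Two-round duplicate check (sketch, not Calibre's actual code).

    Round 1 groups files by CRC32 + size (physical duplicates);
    round 2 groups them by a caller-supplied metadata_key function.
    Returns three sets of paths: physical-only duplicates,
    physical + metadata duplicates, and metadata-only duplicates."""
    by_bytes = defaultdict(list)
    by_meta = defaultdict(list)
    for p in paths:
        by_bytes[_crc32_size(p)].append(p)
        by_meta[metadata_key(p)].append(p)

    phys = {p for group in by_bytes.values() if len(group) > 1 for p in group}
    meta = {p for group in by_meta.values() if len(group) > 1 for p in group}
    return phys - meta, phys & meta, meta - phys
```

Files that land only in the metadata list (same title, different bytes) are exactly the "false duplicates" from the CHM example above, and the user prompt would let them survive.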

What do you think about this proposal?

Giuseppe Chillemi