MobileRead Forums - View Single Post

sethcohn · 01-01-2013, 11:16 PM

Quote:

Originally Posted by kiwidude

Find Duplicates has always had two very intentional limits placed on its scope - first that it does not compare content, and second that it does not limit itself to a particular format.

In a matter of minutes I figured out a reasonable method of generating hashes without ever looking at 'content' (merely stripping out the metadata portion):
unzip -qvl epubname.epub -x '*.opf' [& other metadata-y files] | cut -c 49-56 | sort | md5
[This takes the CRC32 values of each file in the zip except those listed, sorts them so the CRC32s are in a known order, and generates an MD5 of the result. It works perfectly for identifying things without ever unpacking the file, and should be lightning fast to generate.]
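For anyone who would rather do this from Python (which is what a calibre plugin would need anyway), here is a rough sketch of the same idea using only the standard library. The function name content_fingerprint and the list of suffixes to skip are my own placeholders, not anything from an existing plugin, and the digest will not match the output of the shell one-liner above byte for byte, since the input to MD5 is formatted differently. It only reads the zip directory, so nothing gets unpacked.

```python
import hashlib
import zipfile

# Assumption: which entries count as "metadata-y" is a guess; extend as needed.
METADATA_SUFFIXES = (".opf",)


def content_fingerprint(epub_path, skip_suffixes=METADATA_SUFFIXES):
    """Return an MD5 over the sorted CRC-32 values of the non-metadata entries.

    Only the zip central directory is read; no file is decompressed.
    """
    with zipfile.ZipFile(epub_path) as zf:
        crcs = sorted(
            f"{info.CRC:08x}"  # CRC-32 as stored in the zip directory
            for info in zf.infolist()
            if not info.filename.lower().endswith(skip_suffixes)
        )
    # Sorting first means the order of entries inside the zip does not matter.
    return hashlib.md5("\n".join(crcs).encode("ascii")).hexdigest()
```

Two epubs that differ only in their .opf should then get the same fingerprint, while any change to a chapter or image changes a CRC and therefore the result.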

As for a 'particular format', there are plenty of epub-only plugins, or features that only work on epubs, such as your addition here: https://www.mobileread.com/forums/sho...&postcount=482

In that case, simply don't generate hashes for formats where it's not appropriate (I wonder whether a similar method would work for mobis etc., though... ideas?)

Quote:

For the majority of users, books will be identified by some aspect of their metadata as being duplicates.

And I'm telling you very clearly: I've seen files where the metadata was the same, yet the book was different (generated differently, or with different image sizes, for example), and vice versa, where metadata differences hide the fact that the book contents are identical and the files differ only in who created them, when, and how.

Quote:

I reserve the right to say no to ones I don't think fit well

Good luck with whatever solution you pursue.

Of course you do... I understand that... I hope someone else who is interested in this steps up. Seems like a pretty simple plugin (perhaps in conjunction with Find Duplicates: populate a metadata column with this hash for all items, then use your plugin to remove one of the items with the matching/duplicate hash.)
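To make the second half of that concrete, here is a minimal sketch of the grouping step, assuming the hashes have already been stored somewhere as a plain mapping of book id to hash string. The name duplicate_groups and the shape of the input are hypothetical; this does not use calibre's database API or anything from Find Duplicates.

```python
from collections import defaultdict


def duplicate_groups(hash_by_book):
    """Group book ids that share the same hash.

    hash_by_book: dict of book id -> hash string (None or "" if no hash).
    Returns a list of sorted id lists, one per hash seen more than once.
    """
    groups = defaultdict(list)
    for book_id, digest in hash_by_book.items():
        if digest:  # books without a hash (e.g. non-epub formats) are skipped
            groups[digest].append(book_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

Each returned group is a set of candidates to review; deciding which copy to keep is still up to the user or the other plugin.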