Old 09-08-2022, 02:17 PM   #968
jluaioyj
Junior Member
jluaioyj began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Sep 2022
Device: Kobo Touch, Like Book Mars, Android
Optimise Binary Compare using hashes/digests

I'd like to suggest that Find Duplicates' Binary Compare be enhanced to store and compare hashes/digests, to significantly speed it up by avoiding most or all redundant full file compares. A full file compare after a digest match could remain an option for users who suspect hash/digest collision false matches, such as those that have occasionally been found for MD5.
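To illustrate the idea, here is a minimal sketch of a digest-based pre-filter with an optional byte-for-byte confirmation pass; the function names are mine, not the plugin's API:

```python
import hashlib

def file_digest(path, algo="sha256", chunk_size=1 << 20):
    """Compute a hex digest of a file, reading in chunks to bound memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def identical(path_a, path_b, paranoid=False):
    """Digest compare; with paranoid=True, confirm a digest match with a
    full byte-for-byte compare (for users worried about collisions)."""
    if file_digest(path_a) != file_digest(path_b):
        return False
    if not paranoid:
        return True
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(1 << 20), fb.read(1 << 20)
            if a != b:
                return False
            if not a:
                return True
```

The point is that once digests are stored, comparing two books costs a string comparison rather than two full file reads; the full read only happens when a digest must be (re)computed or when the paranoid confirmation is requested.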

I'd suggest storing the binary-comparison metadata in a custom field containing a JSON map with the hash/digest type and a key for each file format. Each format's value map would hold a hexadecimal file hash/digest (I'd suggest SHA-256), the last file size, and the file's last-modified timestamp; the latter two are for validation.
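One possible shape for that field value, built and serialized in Python; every key name here is an assumption for illustration, not an existing Find Duplicates schema:

```python
import json

# Illustrative value for the proposed custom field.
field_value = {
    "digest_type": "sha256",   # algorithm used for all digests below
    "formats": {
        "EPUB": {
            # hex SHA-256 of the format file
            "digest": "9f86d081884c7d659a2feaa0c55ad015"
                      "a3bf4f1b2b0b822cd15d6c15b0f00a08",
            "size": 1234567,     # file size in bytes, for staleness checks
            "mtime": 1662646620.0,  # last-modified timestamp
        },
    },
}

# The custom field would store the serialized JSON text.
serialized = json.dumps(field_value)
```

Keeping the size and mtime alongside the digest makes the validation cheap: a single `stat` call decides whether the stored digest can be trusted or must be recomputed.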

If this custom field is not configured, a warning should be displayed (as the "Last Modified" plugin does) and the current _slow_ full file compare used instead.

During a binary compare:
* If a book's field map is missing, it should be created.
* If the field value is junk, or the hash/digest type is obsolete, the whole map should be rebuilt.
* If a format file is missing, its entry should be removed from the map.
* If a format file was added, an entry should be added to the map.
* If a digest is out of date (file size or last-modified timestamp changed), that entry should be rebuilt.

It would be nice if the above rules were also applied after a book entry is added, after any formats are added or removed, and after any in-calibre format file changes; obviously, I would not expect this to spot updates made outside of calibre.

I'd suggest that this field, and its value creation, validation, and updating should really be provided by calibre itself.

Last edited by jluaioyj; 09-08-2022 at 02:22 PM.