Old 09-08-2022, 08:20 PM   #969
capink
Wizard
capink ought to be getting tired of karma fortunes by now.
 
Posts: 1,209
Karma: 1995558
Join Date: Aug 2015
Device: Kindle
Quote:
Originally Posted by jluaioyj View Post
I'd like to suggest that Find Duplicates's Binary Compare be enhanced to support storage and comparison by hash/digest, to significantly speed it up by avoiding most or all redundant full-file compares. A full file compare after a hash/digest match could remain an option for cases where hash/digest collision false-matches are suspected, as have occasionally been discovered with MD5.

I suggest storing the binary-comparison metadata in a custom field containing a JSON map with a hash/digest type and a key for each file format; each format's value map would hold a hexadecimal-formatted file hash/digest (I'd suggest SHA-256), the last file size, and the file's last-modified timestamp, the latter two for validation.
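For illustration only, the field value could be laid out something like this (every key name and value here is a hypothetical example, not an existing calibre or plugin schema):

```python
import json

# Hypothetical layout for the proposed custom-field value: digest type at
# the top level, plus one entry per book format keyed by format name.
field_value = {
    "digest_type": "sha256",
    "formats": {
        "EPUB": {
            "digest": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
            "size": 1482931,           # last known file size in bytes
            "mtime": 1662026130.0,     # last known modification time (epoch)
        },
        "AZW3": {
            "digest": "2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae",
            "size": 902144,
            "mtime": 1660550531.0,
        },
    },
}

# The field itself would store the serialized form.
serialized = json.dumps(field_value, sort_keys=True)
```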

If this custom field is not configured, then, as the "Last Modified" plugin does, a warning should be displayed and the current _slow_ full-file-compare functionality used instead.

During a binary search:
* If any of the field maps are missing, they should be created.
* If the field value is junk or if the hash/digest type is obsolete, the whole field map should be recreated.
* If a format file is missing, it should be removed from the field map.
* If a format file was added, it should be added to the map.
* If the hash/digest value is out of date (file size or last-modified timestamp changed), that format's entry should be rebuilt.
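A rough sketch of the refresh rule for a single format file might look like the following (the helper names, the entry layout, and the use of SHA-256 are all assumptions for illustration, not plugin code):

```python
import hashlib
import os


def hash_file(path, chunk_size=1 << 20):
    """Compute a SHA-256 hex digest by streaming the file in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def refresh_entry(entry, path):
    """Return an up-to-date {digest, size, mtime} map for one format file.

    The stored digest is reused only when both size and mtime still match;
    otherwise the file is re-hashed, per the rules above.
    """
    st = os.stat(path)
    if (entry
            and entry.get("size") == st.st_size
            and entry.get("mtime") == st.st_mtime):
        return entry
    return {"digest": hash_file(path), "size": st.st_size, "mtime": st.st_mtime}


def refresh_formats(fmt_map, format_paths):
    """Rebuild the per-format map: formats whose files are gone are dropped,
    new formats are added, and stale entries are re-hashed."""
    return {fmt: refresh_entry(fmt_map.get(fmt), path)
            for fmt, path in format_paths.items()}
```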

It would be nice if the above rules were applied after a book entry is added, after any formats are added or removed, and after any in-calibre format file changes; obviously, I would not expect this to spot updates made outside of calibre.

I'd suggest that this field, and its value creation, validation, and updating should really be provided by calibre itself.
I never use the binary comparison. I keep my metadata as clean as possible, and this helps me more in discovering duplicates even when the files are not identical. However, a quick look at the code reveals that Find Duplicates already does everything you are asking for, with some caveats:
  • The data for size, hash and mtime are stored using a calibre API method called add_multiple_custom_book_data(), instead of in custom columns. This makes more sense and has the advantage of not burdening the user with unnecessary custom columns.

    N.B. If you are familiar with the Action Chains Plugin, you can use the chain attached below to see the data stored by the plugin (if any). (Action Chains > Add/Modify Chains > Right click the chain table > import)
  • When you run the binary check, before using any stored hash, the plugin first verifies that it is not stale. If the hash is stale, it is re-calculated.
  • The plugin does not calculate and store hashes for all books. For the sake of economy, it only calculates hashes for groups of formats that share the same size, which is bound to be a small subset of the formats in the library.
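That size-first strategy can be sketched as follows; this is a minimal illustration of the idea, not the plugin's actual code, and the function name and input format are made up for the example:

```python
import hashlib
from collections import defaultdict


def find_binary_duplicates(files):
    """files: list of (path, size) pairs.

    Group by size first, then hash only the files whose size collides with
    at least one other file; files with a unique size can never be binary
    duplicates, so they are skipped entirely.
    """
    by_size = defaultdict(list)
    for path, size in files:
        by_size[size].append(path)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:          # unique size: cannot be a duplicate
            continue
        for path in paths:
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_digest[digest].append(path)

    return [group for group in by_digest.values() if len(group) > 1]
```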

Given the last point, automatic hash calculation on book addition does not make much sense. It can be done, but it will not be of much use, because only a small subset of those hashes will ever be needed based on size comparisons. In addition, calculating hashes would slow down adding books, especially when the user is adding a large number of them.
Attached Files
File Type: zip show_find_duplicates_data.zip (499 Bytes, 232 views)