MobileRead Forums - View Single Post - Calibre and bit-rot

chaley · 09-10-2010, 04:25 PM

Problem 1: whatever checks the hash must know when to regenerate it. Calibre doesn't know when I edit an epub, when a viewer drops bookmarks in, or when other operations legitimately change the file. The user might know, though.

Problem 2: telling me that a file is already corrupt is too late. I want the file repaired. Knowing that isn't going to happen, I keep one set of backups on a RAID disk, and another set on DVD. You will now note that I need to know to go get the backup. That takes us to ...

Mitigation 1: epub (at least) is in fact zip, which is internally protected by checksums. I think that mobi is as well. Such filetypes are easily scanned using existing tools.
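As a rough illustration of scanning with existing tools, here is a minimal Python sketch that uses the standard library's zipfile module to re-read every member of an epub and compare it against the CRC-32 stored in the archive. The function name check_epub is made up for this example, and it only covers zip-based formats; it makes no claim about mobi.

```python
import sys
import zipfile
import zlib


def check_epub(path):
    """Return None if every member passes its CRC-32 check, else a short description."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() reads every member and returns the name of the
            # first one that fails its check, or None if all are fine.
            bad = zf.testzip()
    except (zipfile.BadZipFile, zlib.error, OSError) as exc:
        return f"unreadable archive: {exc}"
    return None if bad is None else f"bad member: {bad}"


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_epub.py book1.epub book2.epub
    for path in sys.argv[1:]:
        problem = check_epub(path)
        print(f"{path}: {problem or 'OK'}")
```

Note that this only detects damage. As with Problem 2, you still have to go and get the backup yourself.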

Mitigation 2: you can do this today using external tools and calibre's command line. For example, make a custom column called sha1. Use whatever tool you wish to compute the SHA1s of all the files for a book, saving the output as a long string. Use calibredb set_custom to write that string into the database. Use calibredb list to extract that string and compare the hashes. For example, on Linux you could use sha1sum to generate a set of hashes, and sha1sum --check to verify those hashes. Alternatively, and more simply, since it needs no custom column, periodically compare the files against a stored checksum list. Regenerate the list from time to time (such as when things change), and check the sums at whatever frequency you want.
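The simpler stored-list variant could look something like the Python sketch below. It is only one way to do it, not anything calibre provides: the function names and the manifest file are invented for this example. Each manifest line is a SHA1 digest, two spaces, then a path relative to the library folder, which is roughly the layout sha1sum prints for ordinary file names.

```python
import hashlib
from pathlib import Path


def sha1_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large books are not loaded into memory at once."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(library_dir, manifest_path):
    """Record 'digest  relative/path' for every file under library_dir."""
    library_dir = Path(library_dir)
    manifest_path = Path(manifest_path)
    lines = []
    for p in sorted(library_dir.rglob("*")):
        # Skip the manifest itself in case it is kept inside the library folder.
        if p.is_file() and p.resolve() != manifest_path.resolve():
            rel = p.relative_to(library_dir).as_posix()
            lines.append(f"{sha1_of(p)}  {rel}\n")
    manifest_path.write_text("".join(lines), encoding="utf-8")


def verify_manifest(library_dir, manifest_path):
    """Return a list of (relative path, 'missing' or 'changed') for files that fail."""
    library_dir = Path(library_dir)
    problems = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        digest, rel = line.split("  ", 1)
        target = library_dir / rel
        if not target.is_file():
            problems.append((rel, "missing"))
        elif sha1_of(target) != digest:
            problems.append((rel, "changed"))
    return problems
```

Run write_manifest after you have knowingly changed things, and verify_manifest on whatever schedule suits you. A file reported as "changed" is either bit-rot or a legitimate edit made since the list was generated, and the script cannot tell which, so Problem 1 still falls to the user.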

Comment 1: I am not convinced that I want calibre to be involved in archival issues like this. First, archive verification is a personal thing, touching backup schemes and personal preferences. Second, calibre changes very quickly, and compatibility difficulties will certainly arise. Third, development and maintenance would be taxing for a small team of volunteers.

Comment 2: It should be possible for an interested party to build some tools that run alongside calibre. The techniques mentioned above could be used, or perhaps others.
