Old 09-10-2010, 03:59 PM   #1
BookGnome
Voracious Reader
Posts: 4
Karma: 62
Join Date: Sep 2010
Device: Kindle
Calibre and bit-rot

I was thinking about something today, and as far as I can tell from looking at the database structure, Calibre has no protection against bit-rot of the e-books themselves. I'm not talking about database corruption; I'm talking about filesystems with bad sectors, where books might get silently corrupted on disk.

All Calibre seems to track is the uncompressed size of the book. Size alone isn't really much of a guarantee of file integrity.

It seems to me that if it's not doing so already, Calibre ought to store a hash of the book in the database to validate that the book hasn't been corrupted on disk. An SHA-1 or MD5 hash would probably be sufficient for the purpose.
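
A minimal sketch of the hashing idea in Python, not anything Calibre is known to do: the function names `file_sha1` and `is_intact` are hypothetical, and where the recorded digest would live (a database column, a side file) is left open.

```python
import hashlib
from pathlib import Path


def file_sha1(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, recorded_digest: str) -> bool:
    """Compare the file's current digest with one recorded earlier (hypothetical storage)."""
    return file_sha1(path) == recorded_digest.lower()
```

A mismatch only says the bytes changed. It cannot tell bit-rot apart from a deliberate edit, so the recorded digest has to be refreshed whenever the file is changed on purpose.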

In addition, it might be wise to store some recovery bits on the filesystem (e.g. par2 files, or some other variant of Reed-Solomon encoding) in order to be able to recover from modest amounts of on-disk corruption.

Integrity of the database itself is important, but I've seen enough disks get flakey--and bought enough e-books from vendors that don't allow re-downloads--that protection from bit-rot on the filesystem is important to me.

I thought I'd post about it here and see what others thought before posting wish-list items on the bug tracker. After all, maybe Calibre is already doing something smart about this, and I just don't know about it.

Comments? Suggestions? Additional thoughts?
Old 09-10-2010, 04:22 PM   #2
Worldwalker
Curmudgeon
Posts: 3,087
Karma: 722357
Join Date: Feb 2010
Device: PRS-505
Hmmm ... my big question is whether protection from bit rot is something that should be handled at the individual app level, or at the system level. I wouldn't want every program I owned doing its own (possibly incompatible) thing to protect the files it uses. That should either be a part of the OS or a single third-party app run on a regular basis.
Old 09-10-2010, 04:52 PM   #3
capidamonte
Not who you think I am...
Posts: 346
Karma: 5337
Join Date: Jan 2010
Location: Honolulu
Device: Sony PRS-350
I like your idea BookGnome.

It's an excellent first post, too!

A simple hash seems like a very good idea: it's pretty low on system requirements and could certainly be part of the conversion/copy process in the library folder. That hash could be stored in the database. I wouldn't necessarily want it to be checked every time I access or send a book, though. I fear that it might make the interface a little laggy when manipulating multiple books, but I could be wrong. I'd like to see something like a context menu option, "Check On-Disk File(s) Integrity" or some such, that would check the hash for all the formats of a book.

Worldwalker, I have to disagree with you. There are at least three OSes (and multiple variants) that would all require different solutions and software, whereas the hash could be pretty easily and invisibly integrated into Calibre.
Old 09-10-2010, 05:25 PM   #4
chaley
"chaley", not "charley"
Posts: 5,902
Karma: 1216548
Join Date: Jan 2010
Location: France
Device: Many android devices
Problem 1: whatever checks the hash must know when to regenerate it. Calibre doesn't know when I edit an epub, when a viewer drops bookmarks in, or when other operations legitimately change the file. The user might know, though.

Problem 2: telling me that a file is already corrupt is too late. I want the file repaired. Knowing that isn't going to happen, I keep one set of backups on a RAID disk, and another set on DVD. You will now note that I need to know to go get the backup. That takes us to ...

Mitigation 1: epub (at least) is in fact zip, which is internally protected by checksums. I think that mobi is as well. Such filetypes are easily scanned using existing tools.
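
As a rough illustration of scanning a zip-based format such as epub, the sketch below uses Python's standard `zipfile` module, whose `testzip()` reads every member and checks its CRC. The function name `zip_container_ok` is made up for this example, and it only covers zip containers, not mobi.

```python
import zipfile
import zlib


def zip_container_ok(path: str) -> bool:
    """Return True when every member of a zip-based book passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first bad member, or None if all are fine.
            return archive.testzip() is None
    except (zipfile.BadZipFile, zlib.error):
        # The container itself is unreadable, or a compressed stream is damaged.
        return False
```

This catches damage to the stored data, but a file that was legitimately rewritten still passes, so it complements an external hash rather than replacing it.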

Mitigation 2: you can do this today using external tools and calibre's command line. For example, make a custom column called sha1. Use whatever tool you wish to compute the SHA1s of all the files for a book, saving the output as a long string. Use calibredb set_custom to write that string into the database. Use calibredb list to extract that string and compare the hashes. For example, on Linux you could use sha1sum to generate a set of hashes, and sha1sum --check to verify those hashes. Alternatively, and more simply, without requiring a custom column, periodically run checksum compares against a stored checksum list. From time to time generate the list (such as when things change). At whatever frequency you want, check the sums.
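
The simpler variant (a stored checksum list that is regenerated now and then and checked on a schedule) could look roughly like the Python sketch below. It is a stand-in for sha1sum and sha1sum --check, not a calibre feature; the names `build_manifest` and `changed_files` are hypothetical, and how the manifest is saved between runs is left out.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha1_of(path: Path) -> str:
    """Hex SHA-1 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(library: Path) -> Dict[str, str]:
    """Map each file's path, relative to the library root, to its SHA-1 digest."""
    return {
        path.relative_to(library).as_posix(): sha1_of(path)
        for path in sorted(library.rglob("*"))
        if path.is_file()
    }


def changed_files(library: Path, manifest: Dict[str, str]) -> List[str]:
    """List manifest entries that are now missing or whose digest differs."""
    problems = []
    for relative, recorded in manifest.items():
        target = library / relative
        if not target.is_file() or sha1_of(target) != recorded:
            problems.append(relative)
    return problems
```

Files that change in normal use, such as the library database, will show up as changed on every run, so in practice they would be excluded or the manifest regenerated after each legitimate change.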

Comment 1: I am not convinced that I want calibre to be involved in archival issues like this. First, archive verification is a personal thing, touching backup schemes and personal preferences. Second, calibre changes very quickly, and compatibility difficulties will certainly arise. Third, development and maintenance would be taxing for a small team of volunteers.

Comment 2: It should be possible for an interested party to build some tools that run alongside calibre. The techniques mentioned above could be used, or perhaps others.
Old 09-10-2010, 07:04 PM   #5
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Using it for catching corrupted files would be of secondary interest, but I can see some value in doing that.

Quote:
Originally Posted by chaley View Post
Problem 1: whatever checks the hash must know when to regenerate it. Calibre doesn't know when I edit an epub, when a viewer drops bookmarks in, or when other operations legitimately change the file. The user might know, though.
This sort of scenario would be the main reason I would want a feature like this. I don't really see it as a huge problem, just re-calculate the hash whenever the user performs a conversion/transfer/metadata manipulation of the book. Right now, if you convert from one format to another, and the destination format already exists, Calibre overwrites the destination format without any warning. If the destination format was one previously created by Calibre then no big deal, that means the user was just tweaking the conversion settings (or upgraded), and is attempting to create an improved conversion.

However, if I did go edit the file in Sigil or by hand then I don't want Calibre to silently destroy my work, which is exactly what happens today. It's extremely difficult to remember which files have been hand edited and which haven't. I've tried using a custom column to track this, but this is prone to user error. Using a checksum feature like this would allow Calibre to automatically recognize that the file was modified outside of Calibre since it was last manipulated. At that point the user could be warned and decide how to handle the situation.
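
One way to picture this check, as a sketch only: before overwriting a format, compare the file on disk with the digest recorded at the last tracked write, and ask the user if they differ. The names `modified_outside` and `write_format` and the `confirm` callback are invented for illustration and do not describe Calibre's actual code.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def modified_outside(path: Path, recorded_digest: Optional[str]) -> bool:
    """True when an existing file does not match the digest from the last tracked write."""
    if not path.exists():
        return False  # nothing on disk, so nothing can be lost
    if recorded_digest is None:
        return True  # unknown provenance: treat as possibly hand-edited
    return hashlib.sha1(path.read_bytes()).hexdigest() != recorded_digest


def write_format(
    path: Path,
    new_data: bytes,
    recorded_digest: Optional[str],
    confirm: Callable[[Path], bool],
) -> Optional[str]:
    """Overwrite path unless it was edited externally and the user declines.

    Returns the digest to record for the new file, or None if the write was skipped.
    """
    if modified_outside(path, recorded_digest) and not confirm(path):
        return None
    path.write_bytes(new_data)
    return hashlib.sha1(new_data).hexdigest()
```

The digest returned by each tracked write is what would be stored, so only changes made behind the application's back trigger the confirmation.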

Last edited by ldolse; 09-10-2010 at 07:08 PM.
Old 09-10-2010, 07:22 PM   #6
kovidgoyal
creator of calibre
Posts: 26,434
Karma: 5383257
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
This is not something I have personally thought about a lot. Perhaps I'm too young to fear bit rot. Certainly, I have so far never lost any data (that I know of) to bit rot. So I can't contribute much to this discussion.

Simply maintaining a hash which is updated on legitimate operations is fairly trivial to do. The question is really what to do with that hash. ldolse's suggestion is one worthwhile use, though better, IMO, would be to add support for declaring a particular format as the "Master format". calibre would then ask for confirmation before running a conversion that would overwrite it.

As far as combating bitrot is concerned, I don't really see how the hash would help. After all, say calibre tells you that the file has changed (this would only happen if you ask calibre to check, for example during a db integrity check). Then what? The file has already "rotted"; there is not much calibre can do about it. I suppose you could then go into your backups to try to find a pre-rotted version.
Old 09-10-2010, 08:31 PM   #7
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by kovidgoyal View Post
Simply maintaining a hash which is updated on legitimate operations is fairly trivial to do. The question is really what to do with that hash. ldolse's suggestion is one worthwhile use, though better, IMO, would be to add support for declaring a particular format as the "Master format". calibre would then ask for confirmation before running a conversion that would overwrite it.
I've thought about the "Master" option too. I think this would be a good approach as well, though in some ways slightly harder to implement: more GUI work to set it, same warning dialogs when doing conversion. The one problem with it is that it would require the user to be aware of the feature and explicitly use it. Both could be implemented, as the Hash and Master functions would complement one another more than they would conflict. The thing I like about the hash is the transparency to the user.
Old 09-10-2010, 08:56 PM   #8
BookGnome
Voracious Reader
Posts: 4
Karma: 62
Join Date: Sep 2010
Device: Kindle
Quote:
Originally Posted by kovidgoyal View Post
use, though better, IMO, would be to add support for declaring a particular format as the "Master format". calibre would then ask for confirmation before running a conversion that would overwrite it.
I really like this idea, because right now, Calibre treats all versions as equal even if they were gathered from different sources. Having a canonical version for a given book might mean a change in user work-flow (e.g. if you buy both an EPUB and a Mobipocket version of a book, you should store them as separate books).

Quote:
Originally Posted by kovidgoyal View Post
As far as combating bitrot is concerned, I don't really see how the hash would help. After all, say calibre tells you that the file has changed (this would only happen if you ask calibre to check, for example during a db integrity check). Then what? The file has already "rotted"; there is not much calibre can do about it. I suppose you could then go into your backups to try to find a pre-rotted version.
Well, even if the only thing it did was notify you that things were hosed, that's better than not knowing. Then you could retrieve from backups, download again, or buy another copy--whatever was needed.

You'd also get something else for free with the hash: the ability to quickly identify exact duplicates in the database. You wouldn't have to rely on book metadata such as author, file size, or ISBN...if two copies have the same hash, then they're the same book. That doesn't mean other types of duplicate checking aren't useful, but finding two books with the same title by Piers Anthony wouldn't tell you whether both copies were byte-for-byte duplicates. Knowing that might make it easier for the user to decide whether a book should be rejected as already in the database, or added anyway as an alternative version--without the hash, there's not really a great way to tell.
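
A small Python sketch of that duplicate check, assuming the files are small enough to read whole; the function name `exact_duplicates` is hypothetical and nothing here reflects how Calibre stores or compares books.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def exact_duplicates(paths: Iterable[Path]) -> List[List[Path]]:
    """Group files whose contents hash to the same SHA-1 digest."""
    by_digest: Dict[str, List[Path]] = defaultdict(list)
    for path in paths:
        by_digest[hashlib.sha1(path.read_bytes()).hexdigest()].append(path)
    # Only digests shared by two or more files indicate byte-for-byte copies.
    return [group for group in by_digest.values() if len(group) > 1]
```

Two editions of the same title with different bytes land in different groups, which is exactly the distinction metadata matching cannot make.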

As for the repair issue, that's a secondary issue, but one that I think is solvable by optionally storing a small amount of recovery data (say, 10%) in the directory with each canonical copy of a book. You wouldn't need it for books you can regenerate from your known-good master copy, so the file system usage wouldn't actually grow by 10% overall--just a small percentage for each master format.

Whether Calibre shells out to par2create/par2repair, or somehow integrates some Reed-Solomon library directly (perhaps this Python library?) I think Calibre could easily add the ability to recover from file system corruption.
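
Shelling out might look something like the sketch below. It assumes the par2cmdline tool is installed as `par2` and accepts `create` with an `-r<percent>` redundancy option; the helper names `par2_create_command` and `protect` are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def par2_create_command(book: Path, redundancy_percent: int = 10) -> List[str]:
    """Build the argument list for creating recovery data next to a book file."""
    if not 1 <= redundancy_percent <= 100:
        raise ValueError("redundancy_percent must be between 1 and 100")
    # Assumption: recovery files are named after the book, e.g. book.epub.par2.
    recovery = book.with_name(book.name + ".par2")
    return ["par2", "create", f"-r{redundancy_percent}", str(recovery), str(book)]


def protect(book: Path, redundancy_percent: int = 10) -> bool:
    """Run par2 if it is on PATH; return False when it is missing or reports failure."""
    if shutil.which("par2") is None:
        return False
    result = subprocess.run(
        par2_create_command(book, redundancy_percent), capture_output=True
    )
    return result.returncode == 0
```

Keeping command construction separate from execution makes it possible to inspect or log exactly what would be run before touching the library.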

I definitely agree that identifying corruption, and preventing silent corruption (the worst kind, IMHO), is more important than fixing it. Knowing something is wrong always has to be the first step.
Old 09-10-2010, 10:14 PM   #9
kovidgoyal
creator of calibre
Posts: 26,434
Karma: 5383257
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
It's certainly something that's worth looking into, so feel free to open a ticket. I can't promise it'll happen anytime soon as I already have a todo list longer than I am. So if someone wants to volunteer to code, I'll be happy to provide any needed guidance, otherwise it will be on the pile of things I will get to when I get to them...
Old 09-11-2010, 07:40 AM   #10
GrzegorzN
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2010
Device: Kindle 3
Quote:
Originally Posted by Worldwalker View Post
Hmmm ... my big question is whether protection from bit rot is something that should be handled at the individual app level, or at the system level.
I second that.

Calibre does not try to scan my PDF/DOC ebooks for viruses and macros, and Calibre does not try to defragment my hard drive to make book access faster. I don't think Calibre should try to offer me so much protection from hardware failures.

I don't think it's Calibre's "core business"... and there are definitely much better (dedicated & tested) tools for that sort of thing. If you want to protect your data and provide redundancy, use some form of RAID, separate media, maybe online storage, and/or existing archive/restore tools. If you're really serious about protecting your data, I think it makes sense to prefer established, supported and above all well and widely tested(!) solutions. The added benefit is that you're then able to protect not only ebooks, but also any other data you consider valuable.

BTW, I think the idea of a "master copy" (or similar) sounds good. I often start with one or two master documents, and then create specific formats from them (seems to be a common pattern for many users), and it might be useful for Calibre to know the difference between the "initial format" and "derived/transformed format". Also, master copies usually offer the best quality, so that might be a hint when picking the source format for conversion/device upload (although that might not be worth the effort).

Last edited by GrzegorzN; 09-11-2010 at 07:44 AM.
Old 10-15-2010, 07:26 PM   #11
BookGnome
Voracious Reader
Posts: 4
Karma: 62
Join Date: Sep 2010
Device: Kindle
Using ebook_armor.sh to combat bit-rot

Quote:
Originally Posted by kovidgoyal View Post
As far as combating bitrot is concerned, I don't really see how the hash would help. After all, say calibre tells you that the file has changed (this would only happen if you ask calibre to check, for example during a db integrity check). Then what? The file has already "rotted"; there is not much calibre can do about it. I suppose you could then go into your backups to try to find a pre-rotted version.
I thought about your question, and came up with an answer to the question of "then what?" I put together a shell script that uses a combination of things to track integrity, and also uses par2 to allow for recovery of data when damage is detected.

It turns out that some of the ebook formats (like epub or cbz, for example) give you some opportunities for integrity checking "for free" by providing archive integrity tests for their respective containers, in addition to any external hashing. Ultimately, I think the hash is more useful, but I tend to be a belt-and-suspenders kind of guy.

It's not a Calibre plugin, and the solution is very *nix-ish, but it may give you some ideas of your own for Calibre, if and when you decide to add some hash support. Let me know if you have any questions about the implementation, and I'll do what I can to help.
Old 10-15-2010, 07:52 PM   #12
kovidgoyal
creator of calibre
Posts: 26,434
Karma: 5383257
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Cool, yeah, I can see using par to repair damaged books.
All times are GMT -4.

