11-09-2007, 03:44 AM   #93
jbenny
Addict
Posts: 323
Karma: 358
Join Date: May 2007
Device: Tablet PC and Nokia N800
Quote:
Originally Posted by sartori
I've been thinking about the issue of identifying which version of a document you may be looking at while researching. For example:

Say I quote chapter 3, paragraph 11 from a book listed on site 1 as Alice In Wonderland.epub. Somebody looking at my work decides to look up the quote in a document called Alice In Wonderland.epub on site 2. The only problem is that site 2 has marked the paragraph starting point incorrectly, so my reference makes no sense.

I have read that each epub document (and probably most others) requires an ID number. Could this ID number be a 10-digit checksum generated from the actual content of the HTML source? That way, even if one character is changed in the source, the checksum would change.

Then when I reference my quote it could be something like Chapter 3, Paragraph 11 - Alice In Wonderland.epub [5684937643]. It should be pretty easy to create a tool that would verify the checksum I typed. Now I could verify any document as being the same one originally referenced no matter where the file was obtained.

Edit: Of course this does nothing to help verify that the document I quoted from was correct in the first place.

Rob
You are referring to the "identifier" in an epub, which is one of three required metadata elements (title and language are the other two). There are several other metadata elements, which are optional. In the following example, taken from "content.opf", the x's mark where the identifier would go. The same identifier also goes in "toc.ncx", using a different statement.
<dc:identifier id="BookID">urn:uuid:xxxxxxxxxxxxxxxxxxx</dc:identifier>
Note that the identifier is required to be unique, such that no other epub should have the same ID.
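
To make the location of the identifier a bit more concrete, here is a minimal Python sketch that reads it back out of a package document. The sample OPF text, the all-zero placeholder UUID, and the function name read_identifier are made up for illustration; the sketch assumes the package element names the identifier through its unique-identifier attribute, matching the id="BookID" shown above.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical, stripped-down content.opf used only for this example.
SAMPLE_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="BookID">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Alice In Wonderland</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="BookID">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
  </metadata>
</package>"""


def read_identifier(opf_text):
    """Return the dc:identifier whose id matches package/@unique-identifier."""
    root = ET.fromstring(opf_text.encode("utf-8"))
    wanted_id = root.get("unique-identifier")
    for element in root.iter("{%s}identifier" % DC_NS):
        if element.get("id") == wanted_id:
            return (element.text or "").strip()
    return None  # no matching identifier found


print(read_identifier(SAMPLE_OPF))
```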

For a commercial ebook, the identifier would be the ISBN. For ebooks without an assigned ISBN, some other means of identifying the ebook is needed. Unless I missed it in the OPS specification, I don't see that it recommends any particular method. However, a UUID (GUID) seems to be the most logical solution, as discussed elsewhere on this forum (and the format of the above statement even implies the use of a UUID). Feedbooks is using a UUID for epubs, according to Hadrien.
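
For anyone who wants to try the UUID route, a small sketch using Python's standard uuid module follows. The choice of a random (version 4) UUID and the helper name new_epub_identifier are my assumptions; nothing in the specification discussed here mandates a particular UUID version.

```python
import uuid


def new_epub_identifier():
    """Return a fresh random (version 4) UUID in URN form, e.g. 'urn:uuid:...'."""
    return uuid.uuid4().urn


identifier = new_epub_identifier()

# Drop the value into the same statement used in content.opf.
print('<dc:identifier id="BookID">%s</dc:identifier>' % identifier)
```

Each call produces a different value, so a new identifier would be generated once per edition and then stored, not regenerated every time the file is built.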

Assuming that a new ISBN or UUID is used whenever an edited or updated version of the original epub is created, this would take care of identifying a particular edition.

The need for uniqueness would seem to preclude using the identifier itself as a checksum. However, one of the optional metadata fields may be usable for this. In fact, I don't see anything that says you can't use your own custom metadata element for the purpose. Of course, getting everyone to use such a method is another issue.
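
As a rough idea of what a custom metadata element could look like, the sketch below appends a name/content meta element to the metadata block of a package document. The element name "content-checksum", the sample OPF, and the function add_checksum_meta are all hypothetical; no such convention exists in the specification, so treat this purely as an illustration.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Keep readable prefixes when the document is written back out.
ET.register_namespace("", OPF_NS)
ET.register_namespace("dc", DC_NS)

# Hypothetical, stripped-down content.opf used only for this example.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="BookID">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Alice In Wonderland</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="BookID">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
  </metadata>
</package>"""


def add_checksum_meta(opf_text, checksum):
    """Return the OPF text with an extra <meta> element carrying the checksum."""
    root = ET.fromstring(opf_text.encode("utf-8"))
    metadata = root.find("{%s}metadata" % OPF_NS)
    meta = ET.SubElement(metadata, "{%s}meta" % OPF_NS)
    meta.set("name", "content-checksum")  # hypothetical name, not a standard
    meta.set("content", checksum)
    return ET.tostring(root, encoding="unicode")


updated_opf = add_checksum_meta(SAMPLE_OPF, "5684937643")
print(updated_opf)
```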

Adding a checksum (or better, a hash) would be a useful addition to the epub specification. You could certainly use it to verify that the contents haven't changed, as you suggested. Again, this may not be important for the casual reader, but people need to think about and find ways to accommodate ebook use by the academic community as well.
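
To show how simple such a verification tool could be, here is a minimal Python sketch. It assumes the hash is SHA-256 taken over the HTML/XHTML files inside the epub (which is a zip archive), read in sorted name order, and shortened to 10 hex characters to resemble the short reference suggested in the quote. The function names, the file-extension filter, and the in-memory demo epub are my own inventions, not part of any specification.

```python
import hashlib
import io
import zipfile

CONTENT_SUFFIXES = (".html", ".xhtml", ".htm")


def content_hash(epub_file, digits=10):
    """Hash the content documents in an epub and return a short hex reference."""
    digest = hashlib.sha256()
    with zipfile.ZipFile(epub_file) as archive:
        # Sort the names so the result does not depend on zip ordering.
        for name in sorted(archive.namelist()):
            if name.lower().endswith(CONTENT_SUFFIXES):
                digest.update(archive.read(name))
    return digest.hexdigest()[:digits]


def verify(epub_file, quoted_reference):
    """Return True if the epub's content hash matches the quoted reference."""
    return content_hash(epub_file, len(quoted_reference)) == quoted_reference.lower()


def build_demo_epub(chapter_text):
    """Build a tiny in-memory zip standing in for an epub (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("OEBPS/chapter3.xhtml", chapter_text)
    buffer.seek(0)
    return buffer


reference = content_hash(build_demo_epub("<p>Alice was beginning to get very tired</p>"))
same_copy = build_demo_epub("<p>Alice was beginning to get very tired</p>")
edited_copy = build_demo_epub("<p>Alice was beginning to get very tirad</p>")

print(reference)
print(verify(same_copy, reference))    # same text, so this matches
print(verify(edited_copy, reference))  # one character differs, so this does not
```

A 10-character reference is convenient to type but is only good for catching accidental differences; a tool meant to resist deliberate tampering would want to keep the full digest.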