What format to store books in? What software to read them with? - Page 4

kovidgoyal · 12-31-2007, 11:55 PM

You're making all this way too complex, and you're putting the cart before the horse. Before worrying about how you're going to convert from your "base" format into other formats, worry about how you're going to convert all existing ebooks in various formats into your base format. Once you have all your books in one format, provided it is reasonably well designed, writing a converter into any other format is the work of perhaps a couple of days.

nairbv · 01-01-2008, 01:25 AM

I disagree.

Back to my imaginary file format analogy.

Let's say I come up with a "standard" that's either a txt file, or an rtf, or html, or xml file, or doc file, and I put it in a zip file and change the file name.

This would be a very easy format to convert to (put anything in a zip file and change the name), but would be a pain in the ass to convert from, and would essentially be useless. You'd never know for sure whether or not you were getting the correct semantic data (which might or might not be stored in an xml version, or might be in html meta data, or might also just be in the first few lines of a text file), and your conversion would be riddled with if statements, essentially just being the sum total of all the separate conversion programs that convert from any other file format.

There's no sense in bothering to convert files into ONE file format if it's not actually ONE file format.

Likewise, it's inherently not possible for me to write a good converter that would convert from a file type that potentially stored the same semantic data in multiple places. I'd have no way to programmatically know which one was correct when data was found in more than one spot. A file type with such a nature, as far as I'm concerned, throws away information.

Maybe epub isn't as bad as my example, but as far as I can tell it does "support" dtbook, xhtml, and xml content in varying confusing ways, as well as having the ability to store the same semantic data in multiple places.

So... I'm not just going to choose whatever's easiest to convert to. My priority is to store all the information in the way that's best for getting the information out, not in. If it's difficult to get information in, then that inherently demonstrates the faulty nature of the format I'm trying to convert from, not a fault of the format I'm converting to.

Lexicon · 01-01-2008, 10:38 AM

nairbv,

Here is the situation, as far as I can tell by reading the specifications, regarding epub:

epub is a container format. Within that container the text content is encoded as XML, and that XML can be in one of two dialects: XHTML or DTBook. Normally you would use one or the other to hold the text, not both. CSS is used to describe how the XML is formatted for display.

For a reading system to be conformant with the spec it must support XHTML, DTBook and CSS. See this section of the specification:

Quote:

1.4.2: Reading System Conformance

This specification defines only one level of conformance for a Reading System. A Reading System is conformant if and only if it processes documents as follows:

When presented with an OPS Content Document the Reading System must:

1. correctly process the XML as required in the XML 1.1 specification, including that specification’s requirements for the handling of well-formedness errors; and
2. recognize all markup described as permitted in this specification and processes it consistently with the corresponding explanations in this specification and in those of XHTML 1.1, CSS 2, and DTBook (in case of any conflict, this specification takes precedence); and
3. not render img or object elements of unsupported media types, in the absence of fallbacks. These fallbacks are clearly defined herein — img in Section 2.3.4 and object in Section 2.3.6; and
4. verify the existence of the appropriate namespace specifications, as defined in the Relationship to XML Namespaces section above.

It is certainly possible for a company to release a reading system that only understands XHTML and CSS, but such a system would not be epub compliant and the manufacturer would (hopefully) be subject to the same kind of scorn and criticism that Microsoft is when it releases web browsers that don't fully support the current HTML spec.

If simply supporting XHTML and CSS constitutes an epub reader then the bar is set pretty low. Indeed if we use that definition I imagine most of us already have an epub reader installed (Firefox can display XHTML and CSS for example).

So XHTML, DTBook and CSS support is required. Epub also allows other XML dialects to be used, but a reading system is not required to support these additional dialects. Instead, epub provides a fallback method so the publisher can effectively say:

If this reading device supports TEI then display this page (which is tagged as TEI) otherwise display this other page (which is tagged as XHTML).

For a document to be epub compliant it must provide a fallback (i.e. a page in a format that a compliant reader can parse - XHTML or DTBook) for every piece of non-standard XML used in the document. The idea is that if Sony releases a reader that supports epub + MathML then you can write an epub document to take advantage of that expanded capability but which will also display fine on a different reading system that doesn't understand MathML.
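A rough sketch of how a reader or converter might walk that fallback chain, assuming the OPF manifest marks each item with `media-type` and `fallback` attributes. The sample manifest, the item ids and the `application/tei+xml` label are made up for illustration, and the file names simply echo the spec's chapter2.xml / chapter2.html example:

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"

# Illustrative manifest only: ids, hrefs and the TEI media type are assumptions.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="ch2-tei" href="chapter2.xml"
          media-type="application/tei+xml" fallback="ch2-html"/>
    <item id="ch2-html" href="chapter2.html"
          media-type="application/xhtml+xml"/>
  </manifest>
</package>"""


def load_manifest(opf_xml):
    """Map manifest item ids to their <item> elements."""
    root = ET.fromstring(opf_xml)
    return {item.get("id"): item for item in root.iter(f"{{{OPF_NS}}}item")}


def resolve(items, item_id, supported):
    """Follow fallback links until an item with a supported media type is found."""
    seen = set()
    while item_id is not None and item_id not in seen:
        seen.add(item_id)  # guard against circular fallback chains
        item = items.get(item_id)
        if item is None:  # fallback points at an id that is not in the manifest
            return None
        if item.get("media-type") in supported:
            return item.get("href")
        item_id = item.get("fallback")
    return None


items = load_manifest(SAMPLE_OPF)
print(resolve(items, "ch2-tei", {"application/xhtml+xml"}))
```

A reader that also lists the TEI media type as supported would get chapter2.xml back instead, so the "if statements" end up in one small loop rather than spread through the converter.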

Regarding DTBook borrowing from the HTML spec: it does, but only when it makes sense to do so. For instance, they both use the paragraph tag, because a paragraph is a useful semantic unit when marking up books. There's no point re-inventing the wheel after all.

To attempt to answer one of your earlier questions: yes, it is possible to represent the content of a book entirely in DTBook XML. If you want it to be useful to most people then you should probably create a CSS file to go with it so it can be displayed in a visually attractive form. Once you have your DTBook and CSS files it should be a very easy step to package them together into an epub document for viewing on an epub compliant reading system.

kovidgoyal · 01-01-2008, 12:21 PM

nairbv, as long as you choose one file format all the metadata will be stored consistently, because you will ensure that while converting to that format. Various formats zipped up is not a single file format, it's various file formats zipped up.

So to re-iterate, you will need to write converters from various file formats to your single file format and these converters are going to have to be able to read metadata from all these various file formats as well as converting the content itself. That is the hard part. Writing a converter from some format (even epub) to any other format is trivial by comparison.

Now, I've already substantially solved this problem in libprs (the only major ebook format for which I don't read metadata or convert is .mobi). I've chosen to store the metadata not in a file format but in a database, and when you export files from the database, the metadata is written to an OPF file.
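To make the "database record in, OPF file out" idea concrete, here is a minimal sketch. It is not libprs code. The record keys (`title`, `creator`, `language`) and the function name are hypothetical, and a real OPF package also needs things this leaves out, such as a unique identifier, a manifest and a spine:

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Use readable prefixes instead of ns0/ns1 in the serialized output.
ET.register_namespace("opf", OPF_NS)
ET.register_namespace("dc", DC_NS)


def metadata_to_opf(record):
    """Serialize a metadata record (a plain dict, hypothetical schema) as OPF XML."""
    package = ET.Element(f"{{{OPF_NS}}}package", {"version": "2.0"})
    metadata = ET.SubElement(package, f"{{{OPF_NS}}}metadata")
    for key in ("title", "creator", "language"):
        value = record.get(key)
        if value:  # skip fields the database does not have
            ET.SubElement(metadata, f"{{{DC_NS}}}{key}").text = value
    return ET.tostring(package, encoding="unicode")


# Sample record, for illustration only.
opf_text = metadata_to_opf({"title": "Example Title", "creator": "Example Author"})
print(opf_text)
```

The point of the design is that the database row is the single source of truth; the OPF file is just an export of it.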

andyafro · 01-01-2008, 12:36 PM

I have a main library that's in order (author, genre) but has loads of different formats (lit, doc, pdf, etc.), and I have a Sony Reader library that's also in order; I add books to it as I convert them to LRF. I would do it that way. It's a lot of time to waste on a book if you're not ready to read it.

DaleDe · 01-01-2008, 02:14 PM

Quote:

Originally Posted by nairbv

ooh, I just found this:

"even though OPS 2.0 (which is inside the EPub container) supports DTBook well enough that a DTBook Publication can easily be made to conform to OPS 2.0, this does not mean that EPub requires support for the unique features in DTBook. The only “DTBook” requirement is that all OPS 2.0 Publications must include the DTBook NCX, which is the machine-readable table of contents."

from http://www.teleread.org/blog/2007/11...point-readers/

So, epub doesn't really support DTBook, it just uses the same kind of table of contents? What's "can be made to conform" mean? like, DTBook's spec could be tweaked a little? or a rigidly made DTBook could as-is meet OPS requirements? Does that just mean not using any of the dtbook-specific tags so that it was just semantics-free html? I'm assuming a DTBook doesn't start with an "HTML" tag or anything like that though. Maybe they mean calling it "out of line" xml (ops 2.0 #1.4.1.4), ignoring the semantic data in the file, and including a link to a transforming stylesheet? Doesn't sound pretty either way.

but then later I see "DTBook is valid markup for use as content (along with XHTML)"

Then again from here it also looks like content is an "either/or" of html or dtbook (or maybe xml .. + stylesheet?):
http://www.idpf.org/2007/ops/OPS_2.0...Section1.4.1.1

they also talk a lot about CSS in this spec, but seem to be referring to it as xml stylesheets. I thought xml stylesheets were xsl, not css? or are they talking about xsl?

and then in 2.6.2.3.1 I see "If the Reading System is capable of processing the document format of chapter2.xml then the link resolves to chapter2.xml. Otherwise, the link resolves to the fallback for chapter2.xml, which is chapter2.html" .... so, ... yeah, ... a bunch of if statements to find your content, based on which style of content the particular reader has implemented a way to render. This all seems pretty lame to me.

It can support DTBook, but this is a different mimetype, so they can't be mixed in the same book, although some features can be mixed if they can coexist. Like Digital Editions, not all ePUB documents are the same type, which is why the mimetype is so important and required to be in clear text at the beginning of the file.

Dale

nairbv · 01-02-2008, 01:44 AM

lexicon:

conversion software is essentially reader software... so what you're saying confirms my point. To write a valid conforming from-epub converter, I have to be able to parse at least two different kinds of content files.

kovidgoyal:

sure, if I write the converter that converts to epub, I can control where I put my metadata, but if I'm storing all my files as epub and I get one that's already epub, I'm not going to have any idea if someone else consistently put the same metadata in every place that metadata could possibly be stored. I might also be inclined to use someone else's conversion software when/if it exists. If I find conflicting metadata in some file, I might not even have a programmatic way to guess things like what the correct title is.

I'd have to solve this problem in an epub->dtbook converter anyways, but why add to the problem?

Likewise, if I turn a DTBook into an epub book by essentially putting it in a zip and renaming it, I know *less,* not more about what I have in that zip file, since henceforth it might be a DTBook and might be XHTML. That seems to only add confusion, so I think I'd rather just keep books in DTBook format if I can get them there. If I decide later that epub is a better format than DTBook, ... conversion from DTBook should be as trivial as a few lines of shell script.

since DTBook practically is epub, it shouldn't be significantly more difficult to convert to dtbook than to epub. The only added difficulty is that I'm constrained to converting to one format, instead of sporadically converting to either of two formats. I'd rather convert consistently to one format than either of two. If I have two formats of files and for consistency convert them into something that might be one or might be the other of two other file formats, .... that just sounds like an exercise in futility.

kovidgoyal · 01-02-2008, 11:56 AM

Umm, from what I remember of the epub spec, an epub zip file stores metadata in only one place (an opf file) and a well formed epub document should specify whether it contains dtbook or xhtml. In any case that is really easy to detect by just looking at the XML headers.
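A small sketch of what "just looking at the XML headers" could mean in practice: parse the content document and check the namespace and name of its root element. The two namespace URIs are the ones I believe XHTML and DTBook 2005 use; the sample documents and the function name are made up for illustration:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DTBOOK_NS = "http://www.daisy.org/z3986/2005/dtbook/"  # assumed DTBook 2005 namespace

# Minimal sample documents, for illustration only.
XHTML_SAMPLE = (
    b'<?xml version="1.0"?>'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b"<head><title>t</title></head><body><p>x</p></body></html>"
)
DTBOOK_SAMPLE = (
    b'<?xml version="1.0"?>'
    b'<dtbook xmlns="http://www.daisy.org/z3986/2005/dtbook/" version="2005-2">'
    b"<book><bodymatter/></book></dtbook>"
)


def content_kind(xml_bytes):
    """Classify a content document by the qualified name of its root element."""
    root = ET.fromstring(xml_bytes)
    if root.tag == f"{{{XHTML_NS}}}html":
        return "xhtml"
    if root.tag == f"{{{DTBOOK_NS}}}dtbook":
        return "dtbook"
    return "unknown"


print(content_kind(XHTML_SAMPLE), content_kind(DTBOOK_SAMPLE))
```

A converter would then dispatch on the returned string, so the XHTML-or-DTBook question is settled once per content file rather than guessed at throughout the code.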

DaleDe · 01-02-2008, 12:00 PM

Quote:

Originally Posted by kovidgoyal

Umm, from what I remember of the epub spec, an epub zip file stores metadata in only one place (an opf file) and a well formed epub document should specify whether it contains dtbook or xhtml. In any case that is really easy to detect by just looking at the XML headers.

The mimetype tells you all you need to know, and it is clear text at the beginning of the zip file. You don't need to unzip it as the mimetype is not compressed. Cat, more, or head will do the trick, or just open the file and read it starting at byte 30.

Dale
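Here is a sketch of reading that clear-text entry by hand, assuming the standard ZIP layout: a 30-byte fixed local file header, then the file name, then the data. So byte 30 is where the name "mimetype" starts, and the value itself follows right after it. The sample archive is built in memory just so the example runs; the function names are my own. As far as I can tell the container spec fixes the value to `application/epub+zip`, so whether the book holds XHTML or DTBook would still have to come from the media types listed in the OPF manifest:

```python
import io
import struct
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def build_sample_container():
    """Build a tiny in-memory ZIP whose first entry is a stored 'mimetype' file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # A bare ZipInfo defaults to ZIP_STORED (no compression) and has no extra field.
        zf.writestr(zipfile.ZipInfo("mimetype"), EPUB_MIMETYPE)
    return buf.getvalue()


def read_first_entry(data):
    """Return (name, raw data bytes) of the first ZIP entry without unzipping."""
    if data[:4] != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    # Header fields are little-endian: compressed size at offset 18,
    # file name length and extra field length at offsets 26 and 28.
    (comp_size,) = struct.unpack_from("<I", data, 18)
    name_len, extra_len = struct.unpack_from("<HH", data, 26)
    name = data[30:30 + name_len]
    start = 30 + name_len + extra_len
    # The raw bytes equal the file contents only because the entry is stored.
    return name, data[start:start + comp_size]


sample = build_sample_container()
name, value = read_first_entry(sample)
print(name, value)
```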

recycledelectron · 01-02-2008, 08:28 PM

I'm a genius.

Quote:

Originally Posted by recycledelectron

I dropped DOC, XLS, MDB, etc when Microsoft moved to DOCX, XLSX, etc because Microsoft's data formats are too ephemeral for long-term use.

Take a look at: http://it.slashdot.org/it/08/01/01/137257.shtml
"Office 2003 Service Pack Disables Older File Formats"

Basically, older file formats are removed automatically from Office 2003, unless you go through an unworkable work-around.

That's why it is critical to standardize on file formats, and keep your files in only those formats.

Andy

nairbv · 01-03-2008, 01:00 AM

@recycledelectron:

So what's your opinion on epub? ... a file format that is sometimes a zip file containing XHTML, and is sometimes a zip file containing a DTBook? ... and then maybe in addition an "it's preferred if you use this xml document if you know how to parse it" other option?

A file format whose rendering will be handled by css when displayed in a web browser, but by an Adobe-specific file called page-template.xpgt when displayed by the primary currently existing "epub compliant" software.

@kovidgoyal:

Sure, it *should* go in the opf file... but if converted from html, the metadata will probably also be in the html file. If converted from a dtbook, it will probably also be in the dtbook. If converted lazily, which will often enough be the case, it might not have been copied into the opf file.

When converting from epub to html, most people will just pull out the html file and think "i'm done," ... and thus ideal epub authoring software would put the data in both places when creating the epub file initially. Often enough, buggy software cuts off some string somewhere at some number of characters, and so even just minor things like sporadic poorly written software will mean that two versions of metadata won't match.

*Good* reader software would probably check html and/or dtbook metadata when it fails to find all metadata in the opf file... since, after all, it might be there, and why miss data?

I see these as unnecessary complications introduced by a poorly thought out design. I'm just saying that I would prefer a solution that only stores semantic data once. For me, much of the point of moving to a single format is to reduce redundancy. If the file I'm converting to maintains redundancy, then there's no reason for me to bother.
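The duplicated-metadata worry can be made concrete with a short sketch that pulls the title from both places and reports a mismatch. It assumes the OPF side uses Dublin Core `dc:title` and the XHTML side uses the ordinary `<title>` element; the sample documents, the truncated title and the function names are invented for illustration:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Illustrative documents: the XHTML title has been cut short by "buggy software".
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>A Fairly Long Example Title</dc:title>
  </metadata>
</package>"""

SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>A Fairly Long Exam</title></head>
  <body><p>text</p></body>
</html>"""


def first_text(xml_text, qualified_tag):
    """Return the stripped text of the first matching element, or None."""
    el = ET.fromstring(xml_text).find(f".//{qualified_tag}")
    if el is None or el.text is None:
        return None
    return el.text.strip()


def title_conflict(opf_xml, xhtml):
    """Return (opf_title, xhtml_title) when both exist and differ, else None."""
    opf_title = first_text(opf_xml, f"{{{DC_NS}}}title")
    html_title = first_text(xhtml, f"{{{XHTML_NS}}}title")
    if opf_title and html_title and opf_title != html_title:
        return opf_title, html_title
    return None


print(title_conflict(SAMPLE_OPF, SAMPLE_XHTML))
```

Detecting the conflict is the easy part. The code has no way to say which of the two strings is the right one, which is the objection being made here.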

recycledelectron · 01-04-2008, 02:27 AM

Quote:

Originally Posted by nairbv

@recycledelectron:
So what's your opinion on epub?

I've never heard of it until now, and certainly will not be looking into it until I see thousands of my documents showing up in that format. It's kinda like if you ask me what I think of the claim that Larry Niven is God. I've never considered it, and will not be going on an extended spiritual quest unless I see lots of believers (not a few guys on MobileRead.com) and lots of evidence. No offense.

Y'all have pointed out a few other great points on a library.

1. Redundancy, unless your files are expendable.

2. Organize them by some method. Title, Author, Subject (my pick,) etc.

3. Fill in the metadata, if you can. I have neglected this, and sincerely regret this now that I have almost 1TB and a Sony PRS-505 that only reads the metadata title. Author, Title, etc. are critical.

4. Keep your collection off line. There is always someone who will claim a copyright on a 75 year old book that's been out of print for decades. They will refuse to print it, it will be unavailable from rare book dealers, and you will find yourself sued and your hard disk wiped if you have a copy in your cache somewhere.

Andy
