MobileRead Forums - View Single Post - What format to store books in? What software to read them with?

nairbv · 12-31-2007, 10:04 PM

hmm...

But epub is more than DTBook+CSS... it also has XHTML files, which separately store the text of the book.

My impression, though, is that DTBook ALSO stores the text of the book, plus some amount of presentation information. I've read that the DTBook element set "borrows heavily" from the HTML spec (which doesn't bother me; I just don't like all-out, separate, semantics-free HTML files that include all their own ways of representing metadata etc., and then have to be marked up by external XML to hackily tack on book-related semantic data).

I mean, someone could come up with a "standard" that, behind some funny file extension, is essentially a renamed zip file containing a TXT or HTML or RTF file... and expect all the readers to use if statements. That would have the ability to preserve lots of information and give the authors lots of control... and would "draw from existing standards," but... it wouldn't be a reasonable standard in itself. It wouldn't be anything useful at all.
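To make that concrete, here is a rough Python sketch of what reader software would end up doing with such a made-up "standard". Everything in it is hypothetical: the container layout and the `sniff_payload` and `make_container` names are mine, purely for illustration, and have nothing to do with how epub actually works.

```python
import io
import zipfile
from pathlib import PurePosixPath


def make_container(name, data):
    """Build an in-memory zip holding one file (the hypothetical "book")."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, data)
    return buf.getvalue()


def sniff_payload(container_bytes):
    """Return (member_name, kind) for the first payload the reader recognises.

    This is the "if statements" approach: the container tells the reader
    nothing, so the reader has to guess from the file extension.
    """
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as zf:
        for name in zf.namelist():
            ext = PurePosixPath(name).suffix.lower()
            if ext == ".txt":
                return name, "plain text"
            elif ext in (".html", ".htm"):
                return name, "html"
            elif ext == ".rtf":
                return name, "rtf"
    raise ValueError("no recognised payload in container")


if __name__ == "__main__":
    book = make_container("book.rtf", "{\\rtf1 Hello}")
    print(sniff_payload(book))
```

Every new payload type means another branch in every reader, which is the part that makes it useless as a standard.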

So, if epub is DTBook (which contains the text of a book in an XML-internal, HTML-like format) + the epub XHTML files (which are a separate subset of the HTML spec, stored in individual one-per-chapter files) + epub's XML markup + CSS, all zipped together... how does that work? Implement both and use "if" statements? Does reader software implement just the part of the spec necessary for the publishers it has contracts with? Or am I misunderstanding the way they "draw from" these multiple standards?

Can a book be entirely represented as a DTBook or not? kovidgoyal says it won't hold presentation information on its own, but if not, then why would it borrow heavily from the HTML spec? Searching around, I just found a Microsoft Word plugin that generates DTBooks. If it's ONLY semantic information, then how would Word generate a book? I wish I had Word now to try it out.

So maybe DTBook is the format I should be storing all my books in. Are there converters that convert to DTBook? From what I see so far, DTBook would be the simplest format to convert FROM, and that's what I'm really concerned with in a "base" format. There's no sense in converting to a format (or spending time adding semantic information to a new file in a particular format) if I can't easily convert from that format.

Converting from a semantically rigid XML file (regardless of how much HTML-like markup it has) will be easier, in my mind, than converting from a mess of real HTML files + hacked-on external XML markup. Parsing a pile of files and trying to guess where the particular author put the semantic information (or maybe even guessing whether it's a DTBook or HTML file holding the content??)... I just won't put in that kind of effort. Whereas parsing a semantic XML file with content that's been marked up a bit with HTML tags... for that I could learn XSL and write a basic transformer in a day. Even if DTBook isn't fully implemented in epub, it seems it would be painless to convert from DTBook to epub without losing any information... from what I see so far, anyway.
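Just to show the kind of "basic transformer" I have in mind, here is a minimal sketch in Python rather than XSL. It assumes a heavily simplified, namespace-free DTBook-like input; the element names (`level1`, `h1`, `p`, `em`) are my guess at a small subset, and the mapping table and function names are made up for illustration. A real DTBook file would need namespace handling and many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a few DTBook-like elements to XHTML elements.
TAG_MAP = {
    "level1": ("div", {"class": "chapter"}),
    "h1": ("h1", {}),
    "p": ("p", {}),
    "em": ("em", {}),
}


def convert(node):
    """Recursively copy a node, renaming its tag according to TAG_MAP."""
    tag, attrs = TAG_MAP.get(node.tag, ("div", {}))  # unknown tags become <div>
    out = ET.Element(tag, dict(attrs))
    out.text = node.text
    out.tail = node.tail
    for child in node:
        out.append(convert(child))
    return out


def dtbook_to_xhtml(source):
    """Turn every <level1> in the source into a chapter <div> inside <body>."""
    root = ET.fromstring(source)
    body = ET.Element("body")
    for level in root.iter("level1"):
        body.append(convert(level))
    return ET.tostring(body, encoding="unicode")


SAMPLE = (
    "<dtbook><book><bodymatter><level1>"
    "<h1>Chapter 1</h1><p>Some <em>marked up</em> text.</p>"
    "</level1></bodymatter></book></dtbook>"
)

if __name__ == "__main__":
    print(dtbook_to_xhtml(SAMPLE))
```

The point is that when the structure lives in one rigid XML tree, the conversion is little more than a lookup table and a tree walk; there is nothing to guess.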
