MobileRead Forums - View Single Post - What format to store books in? What software to read them with?

nairbv · 12-26-2007, 07:11 PM

I'll clarify my requirements a bit...

My feelings on one-file=one-book: I understand that some formats are actually zip files containing a bunch of other files. This is fine; I just don't want to have to think about it excessively. If convertLIT made a zip file (even one without a .zip extension) and there were some reading software that could open that zip file to read or convert the book, great.

If I need to write a script that calls convertLIT, zips the directory, and renames the zip file as somebook.epub or something else, I can handle that. But I'm not sure exactly what the appropriate extension is, why convertLIT doesn't do this for me, what exactly the resulting file type is, or what programs I can use to convert and read that file type in the future.
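To make the zip-and-rename step concrete, here is a minimal sketch in Python. It assumes convertLIT has already exploded the book into a directory; the function name `pack_epub` and the paths are made up for illustration. As far as the OCF container rules go, the one special requirement is that an uncompressed file named `mimetype` comes first in the archive. The sketch does not create `META-INF/container.xml` or check the OPF, so the result is only a proper .epub if those pieces are already in the directory.

```python
import os
import zipfile


def pack_epub(book_dir, epub_path):
    """Zip an exploded book directory into a single file (hypothetical helper).

    epub_path should live outside book_dir, otherwise the archive
    would try to include itself.
    """
    with zipfile.ZipFile(epub_path, "w") as zf:
        # The 'mimetype' entry goes first and is stored without compression
        # (a bare ZipInfo defaults to ZIP_STORED).
        zf.writestr(zipfile.ZipInfo("mimetype"), "application/epub+zip")
        for root, _dirs, files in os.walk(book_dir):
            for name in sorted(files):
                full = os.path.join(root, name)
                # Zip entry names always use forward slashes.
                arc = os.path.relpath(full, book_dir).replace(os.sep, "/")
                if arc == "mimetype":
                    continue  # already written above
                zf.write(full, arc, compress_type=zipfile.ZIP_DEFLATED)
    return epub_path
```

Usage would be something like `pack_epub("somebook_exploded", "somebook.epub")`, run once per directory in a loop for a whole collection.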

kovidgoyal: you say "If you go the extra mile and make your zipped up HTML + OPF files into an .epub" ... so is HTML+OPF the thing that convertLIT makes? Does "going the extra mile" just mean renaming the file as .epub, or is there more involved? Are there programs like convertLIT to do this for pdb/mobi files, HTML, txt, etc.? (Obviously conversion from txt/html/etc. wouldn't automatically add unknown metadata, but maybe it would just create a stub that I could edit later, or maybe it would have command-line options to add the data if I'm converting a directory of books all by the same author.)
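For what the "stub that I could edit later" idea might look like, here is a rough Python sketch that writes a bare-bones OPF package document with Dublin Core title/author fields and a manifest of content files. The function name `make_opf_stub`, the placeholder identifier, and the item ids are all invented for illustration. This is not what any existing converter emits, and the stub leaves out things a complete OPF 2.0 file needs (such as the NCX table of contents).

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"


def make_opf_stub(title, author, html_files,
                  book_id="urn:uuid:REPLACE-ME", language="en"):
    """Return a minimal OPF document as a string (hypothetical helper)."""
    ET.register_namespace("", OPF_NS)
    ET.register_namespace("dc", DC_NS)
    package = ET.Element(
        f"{{{OPF_NS}}}package",
        {"version": "2.0", "unique-identifier": "BookId"},
    )
    # Metadata that would otherwise be lost in a plain txt/html conversion.
    metadata = ET.SubElement(package, f"{{{OPF_NS}}}metadata")
    ET.SubElement(metadata, f"{{{DC_NS}}}title").text = title
    ET.SubElement(metadata, f"{{{DC_NS}}}creator").text = author
    ET.SubElement(metadata, f"{{{DC_NS}}}language").text = language
    ident = ET.SubElement(metadata, f"{{{DC_NS}}}identifier", {"id": "BookId"})
    ident.text = book_id  # placeholder, meant to be edited later
    # Manifest lists the files; spine gives their reading order.
    manifest = ET.SubElement(package, f"{{{OPF_NS}}}manifest")
    spine = ET.SubElement(package, f"{{{OPF_NS}}}spine")
    for i, href in enumerate(html_files, start=1):
        item_id = f"item{i}"
        ET.SubElement(
            manifest, f"{{{OPF_NS}}}item",
            {"id": item_id, "href": href,
             "media-type": "application/xhtml+xml"},
        )
        ET.SubElement(spine, f"{{{OPF_NS}}}itemref", {"idref": item_id})
    return ET.tostring(package, encoding="unicode")
```

For a directory of books all by the same author, a wrapper script could call this once per book with the same `author` argument and a title taken from the file name.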

Plain RTF, TXT, or HTML won't do, because I'm not willing to abandon the information stored in my original files. If I convert a .lit file to .txt or .rtf and delete my .lit file, then in the future when I want to read it, I won't be able to tell my device "go to chapter 23" and have it know what I mean. Pictures will be lost. Author/ISBN/publisher and any other metadata stored in the original file will be lost. When I convert my books, I'm not willing to lose that much data. I think that eventually iTunes / Photo Gallery type software will be out to organize ebooks based on tags, genres, authors, etc. I want to retain the tags so I can use those programs when they're available.

One more thing is that I can't really open every single file and export it individually through BD. I mean, it's hundreds of books. It'd just take too long.

Also, I still just don't really understand why they've created the open format the way they have. Can anyone explain to me why the open standard format uses HTML to store the books, as opposed to just using a simple XML format? What's the advantage of HTML that wouldn't be easier in XML? Display information? Does a book really need that much display information? I don't want a book full of div tags and JavaScript and width="someNumberBiggerThanMyDisplay". I want the text, maybe in a defined encoding like Unicode so I know all my books are the same, and I want something like structural XML to break up chapters, say where pictures go, and add metadata. Is there some advantage or necessity I'm not seeing here? I guess occasionally making little bits of text bold or italic might be an issue, but it's not one that HTML handles wonderfully either.

My understanding of HTML is that it's inherently structural data, with display information hacked on to make it pretty for the web. We try to take out the display information with stuff like CSS. The structural data communicated by HTML is web-specific and not so appropriate for ebooks, hence having to add an XML file to an ebook format. So what do we retain in HTML? Display information? Something it's not that wonderfully suited to, and which isn't incredibly relevant for ebooks anyway?

I AM willing to lose funny unnecessary formatting information that I would presumably select on my device anyway: stuff like font sizes, etc.

I mean, it also seems like it would make sense that every time you convert to the "open-standard-format," the resulting file should be approximately the same. If you have to convert to HTML, then two different conversion programs are going to write very different HTML.

Maybe there are some crazy looking colorful fun kids books that need to look like the original paper book, and so it's easier to just use complete HTML to have that kind of support?

Also, why are there so many acronyms for this thing? It's so confusing... OPF and OPS and EPUB and OEB and OEBPS and IDPF and OCF???? OK, I understand that some might be organizations and some might be file extensions and some might be standards for various files in the zipped-up file, blah blah blah... but c'mon, are they just trying to confuse us? How is the average user supposed to figure out if he has the right kind of file?? And then a directory might be the right kind of "file" too?
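On the "right kind of file" question, a rough check can at least be scripted. The Python sketch below (the name `looks_like_epub` is made up) only tests the outer container: that the file is a zip, that its first entry is a `mimetype` file containing `application/epub+zip`, and that `META-INF/container.xml` is present. It says nothing about whether the contents are valid, and an unzipped directory simply fails the test.

```python
import zipfile


def looks_like_epub(path):
    """Cheap container-level check; not a validator (hypothetical helper)."""
    # Directories, missing files and non-zip files all return False here.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        # The container rules put the 'mimetype' entry first in the archive.
        if not names or names[0] != "mimetype":
            return False
        if zf.read("mimetype").strip() != b"application/epub+zip":
            return False
        # container.xml is what points a reader at the OPF file.
        return "META-INF/container.xml" in names
```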

Sorry for the rant. I'm just trying to understand all this, and maybe I'm misunderstanding something involved?

12-26-2007, 07:11 PM	#18
nairbv Connoisseur Posts: 88 Karma: 15 Join Date: Nov 2007 Device: still looking for an ebook reader device