Old 11-24-2007, 04:37 PM   #1
schmidt349 is on a distinguished road
Posts: 20
Karma: 65
Join Date: Nov 2007
Device: Amazon Kindle
The Mobipocket format: Starring Leonardo diCaprio and Kate Winslet

Long time lurker, first time poster.

So, having just come into possession of an Amazon Kindle, I thought I'd load it with some documents that I have in various text-based formats (DocBook XML being chief among them). I read that it supports the Mobipocket format, and being somewhat adept with Perl, I figured I'd whip up some conversion software with the help of CPAN.

What follows is a tale of horror such as you can't imagine.

Just for a lark, I started with the following:

$ strings KindleUsersGuide.azw

and got this:

Q<P>An overview of all the Amazon Kindle features and how to use them.</P>
Kindle User's Guide
232 /


No worries, I thought, it's not a ZIP or a GZIP archive, so they're probably working their own proprietary mojo with some kind of compressed container format. I can see strings up top that look like they're pretty clearly identifiers of some kind (Kindle_Users_Guide, BOOKMOBI, MOBI, and EXTH). I wasn't all that enthusiastic about reverse-engineering somebody's proprietary binary file format, so I visited Mobipocket's web site to look for a document specification.

That was my first mistake.

Nowhere do the Mobipocket people actually give you the secret sauce for their file format. No C code examples, no header or binary structure descriptions, nothing. Their Windoze-only "Mobipocket Creator," despite being marketed as "free software," is anything but -- I almost wish the FSF had a trademark on that term so they could do a legal beatdown on anyone who calls their software "free" just because it doesn't cost anything.

So, no help whatsoever from the Mobipocket crowd. I did discover, though, while browsing their forums that file extensions "prc" and "pdb" are synonyms for "mobi". So I Googled those, thinking that maybe someone somewhere had already done my homework for me.

I knew something was wrong immediately when I was redirected to a bunch of Palm OS-related websites. Imagine my horror when I found out that the Mobipocket document container is actually a Palm Database file, a monstrosity that stores everything in a bizarre nonstandard record structure instead of a nice friendly POSIX-compliant directory hierarchy. The sauce on the goose: it stores data in big-endian format because it was originally designed to be used by the very first Palm Pilots, which had Motorola 68000-series microprocessors in them. Wow.
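For anyone who wants to see what that record structure looks like in practice, here is a minimal Python sketch of reading a Palm Database header. It assumes the commonly documented PDB layout (a 78-byte big-endian header followed by one 8-byte entry per record); the function name and the returned dictionary keys are made up for illustration, and nothing here comes from Mobipocket itself.

```python
import struct

# Assumed PDB header layout, all big-endian (">"), 78 bytes total:
# name(32s) attributes(H) version(H) created(I) modified(I) backed_up(I)
# modification_number(I) app_info_id(I) sort_info_id(I)
# type(4s) creator(4s) unique_id_seed(I) next_record_list_id(I) num_records(H)
PDB_HEADER = struct.Struct(">32sHHIIIIII4s4sIIH")


def parse_pdb_header(data):
    """Return the database name, type, creator, and record offsets."""
    fields = PDB_HEADER.unpack_from(data, 0)
    # The name is NUL-padded to 32 bytes; keep only the part before the first NUL.
    name = fields[0].split(b"\x00", 1)[0].decode("latin-1")
    num_records = fields[13]

    offsets = []
    pos = PDB_HEADER.size
    for _ in range(num_records):
        # Each record entry: 4-byte file offset, then 1 attribute byte
        # and a 3-byte unique ID (read together here as one 32-bit value).
        offset, _attrs_and_id = struct.unpack_from(">II", data, pos)
        offsets.append(offset)
        pos += 8

    return {
        "name": name,
        "type": fields[9].decode("latin-1"),
        "creator": fields[10].decode("latin-1"),
        "record_offsets": offsets,
    }
```

Under this layout the type and creator fields sit next to each other, which would explain why a string like BOOKMOBI shows up near the top of the file. Each record's bytes run from its offset to the next record's offset (or to end of file for the last one).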

Thankfully CPAN has modules for everything, so I fired up Palm::PDB and Palm::Doc, almost hoping that they wouldn't be able to parse the file. However, they didn't have any problems grokking the file structure, and my worst fears were realized.

These examples of rotten HTML are drawn from the finally decoded content of the Amazon Kindle Manual, which I grabbed off the device.

Let's start at the beginning:

<p width="0em"><font face="serif">Thank you for purchasing Amazon Kindle. You are reading the Welcome section of the <i>Kindle User's Guide</i>. This guide provides an overview of Kindle and highlights a few basic features so you can start reading as quickly as possible.</font></p>

After I came to and peeled my face out of the keyboard I'd just spent five minutes banging it into, I glanced behind myself reflexively, half-expecting to see a blue police box or Billy Pilgrim or some other indication that I had been flung back in time to 1997.

The <font> element is one of those great horrors that we thankfully put to rest with HTML 4.0, XHTML 1.0, and CSS BEFORE THE END OF THE LAST CENTURY. So's the i tag. These are all examples of HTML 3.2-type mixing of document structure and formatting, which isn't supposed to happen under any circumstances in this day and age. You're supposed to use the style attribute along with the generic inline <span> container.

How am I supposed to convert into a format that doesn't even validate as HTML 3.2? How am I expected to use a monstrosity that doesn't conform to ANY of the ebook standards we've established over the last ten years?

The IDPF people have been working on these problems for ages. They came up with a bunch of specifications years ago that would have prevented this nightmare. But this was like the greatest hits of Netscape 2.0. I saw the <center> tag. I saw <li> tags that weren't closed. I saw illegal entities like &. Craploads of tags had the wrong capitalization in their closers (e.g., <h4></H4>). Picture references didn't comply with Dublin Core or anything even close to standard. Hell, I half-expected to run into <blink> and <marquee>.
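To make that list of complaints concrete, here is a rough Python sketch of a toy lint pass that flags the same kinds of problems: presentational tags, closers whose case doesn't match their opener, tags left unclosed, and unescaped ampersands (assuming that is what the "illegal entities like &" above refers to). The function name, the tag list, and the message wording are all hypothetical, and a regex scan like this is nowhere near a real validator.

```python
import re

# Illustrative choices only, not a complete list of either category.
PRESENTATIONAL = {"font", "center", "blink", "marquee"}
VOID = {"br", "hr", "img"}  # elements that never take a closing tag

TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)")
# An ampersand that does not start something shaped like "&name;" or "&#123;".
BARE_AMP_RE = re.compile(r"&(?!#?[A-Za-z0-9]+;)")


def lint_markup(html):
    """Return a list of human-readable complaints about the markup."""
    problems = []
    open_tags = []  # stack of tag names exactly as written in the source

    for match in TAG_RE.finditer(html):
        is_closer = match.group(1) == "/"
        name = match.group(2)
        if not is_closer:
            if name.lower() in PRESENTATIONAL:
                problems.append(f"presentational tag <{name}>")
            if name.lower() not in VOID:
                open_tags.append(name)
            continue
        # Closing tag: find the most recent opener with the same name,
        # compared case-insensitively.
        for i in range(len(open_tags) - 1, -1, -1):
            if open_tags[i].lower() == name.lower():
                if open_tags[i] != name:
                    problems.append(
                        f"case mismatch: <{open_tags[i]}> closed by </{name}>"
                    )
                # Anything opened after it was never closed.
                for unclosed in open_tags[i + 1:]:
                    problems.append(f"unclosed <{unclosed}>")
                del open_tags[i:]
                break

    if BARE_AMP_RE.search(html):
        problems.append("bare ampersand")
    return problems
```

Running it over a fragment like `<ul><li>one<li>two</ul>` reports both `<li>` tags as unclosed, and `<h4>Title</H4>` gets flagged for the mismatched closer.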

I could not believe I was looking at document markup from the user guide to a device that's supposed to be bleeding-edge.

If someone on this forum is from Mobipocket, I want to know how in good conscience you can continue to use a completely proprietary document container and HTML that looks like I wrote it back in 5th grade. To everyone else I recommend in the strongest possible terms that this format be avoided wherever possible.

I really, really hope that Amazon adds .epub support to the Kindle sometime soon. I already tried loading a document in that format on the device but was told it's unsupported. Otherwise you really are going to have to rely on Amazon for all your content, and good luck using it anywhere else even without DRM.