MobileRead Forums - View Single Post

pdurrant · 06-16-2011, 04:03 PM

Quote:

Originally Posted by osnova

By the way, my understanding is that mobi is not just an archive or a package, it is a compiled file. All the "unpackers" are based on reverse engineering of the mobi format and do not produce a 1-to-1 correspondence with the original html source file that was used to create the mobi file. So, you may lose some data if you just unpack the mobi (e.g., dictionary tags will be lost, link anchors will be renamed). I may be wrong, but I haven't yet seen tools that keep all this data.

For non-dictionary Mobipocket, it's pretty good. The HTML export is only tweaked enough to fix the links in the HTML and to add character set metadata. If you look at the Python code, there are some commented-out bits if you really want to extract the exact raw text.

There are some problems with the metadata. Specifically, the code currently only handles one instance of each EXTH type, while there can be more than one, e.g. more than one author.
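
To illustrate the multiple-instance point, here is a minimal sketch of collecting EXTH records into lists keyed by type, so that repeated types are all kept. It is not the unpacker's actual code. It assumes the record layout as I understand it from the community documentation of the format: an "EXTH" magic, a 32-bit length, a 32-bit record count, then records that each carry a 32-bit type, a 32-bit length that includes the 8-byte record header, and the data, all big-endian. The names parse_exth and make_record are made up for the example, and type 100 is assumed to be the author field.

```python
import struct
from collections import defaultdict
from typing import Dict, List


def parse_exth(exth: bytes) -> Dict[int, List[bytes]]:
    """Collect every EXTH record, keeping repeated types in a list."""
    if exth[:4] != b"EXTH":
        raise ValueError("not an EXTH block")
    # Assumed header layout: magic, total length, record count (big-endian).
    _length, count = struct.unpack_from(">II", exth, 4)
    records = defaultdict(list)
    pos = 12
    for _ in range(count):
        rec_type, rec_len = struct.unpack_from(">II", exth, pos)
        if rec_len < 8:
            raise ValueError("record length smaller than its own header")
        # rec_len is assumed to include the 8-byte type/length header.
        records[rec_type].append(exth[pos + 8 : pos + rec_len])
        pos += rec_len
    return dict(records)


def make_record(rec_type: int, data: bytes) -> bytes:
    """Hypothetical helper: build one record for the demo below."""
    return struct.pack(">II", rec_type, len(data) + 8) + data


# Synthetic EXTH block with two author records (assumed type 100).
body = (
    make_record(100, b"Author One")
    + make_record(100, b"Author Two")
    + make_record(101, b"Some Publisher")
)
sample = b"EXTH" + struct.pack(">II", 12 + len(body), 3) + body

authors = parse_exth(sample)[100]
```

Appending to a list per type, rather than assigning into a plain dict, is what keeps the second author from overwriting the first.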

But yes, it's a compiled file, not just a zipped-up archive like ePub. The current tools get you something that you can recompile with KindleGen with minimal metadata loss.