MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

fandrieu · 09-14-2011, 05:52 AM

Hi all,

Lots of new discoveries...still a lot of reverse engineering to do...
As siebert pointed out, the TAGX data indeed seems mandatory to correctly make sense of the INDX.

In the meantime I worked a little on the code, playing with books that have multi-level TOCs.
(In the first version I only tried flat TOCs; I still haven't touched periodicals...)

* On the "making sense of the data" front, I started with the work done by KevinH in his test version and the only meaningful thing I added is the handling of type 0x7f entries.

They appear to be like 0x1f entries, but of "intermediary" level.

* Other than that I reformatted my ugly "TEST NCX" block.
It's now separated from the rest in a method called "parseINDX": it's more readable and easier to call elsewhere in unpackBook.
It also lets me put a bunch of "if error return false" checks in a row instead of nesting ifs inside ifs...
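
For anyone wondering what the "if error return false" style looks like, here is a minimal sketch. It is not the code from the zip: the function name parse_indx, the returned dict and the error strings are made up for illustration, and reading the header length as a big-endian 32-bit value at offset 4 is my assumption about the INDX layout.

```python
import struct

def parse_indx(data):
    """Hypothetical guard-clause parser: returns (ok, result_or_reason)."""
    # Each check bails out early, so the "happy path" is never nested.
    if len(data) < 8:
        return False, "section too short"
    if data[:4] != b"INDX":
        return False, "missing INDX magic"
    # Assumption: header length is a big-endian 32-bit value at offset 4.
    (header_len,) = struct.unpack_from(">L", data, 4)
    if header_len > len(data):
        return False, "header length exceeds section size"
    return True, {"header_len": header_len}

if __name__ == "__main__":
    print(parse_indx(b"INDX" + struct.pack(">L", 8)))
    print(parse_indx(b"nope"))
```

Every failure returns immediately with a reason, so adding one more check later is a single extra "if" rather than another level of indentation.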

* I also added a DEBUG_NCX global option: it prints a lot of debug output and does nothing other than parse the NCX.

* Finally, now that multi-level TOCs are somewhat parsed, I rewrote the "write the ncx file" code to support them.
A new "sortINDX" method re-orders the raw data into the same "flow" as in the NCX, keeping the "hlvl" info instead of forcing it to 1 as before...
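
To give an idea of the re-ordering, here is a rough sketch, not the actual sortINDX. It assumes the raw entries arrive as a flat list grouped by level, and that each entry carries a "parent" index (-1 for top-level entries) next to its "hlvl"; the "parent" key and the sort_indx name are assumptions for the example.

```python
def sort_indx(entries):
    """Return entries in depth-first (NCX reading) order, hlvl untouched."""
    # Group child positions under their parent's position in the raw list.
    children = {}
    for idx, entry in enumerate(entries):
        children.setdefault(entry["parent"], []).append(idx)

    ordered = []

    def visit(parent_idx):
        # Emit each child, then immediately descend into its own children.
        for idx in children.get(parent_idx, []):
            ordered.append(entries[idx])
            visit(idx)

    visit(-1)  # -1 marks top-level entries in this sketch
    return ordered

if __name__ == "__main__":
    raw = [
        {"text": "A", "hlvl": 0, "parent": -1},
        {"text": "B", "hlvl": 0, "parent": -1},
        {"text": "A1", "hlvl": 1, "parent": 0},
        {"text": "B1", "hlvl": 1, "parent": 1},
    ]
    for e in sort_indx(raw):
        print("  " * e["hlvl"] + e["text"])
```

The writer can then walk the sorted list once and open or close navPoint elements whenever "hlvl" goes up or down.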

Here is a zip file containing the new file.

I also included the source & mobi of the test book I worked on to find out about the 0x7f entries; it's just a Python-generated dummy book (so no copyright problems) where I can set the TOC depth.
The actual file included has 4 levels with 2 entries in each one; it was compiled with kindlegen and then stripped...
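
If you want to roll your own test book, something along these lines should do. This is only a sketch of the idea, not the generator in the zip: build_toc and render_html are hypothetical names, and I am reading "4 levels with 2 entries in each one" as every entry having 2 children, which is an assumption.

```python
def build_toc(depth, per_level=2, prefix="", level=1):
    """Return (level, title) pairs in reading order for a nested dummy TOC."""
    entries = []
    if level > depth:
        return entries
    for i in range(1, per_level + 1):
        label = f"{prefix}{i}"
        entries.append((level, f"Section {label}"))
        # Recurse so each entry is followed directly by its sub-entries.
        entries.extend(build_toc(depth, per_level, label + ".", level + 1))
    return entries

def render_html(entries):
    """Render the entries as headings h1..hN with a dummy paragraph each."""
    lines = ["<html><body>"]
    for level, title in entries:
        lines.append(f"<h{level}>{title}</h{level}>")
        lines.append(f"<p>Dummy text for {title}.</p>")
    lines.append("</body></html>")
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_html(build_toc(depth=4)))
```

The resulting HTML still needs an OPF and NCX around it before kindlegen will build a TOC from it; that part is left out here.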

Nice work, good luck for what's left to do...