09-12-2011, 05:40 PM | #166 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi,
Okay, here is what the extra variable-length values mean in the INDX. The first unknown is actually the heading level, with 0 being top level, 1 indented one level, etc. The second unknown is actually an offset into the CTOC that describes the kind of entry. For my book this pointed to "cover", "other", "titlepage", "copyright", "part", "chapter", etc.

If type = 0x3f:
unknown 1 = heading level (this seems to be a 1)
unknown 2 = kind of entry (offset into CTOC)
unknown 3 = offset into index data which this entry should be listed under (i.e. what it is a sub-entry to)

If type = 0xdf:
unknown 1 = heading level
unknown 2 = kind of entry (offset into CTOC) (in this case a "part")
unknown 3 = first INDX entry included under this part
unknown 4 = last INDX entry included under this part

For my ebook each "part" was a "Book 1", "Book 2", etc., and under it were the individual "chapters" that belong to that "part". Again, hope this helps. I will examine some more books with complex TOCs and try to figure out more. KevinH |
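[Editor's note: for anyone following along, the fields above are stored as MOBI forward-encoded variable-width integers (7 bits per byte; the byte with the high bit set ends the value), the same scheme mobiunpack already uses. A minimal sketch of reading the fields described above — the field names and layout follow the guesses in this post, not any official spec:

```python
def get_var_width_value(data, pos):
    """Decode a MOBI forward-encoded variable-width integer.
    Each byte contributes 7 bits; the byte with bit 0x80 set is the last."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            break
    return value, pos

def parse_entry_values(data, pos, entry_type):
    """Interpret the trailing values of one INDX entry, per the
    guesses above (heading level, CTOC kind, then type-specific fields)."""
    fields = {}
    fields['heading_level'], pos = get_var_width_value(data, pos)
    fields['ctoc_kind_offset'], pos = get_var_width_value(data, pos)
    if entry_type == 0x3F:
        # what this entry is a sub-entry to
        fields['parent_offset'], pos = get_var_width_value(data, pos)
    elif entry_type == 0xDF:
        # range of entries included under this "part"
        fields['first_child'], pos = get_var_width_value(data, pos)
        fields['last_child'], pos = get_var_width_value(data, pos)
    return fields, pos
```

For example, the byte sequence `81 85 82` with type 0x3f would decode as heading level 1, CTOC kind offset 5, parent offset 2.]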
09-12-2011, 06:13 PM | #167 |
Member
Posts: 11
Karma: 10
Join Date: Sep 2011
Device: kindle 3
|
Thanks very much for the effort you put into testing this.
The code is indeed very rough and there's more chance of getting an exception than an NCX. About the "anchor" bit, it was really just a wild experiment that worked for the book I was working on... I haven't had much time to read your posts yet, but you are right: the first missing thing is to take the INDX entry type into account, and not just assume 0x0F = chapter as I did to try this out.

BTW I updated the zip I posted with some "essential" stuff that was missing:
* bail if the INDX entry type is not 0x0F (todo...)
* correctly handle the end of the CTOC (nul-terminated)
* basic checking of the indx_header (wrong in some test files)
* ... |
09-12-2011, 07:46 PM | #168 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi fandrieu,
I updated my mobiunpack_test.py to handle all of the mobi INDX entries in my test set of mobis (quite limited, actually!). It simply documents things and prints out everything while trying to decipher INDX1. I did nothing with generating the NCX, just some debug output to help with the multi-level NCX stuff when you get around to working on it. So hopefully you will find this useful when incorporating your fixes and things into a real version.

Take care,

Kevin

PS: I tried this on a few other mobis and it barfed. It seems the record format is not even fully determined by the record type. It appears that the heading level determines whether the parent field is there or not, only specific record types have "kind" information, and the order of the fields in each record seems to vary by type and heading level. Arrgghhh! What a mess! Perhaps the Mobi version number might be useful in determining which fields are present for each record type. So right now I have to read in the record type and heading level to figure out what fields are stored, and even that seems to vary from older mobis to newer mobis. So my mobiunpack_test.py will only work for very specific cases.

PPS: I added a newer mobiunpack_test.py (mobiunpack_test3.zip) that seems to work for more different mobi ebooks to decipher the INDX1 information. Again, feel free to pick and use as you see fit. Hope this helps!

KevinH

Last edited by KevinH; 09-15-2011 at 06:55 PM. Reason: added postscript |
09-13-2011, 11:03 AM | #169 |
Junior Member
Posts: 6
Karma: 10
Join Date: Sep 2011
Device: Mac
|
Someone should merge this into one release rather than having so many.
|
09-13-2011, 11:23 AM | #170 | |
Grand Sorcerer
Posts: 27,547
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
|
|
09-13-2011, 11:35 AM | #171 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi,
Yes, none of my "test" versions, nor even the 31_fand versions, are close to being ready for prime time. They are only a concise way to pass information back and forth (via reading the code) and are for those programmers who might be interested in helping figure out the INDX sections to generate an NCX. If and when a stable working version exists, it will be clearly marked as such and posted as version 32 or later.

If you want to play around with things, grab my very latest mobiunpack_test.py (see above) and try it on a non-DRM mobi ebook (preferably one with a large multi-level table of contents). When it barfs (and it will!), post back here a single zip with a log of the program output and the ctoc.dat, indx1.dat, and indx0.dat files it generates, and we can use your error messages to figure out how our interpretation of the INDX1 record structure is broken and hopefully make it more robust.

KevinH |
09-13-2011, 03:17 PM | #172 |
Junior Member
Posts: 6
Karma: 10
Join Date: Sep 2011
Device: Mac
|
Warning: Unknown metadata with id 125 found
Warning: Unknown metadata with id 405 found
Warning: Unknown metadata with id 406 found
Warning: Unknown metadata with id 407 found
Warning: Unknown metadata with id 403 found

Hopefully this helps... it only generated 1 .dat file and lots of .data files.

Last edited by kaizoku; 09-13-2011 at 03:27 PM. |
09-13-2011, 03:28 PM | #173 | |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Quote:
On a Mac you can use Terminal.app. In a new folder called "test" on your Desktop, copy in mobiunpack_test.py and a non-DRM .mobi ebook file of your choice. Then inside that "test" folder create an output folder called "out". Now double-click to run Terminal.app and enter the following commands (replacing YOUR_MOBI_EBOOK.mobi with the name of the ebook you copied into the test folder):

cd ~/Desktop/test
python ./mobiunpack_test.py YOUR_MOBI_EBOOK.mobi out/ > debug.log

A few different files will be created: indx0.dat, indx1.dat, ctoc.dat, and debug.log, and inside the out/ directory you should find the ncx, opf, html, etc. |
|
09-13-2011, 04:26 PM | #174 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi,
Some more info. If you look in the calibre-0.8.18 source tarball, inside calibre/src/calibre/ebooks/mobi/writer2/, at indexer.py, you can see the code that creates the index entries. Just as siebert suggested, we should be parsing the TAGX entry in INDX0 and deciphering it to find out the various fields that are actually present for each index type. I am going to study the TAGX object, and more specifically the BITMASKS and how they are used to encode values that represent the fields available for each record type in that particular ebook. Code:
class TAGX(object): # {{{

    BITMASKS = {11:0b1}
    BITMASKS.update({x:(1 << i) for i, x in enumerate([1, 2, 3, 4, 5, 21, 22, 23])})
    BITMASKS.update({x:(1 << i) for i, x in enumerate([69, 70, 71, 72, 73])})

    NUM_VALUES = defaultdict(lambda :1)
    NUM_VALUES[11] = 3
    NUM_VALUES[0] = 0

    def __init__(self):
        self.byts = bytearray()

    def add_tag(self, tag):
        buf = self.byts
        buf.append(tag)
        buf.append(self.NUM_VALUES[tag])
        # bitmask
        buf.append(self.BITMASKS[tag] if tag else 0)
        # eof
        buf.append(0 if tag else 1)

    def header(self, control_byte_count):
        header = b'TAGX'
        # table length, control byte count
        header += pack(b'>II', 12+len(self.byts), control_byte_count)
        return header

    @property
    def periodical(self):
        ''' TAGX block for the Primary index header of a periodical '''
        list(map(self.add_tag, (1, 2, 3, 4, 5, 21, 22, 23, 0, 69, 70, 71, 72, 73, 0)))
        return self.header(2) + bytes(self.byts)

    @property
    def secondary(self):
        ''' TAGX block for the secondary index header of a periodical '''
        list(map(self.add_tag, (11, 0)))
        return self.header(1) + bytes(self.byts)

    @property
    def flat_book(self):
        ''' TAGX block for the primary index header of a flat book '''
        list(map(self.add_tag, (1, 2, 3, 4, 0)))
        return self.header(1) + bytes(self.byts)

    ...
Code:
class IndexEntry(object):

    TAG_VALUES = {
        'offset': 1,
        'size': 2,
        'label_offset': 3,
        'depth': 4,
        'class_offset': 5,
        'secondary': 11,
        'parent_index': 21,
        'first_child_index': 22,
        'last_child_index': 23,
        'image_index': 69,
        'desc_offset': 70,
        'author_offset': 73,
    }
    RTAG_MAP = {v:k for k, v in TAG_VALUES.iteritems()}
Code:
    @property
    def tag_nums(self):
        for i in range(1, 5):
            yield i
        for attr in ('class_offset', 'parent_index', 'first_child_index', 'last_child_index'):
            if getattr(self, attr) is not None:
                yield self.TAG_VALUES[attr]

    @property
    def entry_type(self):
        ans = 0
        for tag in self.tag_nums:
            ans |= TAGX.BITMASKS[tag]
        return ans

    def attr_for_tag(self, tag):
        return self.RTAG_MAP[tag]

My 2 cents,

KevinH

Last edited by KevinH; 09-13-2011 at 04:33 PM. |
09-13-2011, 06:13 PM | #175 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
Then you can decode the index entries with the tag table. Each entry starts with the control byte(s) (the control byte count is defined in the meta index). Using the bit masks from the tag table you can decode which tags are in that index entry and how many entries there are of each tag.

A bit mask could theoretically contain more than two bits, but so far I've only seen one- and two-bit masks. If a two-bit mask is all set to 1, it doesn't mean 4 entries of that tag; instead, after the control byte(s) there is another value defining how many entries of that tag are in the entry. So the control byte encodes 0, 1, 2, 3 or "many" entries. The tag table also defines how many values each tag has. With that information you can get all values from an index entry. If you know the meaning of the tag, you can use the values to get the necessary information.

Example: The control byte count is 1, and the tag table has three entries:
0x08, 0x01, 0x03, 0x00 (tag 0x08 has one value and the bitmask 0b11)
0x0a, 0x02, 0x04, 0x00 (tag 0x0a has two values and the bitmask 0b100)
0x00, 0x00, 0x00, 0x01 (end of control byte indicator)

If the first byte of an index entry is 0b00000111, we do an AND operation with the first bitmask and see that the result is 0b11, meaning we must read the next byte to get the actual count of tag 0x08 entries. Let this value be 0x05. Now we do an AND operation with the next mask and get the result 0b1, so we know that there is one 0x0a entry. So we've already processed the first two bytes and must now read 5 variable-length values for the 5 tag 0x08 entries and 2 variable-length values for the one 0x0a entry (as each 0x0a entry contains two values).

If the control byte is 0b00000010, we must read two variable-length values for two 0x08 tags. That's all 

I hope it's now clear how to decode an index entry and that I didn't make any mistakes in my description. As I've said before, the code for this handling is already available in mobiunpack and should be reusable for the NCX index handling. 
Ciao, Steffen |
|
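[Editor's note: Steffen's worked example can be sketched in a few lines of Python. The tag-table layout here — a list of (tag, values-per-entry, bitmask) tuples — and the function names are illustrative, not an official API:

```python
def decode_control_byte(control_byte, tag_table):
    """Work out, per tag, how many entries an index entry holds.
    tag_table is a list of (tag, values_per_entry, bitmask) tuples.
    Returns a list of (tag, entry_count, total_values, need_count_byte):
    need_count_byte=True means the real count follows as an extra byte."""
    result = []
    for tag, values_per_entry, mask in tag_table:
        bits = control_byte & mask
        if bits == 0:
            continue  # this tag is absent from the entry
        # shift the masked bits down to the low end to get the raw count
        shift = (mask & -mask).bit_length() - 1
        raw = bits >> shift
        if bits == mask and bin(mask).count('1') > 1:
            # all bits of a multi-bit mask set: count byte follows
            result.append((tag, None, None, True))
        else:
            result.append((tag, raw, raw * values_per_entry, False))
    return result

# Steffen's example tag table
table = [(0x08, 1, 0b11), (0x0a, 2, 0b100)]
print(decode_control_byte(0b00000111, table))
# tag 0x08: count byte must follow; tag 0x0a: 1 entry, 2 values to read
print(decode_control_byte(0b00000010, table))
# tag 0x08: 2 entries, 2 values to read
```

]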
09-13-2011, 09:41 PM | #176 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi siebert,
Thanks! That helps. I can now decipher the TAGX and find the bitmasks that are used to encode the record type information. I can guess at what each tag byte means, but that is only a guess. Is there any place that documents the meaning of each tag value, or did you have to reverse engineer them from the kindlegen program? For the record, here is what we know/guess based on the work done so far: Code:
Tag   Decimal   Meaning
0x01  01        position in the file for the link destination
0x02  02        length / size
0x03  03        title/label offset into CTOC
0x04  04        depth/level of heading (0 = toplevel, 1 = one level down, etc)
0x05  05        class/kind offset into CTOC
0x15  21        parent record number
0x16  22        first child record number
0x17  23        last child record number
Code:
class IndexEntry(object):

    TAG_VALUES = {
        'offset': 1,
        'size': 2,
        'label_offset': 3,
        'depth': 4,
        'class_offset': 5,
        'secondary': 11,
        'parent_index': 21,
        'first_child_index': 22,
        'last_child_index': 23,
        'image_index': 69,
        'desc_offset': 70,
        'author_offset': 73,
    }

So I guess we will have to work with that. We can try to modify the code to use your TAGX parsing routine to get the tag values and bit masks and then use those to decipher the "type" entry.

Thanks,

Kevin

Last edited by KevinH; 09-13-2011 at 09:54 PM. |
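[Editor's note: one observation that ties this together — my own inference from the thread, not something confirmed in it: the mysterious entry "types" seen earlier (0x0f, 0x3f, 0xdf) look exactly like control bytes built by OR-ing calibre's BITMASKS for whichever tags are present in the entry. A quick sketch to check, reproducing the bit assignment from the TAGX class above:

```python
# calibre's TAGX bitmask assignment: tags 1-5, 21, 22, 23 -> bits 0-7
BITMASKS = {x: (1 << i) for i, x in enumerate([1, 2, 3, 4, 5, 21, 22, 23])}

def control_byte_for(tags):
    """OR together the bitmasks of the tags present in an index entry."""
    value = 0
    for tag in tags:
        value |= BITMASKS[tag]
    return value

# offset, size, label, depth -> 0x0f (the plain "chapter" type seen earlier)
print(hex(control_byte_for([1, 2, 3, 4])))             # 0xf
# add class (5) and parent (21) -> 0x3f, the type with a parent field
print(hex(control_byte_for([1, 2, 3, 4, 5, 21])))      # 0x3f
# class plus first/last child (22, 23) -> 0xdf, the "part" type
print(hex(control_byte_for([1, 2, 3, 4, 5, 22, 23])))  # 0xdf
```

This would explain why 0x3f entries carry a parent pointer while 0xdf entries carry first/last child numbers, matching KevinH's earlier observations.]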
09-13-2011, 09:48 PM | #177 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
This is great interaction and development. I wonder if the dev hub available here would be better for the purpose.
|
09-14-2011, 03:50 AM | #178 |
The Grand Mouse 高貴的老鼠
Posts: 71,504
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
An interesting idea. I haven't really explored the dev hub.
|
09-14-2011, 04:41 AM | #179 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
Ciao, Steffen |
|
09-14-2011, 05:52 AM | #180 |
Member
Posts: 11
Karma: 10
Join Date: Sep 2011
Device: kindle 3
|
Hi all,
Lots of new discoveries... still a lot of reverse engineering to do...

As siebert pointed out, the TAGX data indeed seems mandatory to correctly make sense of the INDX. In the meantime I worked a little on the code, playing with multi-level TOC books. (In the first version I only tried flat TOCs; I still haven't touched periodicals...)

* On the "making sense of the data" front, I started with the work done by KevinH in his test version, and the only meaningful thing I added is the handling of type 0x7f entries. They appear to be like 0x1f entries, but of "intermediary" level.
* Other than that I reformatted my ugly "TEST NCX" block. It's now separated from the rest in a method called "parseINDX": it's more readable and easier to call elsewhere in unpackBook. It also allows putting a bunch of "if error return false" in a row instead of nesting ifs and ifs...
* I also added a DEBUG_NCX global option: it prints a lot of debug output and does nothing more than parse the NCX.
* Finally, now that multi-level TOCs are somehow parsed, I rewrote the "write the ncx file" code to support that. A new "sortINDX" method re-orders the raw data in the same "flow" as in the NCX, keeping the "hlvl" info instead of forcing 1 as before...

Here is a zip file containing the new file. I also included the source & mobi of the test book I worked on to find out about the 0x7f entries. It's just a python-generated dummy book (so no copyright problems) where I can set the TOC depth. The included file has 4 levels with 2 entries in each one; it was compiled with kindlegen and then stripped...

Nice work, good luck with what's left to do... |
|
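[Editor's note: for readers curious what a "sortINDX"-style re-ordering could look like, here is an illustrative sketch from the description above — not fandrieu's actual code. A depth-first walk over parent links puts flat INDX entries into NCX document order while preserving their heading level:

```python
def sort_indx(entries):
    """Depth-first re-ordering of flat index entries into NCX flow.
    entries maps index -> {'label': str, 'hlvl': int, 'parent': index or None}.
    Returns the entry indices in nested document order."""
    children = {}
    roots = []
    for idx in sorted(entries):
        parent = entries[idx]['parent']
        if parent is None:
            roots.append(idx)
        else:
            children.setdefault(parent, []).append(idx)

    ordered = []
    def walk(idx):
        ordered.append(idx)
        for child in children.get(idx, []):
            walk(child)
    for root in roots:
        walk(root)
    return ordered

# Dummy two-level TOC: a "part" with two "chapters" under it
toc = {
    0: {'label': 'Book 1',    'hlvl': 0, 'parent': None},
    1: {'label': 'Chapter 1', 'hlvl': 1, 'parent': 0},
    2: {'label': 'Chapter 2', 'hlvl': 1, 'parent': 0},
    3: {'label': 'Book 2',    'hlvl': 0, 'parent': None},
}
print([toc[i]['label'] for i in sort_indx(toc)])
# ['Book 1', 'Chapter 1', 'Chapter 2', 'Book 2']
```

]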