MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

KevinH · 09-13-2011, 04:26 PM

Hi,

Some more info. In the calibre-0.8.18 source archive (tar gzip), the file indexer.py inside calibre/src/calibre/ebooks/mobi/writer2/ contains the code that creates the index entries.

Just as siebert suggested, we should be parsing the TAGX entry in INDX0 and deciphering it to find out the various fields that are actually present for each index type.

I am going to study the TAGX object, and more specifically its BITMASKS, to see how they encode which fields are available for each record type in a particular ebook.

Code:

class TAGX(object): # {{{

    BITMASKS = {11:0b1}
    BITMASKS.update({x:(1 << i) for i, x in enumerate([1, 2, 3, 4, 5, 21, 22, 23])})
    BITMASKS.update({x:(1 << i) for i, x in enumerate([69, 70, 71, 72, 73])})

    NUM_VALUES = defaultdict(lambda :1)
    NUM_VALUES[11] = 3
    NUM_VALUES[0] = 0

    def __init__(self):
        self.byts = bytearray()

    def add_tag(self, tag):
        buf = self.byts
        buf.append(tag)
        buf.append(self.NUM_VALUES[tag])
        # bitmask
        buf.append(self.BITMASKS[tag] if tag else 0)
        # eof
        buf.append(0 if tag else 1)

    def header(self, control_byte_count):
        header = b'TAGX'
        # table length, control byte count
        header += pack(b'>II', 12+len(self.byts), control_byte_count)
        return header

    @property
    def periodical(self):
        '''
        TAGX block for the Primary index header of a periodical
        '''
        list(map(self.add_tag, (1, 2, 3, 4, 5, 21, 22, 23, 0, 69, 70, 71, 72,
            73, 0)))
        return self.header(2) + bytes(self.byts)

    @property
    def secondary(self):
        '''
        TAGX block for the secondary index header of a periodical
        '''
        list(map(self.add_tag, (11, 0)))
        return self.header(1) + bytes(self.byts)

    @property
    def flat_book(self):
        '''
        TAGX block for the primary index header of a flat book
        '''
        list(map(self.add_tag, (1, 2, 3, 4, 0)))
        return self.header(1) + bytes(self.byts)

...
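Going the other way (reading a TAGX block instead of writing one) might look roughly like the sketch below. It only assumes the layout that header() and add_tag() above produce: a 12-byte header followed by 4-byte entries of tag, number of values, bitmask and end-of-control-byte flag. The name parse_tagx is my own and is not anything from calibre.

```python
import struct

def parse_tagx(data):
    """Split a TAGX block into (control_byte_count, list of 4-tuples).

    Hypothetical helper: assumes the layout written by TAGX.header()
    and TAGX.add_tag() in the excerpt above, nothing more.
    """
    if data[:4] != b'TAGX':
        raise ValueError('not a TAGX block')
    # Mirrors header(): total table length, then control byte count.
    length, control_byte_count = struct.unpack_from('>II', data, 4)
    entries = []
    # Each entry is 4 bytes: tag, num_values, bitmask, eof flag.
    for pos in range(12, length, 4):
        entries.append(struct.unpack_from('4B', data, pos))
    return control_byte_count, entries

# Hand-built block matching what flat_book should emit:
# tags 1-4 with masks 1, 2, 4, 8, then the end marker (0, 0, 0, 1).
body = bytes(bytearray([1, 1, 1, 0,  2, 1, 2, 0,  3, 1, 4, 0,
                        4, 1, 8, 0,  0, 0, 0, 1]))
block = b'TAGX' + struct.pack('>II', 12 + len(body), 1) + body
control_bytes, table = parse_tagx(block)
```

The resulting table is what we would need to interpret the control byte of each index entry, rather than hard-coding the field list per book type.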

The class IndexEntry in that file lists many of the same things I was able to pull out in my example cases:

Code:

class IndexEntry(object):

    TAG_VALUES = {
            'offset': 1,
            'size': 2,
            'label_offset': 3,
            'depth': 4,
            'class_offset': 5,
            'secondary': 11,
            'parent_index': 21,
            'first_child_index': 22,
            'last_child_index': 23,
            'image_index': 69,
            'desc_offset': 70,
            'author_offset': 73,
    }
    RTAG_MAP = {v:k for k, v in TAG_VALUES.iteritems()}

And there are 3 routines that look particularly interesting in that file:

Code:

    @property
    def tag_nums(self):
        for i in range(1, 5):
            yield i
        for attr in ('class_offset', 'parent_index', 'first_child_index',
                'last_child_index'):
            if getattr(self, attr) is not None:
                yield self.TAG_VALUES[attr]

    @property
    def entry_type(self):
        ans = 0
        for tag in self.tag_nums:
            ans |= TAGX.BITMASKS[tag]
        return ans

    def attr_for_tag(self, tag):
        return self.RTAG_MAP[tag]
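
Since entry_type above just ORs together the bitmask of every tag that is present, reversing it could be as simple as testing each mask in turn. This is only a sketch under the assumption that every mask is a single bit and every tag carries one value, which is what the writer code above emits for the first control byte; files produced by other tools may not follow that. The names PRIMARY_MASKS, TAG_NAMES and fields_for_type are made up for illustration.

```python
# Tag numbers and masks as generated by TAGX.BITMASKS for the first
# control byte (tags 1-5 and 21-23). Assumed, not read from a real file.
PRIMARY_MASKS = [(1, 0x01), (2, 0x02), (3, 0x04), (4, 0x08),
                 (5, 0x10), (21, 0x20), (22, 0x40), (23, 0x80)]

# Names taken from IndexEntry.TAG_VALUES in the excerpt above.
TAG_NAMES = {1: 'offset', 2: 'size', 3: 'label_offset', 4: 'depth',
             5: 'class_offset', 21: 'parent_index',
             22: 'first_child_index', 23: 'last_child_index'}

def fields_for_type(entry_type, masks=PRIMARY_MASKS):
    """Return the field names whose bitmask is set in entry_type."""
    return [TAG_NAMES[tag] for tag, mask in masks if entry_type & mask]

print(fields_for_type(0x0F))
# ['offset', 'size', 'label_offset', 'depth']
print(fields_for_type(0x6F))
# ['offset', 'size', 'label_offset', 'depth', 'parent_index', 'first_child_index']
```

In a real reader the mask list would come from the parsed TAGX table of the book rather than from a constant.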

This is probably old hat to siebert, but it is all new to me. If anyone has ideas on how to properly decipher the "type" value and map it to the fields stored in each entry, it would certainly help. Once we have that, it should be relatively painless to write a recursive routine that processes the parent / child relationships and converts them into a nicely leveled, sorted list for output as a multilevel ncx.
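
To make the recursive idea concrete, here is a rough sketch. It assumes each decoded entry is a dict with a 'label' key (my invention) plus optional parent_index, first_child_index and last_child_index, and that the children of an entry occupy the contiguous index range first..last inclusive. That last point is a guess on my part and still needs checking against real files.

```python
def walk(entries, index, level=0, out=None):
    """Depth-first walk recording (level, label) for an entry and its children."""
    if out is None:
        out = []
    entry = entries[index]
    out.append((level, entry['label']))
    first = entry.get('first_child_index')
    last = entry.get('last_child_index')
    if first is not None and last is not None:
        # Assumption: children are the contiguous range first..last.
        for child in range(first, last + 1):
            walk(entries, child, level + 1, out)
    return out

def flatten(entries):
    """Walk every top-level entry (one without a parent) in index order."""
    out = []
    for i, entry in enumerate(entries):
        if entry.get('parent_index') is None:
            walk(entries, i, 0, out)
    return out

# Made-up sample data: two sections, each with chapters stored after them.
sample = [
    {'label': 'Part 1', 'first_child_index': 2, 'last_child_index': 3},
    {'label': 'Part 2', 'first_child_index': 4, 'last_child_index': 4},
    {'label': 'Ch 1', 'parent_index': 0},
    {'label': 'Ch 2', 'parent_index': 0},
    {'label': 'Ch 3', 'parent_index': 1},
]
levels = flatten(sample)
# [(0, 'Part 1'), (1, 'Ch 1'), (1, 'Ch 2'), (0, 'Part 2'), (1, 'Ch 3')]
```

The (level, label) pairs come out in reading order, so nesting navPoints for the ncx would just be a matter of opening and closing elements as the level changes.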

My 2 cents,

KevinH