Hi,
Some more info. If you look in the calibre-0.8.18 source tarball, the file calibre/src/calibre/ebooks/mobi/writer2/indexer.py contains the code that creates the index entries.
Just as siebert suggested, we should be parsing the TAGX entry in INDX0 and deciphering it to find out which fields are actually present for each index type.
I am going to study the TAGX object, and more specifically the BITMASKS, to see how they encode which fields are available for each record type in that particular ebook.
Code:
class TAGX(object): # {{{

    BITMASKS = {11:0b1}
    BITMASKS.update({x:(1 << i) for i, x in enumerate([1, 2, 3, 4, 5, 21, 22, 23])})
    BITMASKS.update({x:(1 << i) for i, x in enumerate([69, 70, 71, 72, 73])})

    NUM_VALUES = defaultdict(lambda :1)
    NUM_VALUES[11] = 3
    NUM_VALUES[0] = 0

    def __init__(self):
        self.byts = bytearray()

    def add_tag(self, tag):
        buf = self.byts
        buf.append(tag)
        buf.append(self.NUM_VALUES[tag])
        # bitmask
        buf.append(self.BITMASKS[tag] if tag else 0)
        # eof
        buf.append(0 if tag else 1)

    def header(self, control_byte_count):
        header = b'TAGX'
        # table length, control byte count
        header += pack(b'>II', 12+len(self.byts), control_byte_count)
        return header

    @property
    def periodical(self):
        '''
        TAGX block for the Primary index header of a periodical
        '''
        list(map(self.add_tag, (1, 2, 3, 4, 5, 21, 22, 23, 0, 69, 70, 71, 72,
            73, 0)))
        return self.header(2) + bytes(self.byts)

    @property
    def secondary(self):
        '''
        TAGX block for the secondary index header of a periodical
        '''
        list(map(self.add_tag, (11, 0)))
        return self.header(1) + bytes(self.byts)

    @property
    def flat_book(self):
        '''
        TAGX block for the primary index header of a flat book
        '''
        list(map(self.add_tag, (1, 2, 3, 4, 0)))
        return self.header(1) + bytes(self.byts)

    ...
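Reading that writer backwards gives a first guess at a parser. What follows is only a minimal sketch in Python 3, assuming a TAGX block on disk has exactly the layout the class above emits: the 'TAGX' magic, two big-endian 32-bit ints (block length and control byte count), then 4-byte rows of tag, number of values, bitmask and eof flag. The names parse_tagx and TagxEntry are mine, not calibre's, and the sample bytes are simply what flat_book would produce.

```python
import struct
from collections import namedtuple

# Hypothetical name for one 4-byte row of the TAGX table.
TagxEntry = namedtuple('TagxEntry', 'tag num_values bitmask eof')


def parse_tagx(data):
    """Parse a TAGX block laid out the way the writer above emits it."""
    if data[:4] != b'TAGX':
        raise ValueError('not a TAGX block')
    # After the magic: total block length, then the control byte count.
    length, control_byte_count = struct.unpack_from('>II', data, 4)
    entries = []
    # The header is 12 bytes; every row after it is 4 single bytes.
    for pos in range(12, length, 4):
        entries.append(TagxEntry(*struct.unpack_from('4B', data, pos)))
    return control_byte_count, entries


# The bytes flat_book would produce for tags (1, 2, 3, 4, 0).
rows = bytes([1, 1, 1, 0,  2, 1, 2, 0,  3, 1, 4, 0,  4, 1, 8, 0,  0, 0, 0, 1])
sample = b'TAGX' + struct.pack('>II', 12 + len(rows), 1) + rows

count, table = parse_tagx(sample)
for row in table:
    print(row)
```

Whether files written by other tools follow the same layout is something I still need to check against real INDX0 records.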
The IndexEntry class in that file lists many of the same fields I was able to pull out in my example cases:
Code:
class IndexEntry(object):

    TAG_VALUES = {
            'offset': 1,
            'size': 2,
            'label_offset': 3,
            'depth': 4,
            'class_offset': 5,
            'secondary': 11,
            'parent_index': 21,
            'first_child_index': 22,
            'last_child_index': 23,
            'image_index': 69,
            'desc_offset': 70,
            'author_offset': 73,
    }
    RTAG_MAP = {v:k for k, v in TAG_VALUES.iteritems()}
And there are 3 routines that look particularly interesting in that file:
Code:
    @property
    def tag_nums(self):
        for i in range(1, 5):
            yield i
        for attr in ('class_offset', 'parent_index', 'first_child_index',
                'last_child_index'):
            if getattr(self, attr) is not None:
                yield self.TAG_VALUES[attr]

    @property
    def entry_type(self):
        ans = 0
        for tag in self.tag_nums:
            ans |= TAGX.BITMASKS[tag]
        return ans

    def attr_for_tag(self, tag):
        return self.RTAG_MAP[tag]
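Since entry_type is just the OR of the bitmasks of the tags that are present, the reverse direction should be an AND test of each TAGX row against the control byte. Here is a rough Python 3 sketch of that idea. It rests on an assumption I have not verified: that each eof row in the TAGX table closes the group of tags belonging to one control byte, which would explain why the periodical table has two eof rows and a control byte count of 2. The function name fields_present and the sample control bytes are made up for illustration; the tag names are copied from TAG_VALUES above.

```python
# Reverse of TAG_VALUES from the post; tags 71 and 72 have no name there.
TAG_NAMES = {
    1: 'offset', 2: 'size', 3: 'label_offset', 4: 'depth',
    5: 'class_offset', 11: 'secondary', 21: 'parent_index',
    22: 'first_child_index', 23: 'last_child_index',
    69: 'image_index', 70: 'desc_offset', 73: 'author_offset',
}


def fields_present(table, control_bytes):
    """Return the names of the fields flagged in the given control bytes.

    table is a list of (tag, num_values, bitmask, eof) rows and
    control_bytes is a sequence of ints, one per control byte.
    """
    present = []
    byte_index = 0
    for tag, num_values, bitmask, eof in table:
        if eof:
            # Assumption: an eof row moves on to the next control byte.
            byte_index += 1
            continue
        if control_bytes[byte_index] & bitmask:
            # Fall back to the raw tag number when no name is known.
            present.append(TAG_NAMES.get(tag, tag))
    return present


# Rebuild the periodical table the way add_tag would write it.
table = [(t, 1, 1 << i, 0) for i, t in enumerate([1, 2, 3, 4, 5, 21, 22, 23])]
table.append((0, 0, 0, 1))
table += [(t, 1, 1 << i, 0) for i, t in enumerate([69, 70, 71, 72, 73])]
table.append((0, 0, 0, 1))

# Invented control bytes: tags 1-4 plus 21 in the first, tag 70 in the second.
print(fields_present(table, [0x2F, 0x02]))
```

The grouping matters because bit 0 is shared by tags 1, 11 and 69, so a bitmask only makes sense relative to its own control byte.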
This is probably old hat to siebert but it is all new to me, so if anyone has any ideas on how to properly decipher the "type" value and map it to the fields that are stored in each entry, it would certainly help. Once we have that, it should be relatively painless to write a recursive routine that walks the parent / child relationships and converts them into a nicely leveled and sorted list for output as a multilevel ncx.
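To show what I mean by the recursive routine, here is a small Python 3 sketch. It assumes the index entries have already been decoded into a list of dicts, and that first_child_index and last_child_index are inclusive positions in that same list; both are my assumptions, not something I have confirmed from real files. The function names and the sample entries are invented for the example.

```python
def walk(entries, index, level, out):
    """Append entry `index` and all of its descendants to out, depth first."""
    entry = entries[index]
    out.append((level, entry['label']))
    first = entry.get('first_child_index')
    last = entry.get('last_child_index')
    if first is not None and last is not None:
        # Assumption: children occupy the inclusive range first..last.
        for child in range(first, last + 1):
            walk(entries, child, level + 1, out)


def leveled_list(entries):
    """Return (level, label) pairs, starting from entries with no parent."""
    out = []
    for i, entry in enumerate(entries):
        if entry.get('parent_index') is None:
            walk(entries, i, 0, out)
    return out


# Invented sample: one periodical, two sections, three articles.
entries = [
    {'label': 'Periodical', 'first_child_index': 1, 'last_child_index': 2},
    {'label': 'Section A', 'parent_index': 0,
     'first_child_index': 3, 'last_child_index': 4},
    {'label': 'Section B', 'parent_index': 0,
     'first_child_index': 5, 'last_child_index': 5},
    {'label': 'Article A1', 'parent_index': 1},
    {'label': 'Article A2', 'parent_index': 1},
    {'label': 'Article B1', 'parent_index': 2},
]

for level, label in leveled_list(entries):
    print('  ' * level + label)
```

The output follows the order of the child ranges, so each section is followed directly by its own articles, which is the nesting a multilevel ncx needs.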
My 2 cents,
KevinH