MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

KevinH · 01-27-2012, 12:16 PM

Hi Nick,

I am integrating your changes into my own version of mobi_unpack_update5 (a few minor updates to what DiapDealer had posted to increase robustness when no css is provided, no ncx exists, etc) and I can't figure out the following.

In your split version you use as mobi header offsets:

first_content_index = 192 (or 0xc0 hex)
last_content_index = 194 (or 0xc2 hex)

You never access first_content_index but you do access last_content_index via >H to find the lastimage as follows:

lastimage = getint(datain_rec0,last_content_index,'H')

Yet my updated mobi_unpack code, which is based on testing kindlegen output both when no css is provided (so the rawml never needs to be split, since there are no flow pieces) and when multiple css sheets are provided (giving multiple flow or svg pieces), makes use of the following:

    # need to use the FDST record to find out how to properly unpack
    # the rawML into pieces
    # it is simply a table of start and end locations for each flow piece
    self.fdst = 0xffffffff
    self.fdst, = struct.unpack_from('>L', self.header, 0xc0)
    self.fdstcnt, = struct.unpack_from('>L', self.header, 0xc4)
    # if cnt is 1 or less, fdst section number can be garbage
    if self.fdstcnt <= 1:
        self.fdst = 0xffffffff
    if self.fdst != 0xffffffff:
        self.fdst += self.start

But *only* if this is inside a KF8 Mobi header:

mobi unpack code uses:

# Offset Format Meaning
# ------ ------ -------------
# 0xc0 >L FDST start
# 0xc4 >L Number of records inside FDST
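
For reference, here is the same logic pulled out into a standalone function so it can be tried on a raw header. This is only a sketch: read_fdst_info, header, start and NO_FDST are names made up for illustration, standing in for self.header and self.start in the snippet above.

```python
import struct

NO_FDST = 0xffffffff  # sentinel meaning "no usable FDST section"

def read_fdst_info(header, start=0):
    """Return (fdst_section, fdst_count) from a KF8-style mobi header.

    header: bytes of the header record (hypothetical parameter name)
    start:  section offset added to the raw value, like self.start above
    """
    fdst, = struct.unpack_from('>L', header, 0xc0)
    fdstcnt, = struct.unpack_from('>L', header, 0xc4)
    # if cnt is 1 or less, fdst section number can be garbage
    if fdstcnt <= 1:
        fdst = NO_FDST
    if fdst != NO_FDST:
        fdst += start
    return fdst, fdstcnt
```

With a count of 1 or less the function always reports NO_FDST, whatever bytes happen to sit at 0xc0.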

So it appears to me that 0xc0 is either a variable-length field in a structure for which we have yet to find a proper indicator, or its size and meaning differ between older mobi headers and newer (KF8) mobi headers.

older mobi header

# Offset Format Meaning
# ------ ------ -------------
# 0xc0 >H first_content_index
# 0xc2 >H last_content_index

kf8 mobi header

# Offset Format Meaning
# ------ ------ -------------
# 0xc0 >L FDST start
# 0xc4 >L Number of records inside FDST
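
To make the two readings concrete, here is a rough sketch that interprets the bytes at 0xc0 both ways, following the two tables above. The function name read_c0_fields and the is_kf8 flag are hypothetical; how to decide that a header is KF8 is left out, and the header bytes in the demo are made-up values, not taken from a real file.

```python
import struct

def read_c0_fields(header, is_kf8):
    """Interpret the bytes starting at 0xc0 according to the header kind.

    is_kf8 is a hypothetical flag supplied by the caller.
    """
    if is_kf8:
        # KF8 layout: two 32-bit big-endian values
        fdst_start, fdst_count = struct.unpack_from('>LL', header, 0xc0)
        return {'fdst_start': fdst_start, 'fdst_count': fdst_count}
    # older layout: two 16-bit big-endian values
    first_content, last_content = struct.unpack_from('>HH', header, 0xc0)
    return {'first_content_index': first_content,
            'last_content_index': last_content}

# Made-up header: FDST start = 5, FDST record count = 3
header = bytearray(0xc8)
struct.pack_into('>LL', header, 0xc0, 5, 3)

print(read_c0_fields(header, True))
# {'fdst_start': 5, 'fdst_count': 3}
print(read_c0_fields(header, False))
# {'first_content_index': 0, 'last_content_index': 5}
```

Note that reading a KF8 header with the older layout does not fail: the >H at 0xc2 simply returns the low 16 bits of the FDST start, so a wrong value could easily go unnoticed.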

Is this your understanding as well?

Thanks,

KevinH
