Getting text length from mobi header.

mattst · 03-26-2012, 07:19 AM

Hi,

Is there a simple way to get the text length from a .mobi file's header?

E.g. seek to file position n and read 4 bytes as big endian.

I've been trying to work this out from the Calibre source code, but I have no experience with Python and the source has very few comments, so this is proving to be rather hard.

If anyone knows of any C/C++ code which reads a .mobi file header that would be very helpful.

Many thanks.

KevinH · 03-26-2012, 04:00 PM

Hi,

Check out the much simpler DumpMobiHeader.py:

https://www.mobileread.com/forums/sho...63&postcount=8

I have no idea why only the text length would be useful, since it is the uncompressed length, may include css files and svg snippets (in a KF8 mobi), and the text needs to be processed to get back to what was used as input (for both older mobis and newer KF8 mobis). The actual text is stored in separate sections, with trailing byte sequences, in other sections of the palm database file (a .mobi is a palm database file).

If you can read C/C++ you will have no problem reading Python; the only catch is that Python uses whitespace indentation to indicate what is part of a loop, if statement, or any other block. If you examine DumpMobiHeader.py you will see the following:

Code:

    mobi6_header = {
            'compression_type'  : (0x00, '>H', 2),
            'fill0'             : (0x02, '>H', 2),
            'text_length'       : (0x04, '>L', 4),
            'text_records'      : (0x08, '>H', 2),
            'max_section_size'  : (0x0a, '>H', 2),
            'crypto_type'       : (0x0c, '>H', 2),
            'fill1'             : (0x0e, '>H', 2),
            'magic'             : (0x10, '4s', 4),
            'header_length'     : (0x14, '>L', 4),
            'type'              : (0x18, '>L', 4),
            ...
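
For readers coming from C, here is a minimal sketch of how a table like this can be used. Each entry maps a field name to (offset in the header, struct format, size in bytes); '>H' is a big endian unsigned 16-bit value and '>L' is a big endian unsigned 32-bit value. The read_field helper is hypothetical and is not taken from DumpMobiHeader.py, and only the first four entries of the table are repeated.

```python
import struct

# First entries of the table quoted above: name -> (offset, struct format, size)
mobi6_header = {
    'compression_type': (0x00, '>H', 2),
    'fill0':            (0x02, '>H', 2),
    'text_length':      (0x04, '>L', 4),
    'text_records':     (0x08, '>H', 2),
}


def read_field(header: bytes, name: str) -> int:
    """Decode one named field from the raw bytes of a mobi header record.

    Hypothetical helper for illustration; `header` must start at the first
    byte of the header record (the 'compression_type' field).
    """
    offset, fmt, size = mobi6_header[name]
    (value,) = struct.unpack(fmt, header[offset:offset + size])
    return value
```

In C terms this is the same as reading `size` bytes at `header + offset` and assembling them most significant byte first.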

The 'magic' value is MOBI. So the easiest way to find the text_length, assuming you want nothing else, is to open the ebook in any editor, look for the first string "MOBI" that comes *after* "BOOKMOBI" near the front of the ebook, and then step back exactly 12 bytes (0x10 - 0x04) to find the beginning of the text_length field, which is stored as a big endian sequence of 4 bytes.
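
The same search can be done in a few lines of code. This is only a rough sketch of the manual method described above: the function name is made up, and it assumes the whole file has been read into memory and that the bytes "MOBI" do not happen to occur by chance between "BOOKMOBI" and the real header.

```python
import struct


def text_length_by_magic(data: bytes) -> int:
    """Find the first b'MOBI' after b'BOOKMOBI' and read text_length before it."""
    bookmobi = data.find(b"BOOKMOBI")
    if bookmobi < 0:
        raise ValueError("no BOOKMOBI marker found")
    # Start searching just past the 8-byte BOOKMOBI marker.
    magic = data.find(b"MOBI", bookmobi + len(b"BOOKMOBI"))
    if magic < 12:
        raise ValueError("no MOBI header magic found")
    # 'magic' is at 0x10 in the header and 'text_length' at 0x04,
    # so text_length starts 0x10 - 0x04 = 12 bytes before the magic.
    (text_length,) = struct.unpack_from(">L", data, magic - 12)
    return text_length
```

This is handy for a quick check, but following the record table is more robust than searching for a string.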

A better method would be to play around with the DumpMobiHeader.py and examine actual ebook files in any good hex editor to understand how it works.

The MobileRead Wiki page about the Mobi format will also help.

mattst · 03-26-2012, 05:17 PM

Thanks for the helpful and informative reply KevinH.

I'm writing a Linux command line tool to create .apnx page number files. By examining the Calibre apnx.py source code I saw that to calculate the number of chars per page the Mobi text length header value is divided by the number of pages (from the print edition of the book). It won't map perfectly of course but that doesn't matter - the idea is to get a reasonable approximation.
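
To make the arithmetic concrete, here is a rough sketch of that calculation. It is not the Calibre code: the function name is made up, and turning chars per page into a list of page start positions is my own assumption about how the value would be used.

```python
def page_start_positions(text_length: int, page_count: int) -> list:
    """Approximate the text position at which each print page starts.

    text_length is the value from the mobi header, page_count is the number
    of pages in the print edition. Hypothetical helper for illustration.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    chars_per_page = text_length / page_count
    # Page i (counting from 0) starts roughly i * chars_per_page into the text.
    return [int(i * chars_per_page) for i in range(page_count)]
```

For example, a text length of 1000 and 4 pages gives 250 chars per page and start positions 0, 250, 500 and 750.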

I'm assuming that it does not matter if the mobi file is compressed or not, the mapping of page positions will still be valid. Otherwise the Calibre APNX file generator would not work. Or am I missing something? Anyway I'll be finding out when I test with both compressed and uncompressed mobi files (of the same book).

As far as I can tell from the Calibre header code, to find the start of the mobi header all I need to do is seek 78 bytes into the file and read a 4-byte big endian value; that value is the offset from the beginning of the file to the start of the mobi header. By doing that, then skipping the first 4 bytes of the header ('compression_type' and 'fill0' in your header code above) and reading the next 4 bytes as big endian, my code gets the same 'text_length' values as DumpMobiHeader.py gives me for the 3 mobi files I have tested so far. What I'd like to know is whether reading the header offset from byte 78 will work with ALL mobi files, or whether there are variations. Do you have any idea about that?
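
In Python the steps I describe look roughly like this. It is only a sketch with a made-up function name, it works on the file contents held in memory, and it does no checking that the file really is a mobi.

```python
import struct


def text_length_from_record0(data: bytes) -> int:
    """Read text_length by following the record 0 offset stored at byte 78."""
    # 4-byte big endian offset from the start of the file to the mobi header.
    (record0_offset,) = struct.unpack_from(">L", data, 78)
    # Skip 'compression_type' and 'fill0' (2 + 2 bytes), then read 4 bytes.
    (text_length,) = struct.unpack_from(">L", data, record0_offset + 4)
    return text_length
```

The file contents would come from something like open(path, "rb").read(); in C the equivalent is two fseek/fread pairs.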

Thanks again.

KevinH · 03-26-2012, 06:11 PM

Hi,

Yes, seeking 78 bytes in will get you the palm database index of sections/records and the first record is the mobi header.

This is true for older style mobis but not fully for new style KF8 mobis. Newer style KF8 mobis actually have two mobi header records inside the .mobi file. The first is at record 0 (just like above) and the second follows immediately after the 8 byte record whose value is 'BOUNDARY'. You have to walk through the palm db to find it.

Starting at byte 78 is a table whose entries each hold two 4-byte big endian values (8 bytes per entry). The first value is the offset in the file to the start of that record, and the second is a flag value. So to read the 0'th record of the palm db file you read the 4 byte big endian value which is stored at byte 78, and the next four bytes are its flag value.

To read the starting offset of the ith record you read the 4 byte big endian value which is stored at byte 78 + (8*i) in the file.

The total number of records in the table is given by the two bytes at position 76 in the file (again stored big endian).

So it is easy to read the structure of the palmdb that makes up .mobi files and hop to the start of any record in the palmdb file.
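
A minimal sketch of that table lookup, with made-up function names. The offsets 76 and 78 + (8*i) are as described above; treating a record as running up to the start of the next record (or to the end of the file for the last one) is an assumption on my part about how record lengths are found.

```python
import struct


def record_count(data: bytes) -> int:
    """Number of records: 2 bytes, big endian, at position 76."""
    (count,) = struct.unpack_from(">H", data, 76)
    return count


def record_offset(data: bytes, i: int) -> int:
    """File offset of record i: 4 bytes, big endian, at 78 + 8 * i."""
    (offset,) = struct.unpack_from(">L", data, 78 + 8 * i)
    return offset


def record_bytes(data: bytes, i: int) -> bytes:
    """Raw contents of record i (assumes records are stored back to back)."""
    start = record_offset(data, i)
    if i + 1 < record_count(data):
        end = record_offset(data, i + 1)
    else:
        end = len(data)  # the last record runs to the end of the file
    return data[start:end]
```

With these, record_bytes(data, 0) is the mobi header record, and text_length is the 4 bytes at offset 0x04 inside it.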

So finding record 0 is exactly as you described. Walking the records is easy: look for one that is exactly 8 bytes long and whose value is BOUNDARY, and the very next record is another mobi header (this time a mobi 8 header).

The only other problem case would be Topaz files, which are also stored in palmdbs but do not have a similar header structure. The easiest way to find and ignore them is to see if the file starts with "TPZ".
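
Putting the record walk and the Topaz check together, a rough sketch could look like this. The function names are made up, and it assumes records are stored back to back so that a record's length is the gap to the next record's offset.

```python
import struct


def is_topaz(data: bytes) -> bool:
    """Topaz files are recognised here by the leading bytes 'TPZ'."""
    return data[:3] == b"TPZ"


def mobi_header_offsets(data: bytes) -> list:
    """Return the file offset of record 0 and, if present, of the KF8 header.

    Hypothetical helper: returns [] for Topaz files or an empty record table.
    """
    if is_topaz(data):
        return []
    (count,) = struct.unpack_from(">H", data, 76)
    offsets = [struct.unpack_from(">L", data, 78 + 8 * i)[0]
               for i in range(count)]
    if not offsets:
        return []
    found = [offsets[0]]  # record 0 is always the first mobi header
    # Look for an 8-byte record whose value is BOUNDARY; the record that
    # follows it is the second (mobi 8) header.
    for i in range(1, count - 1):
        start, end = offsets[i], offsets[i + 1]
        if end - start == 8 and data[start:end] == b"BOUNDARY":
            found.append(offsets[i + 1])
            break
    return found
```

If the second header starts with the same fields as the first, its text_length would again be the 4 big endian bytes at header offset + 4, but check that against a real KF8 file.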

Hope this helps,

Kevin

mattst · 03-28-2012, 01:02 PM

Many thanks KevinH - once again a very helpful and informative response. I really appreciate the time you've taken to explain all this to me.

I am unable to find any KF8 or Topaz mobi example files on the web to use for tests. Would you (or anyone else here) happen to have an example file of each which you could upload for me?

Thanks again,

mattst

KevinH · 03-28-2012, 01:11 PM

Hi,

All Topaz .azw files have DRM. You will have to access one of those yourself. Perhaps there is a sample you could download from the Amazon site.

As for the KF8 mobis, they are easy to generate. Simply use Amazon's latest Kindlegen (v2), available as a free download for Windows and Mac, and pass it any epub. It will convert it to a KF8 ebook which has two mobi headers in the same .mobi file. This is the new format for all mobis so that they can be read on all devices: the old mobi header and its data are used by older Kindles, whereas the new mobi header is detected by newer readers (Kindle Fire, Kindle for PC, Kindle for Mac), which will show advanced formatting (fonts, true css, etc.).

KevinH

DaleDe · 03-28-2012, 08:31 PM

There are also some sample books in the wiki, on the Sample Books page.

Dale

mattst · 03-29-2012, 06:31 AM

Thanks again KevinH.

I can forget about Topaz files then; my software will clearly only work on files with no DRM.

If Kindlegen creates KF8 mobis then I'm sorted and can create some for testing with ease.

PS. Thanks for the pointer Dale.
