Old 09-12-2011, 06:40 PM   #166
KevinH
Hi,

Okay, here is what the extra variable-length values in the INDX entries mean.

The first unknown is actually the heading level,
with 0 being top level, 1 indented one level, etc.

The second unknown is actually an offset into the CTOC that describes the kind of entry. For my book this pointed to "cover", "other", "titlepage", "copyright", "part", "chapter", etc

If type = 0x3f:
unknown 1 = heading level (this seems to be a 1)
unknown 2 = kind of entry (offset into CTOC)
unknown 3 = offset into index data which this entry should be listed under (i.e. what it is a sub-entry of)

If type = 0xdf:
unknown 1 = heading level
unknown 2 = kind of entry (offset into CTOC) (in this case a "part")
unknown 3 = first indx entry included under this part
unknown 4 = last indx entry included under this part
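
To make the two layouts above concrete, here is a minimal Python sketch that attaches names to the extra values once the entry type is known. The field names and the `label_extras` helper are hypothetical labels for the "unknowns" listed above, not anything taken from mobiunpack, and only the two types described so far are covered.

```python
# Hypothetical names for the "unknown" values, in the order listed above.
EXTRA_FIELDS = {
    0x3F: ("heading_level", "kind_ctoc_offset", "parent_entry"),
    0xDF: ("heading_level", "kind_ctoc_offset", "first_child", "last_child"),
}


def label_extras(entry_type, extras):
    """Pair each extra variable-length value with its assumed meaning."""
    names = EXTRA_FIELDS.get(entry_type)
    if names is None:
        raise ValueError("no known layout for type 0x%02x" % entry_type)
    if len(names) != len(extras):
        raise ValueError("expected %d extra values, got %d" % (len(names), len(extras)))
    return dict(zip(names, extras))


# Made-up values for a "part" entry whose chapters are entries 3 through 7.
print(label_extras(0xDF, [0, 40, 3, 7]))
```

Any other type raises an error instead of guessing, which fits the current state of knowledge.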


For my ebook each "part" was a "Book 1", "Book 2", etc., and under it were the individual "chapters" that belong to that "part".

Again, hope this helps. I will examine some more books with complex TOCs and try to figure out more.

KevinH

Old 09-12-2011, 07:13 PM   #167
fandrieu
Quote:
Originally Posted by KevinH
I grabbed it and tried it on a bunch of mobis I had
Thanks very much for the effort you put into testing this.

The code is indeed very rough, and you are more likely to get an exception than an NCX.

The "anchor" bit was really just a wild experiment that worked for the book I was working on...

I haven't had much time to read your posts yet, but you are right: the first missing thing is to take the INDX entry type into account, and not just assume 0x0F = chapter as I did to try this out.

BTW, I updated the zip I posted with some "essential" stuff that was missing:
* bail if the INDX entry type is not 0x0F (todo...)
* correctly handle end of CTOC (nul terminated)
* basic checking of the indx_header (wrong in some test files)
* ...

Old 09-12-2011, 08:46 PM   #168
KevinH
Hi fandrieu,

I updated my mobiunpack_test.py to handle all of the INDX entries in my test set of mobis (which is quite limited, actually!).

It simply documents things and prints out everything while trying to decipher INDX1. I did nothing about generating the ncx; there is just some debug output to help with the multi-level ncx stuff when you get around to working on it.

So hopefully you will find this useful when incorporating your fixes and things into a real version.

Take care,

Kevin




PS: I tried this on a few other mobis and it barfed. It seems the record format is not even fully determined by the record type. It appears that the heading level determines whether the parent field is present, only specific record types have "kind" information, and the order of the fields in each record seems to vary by type and heading level.

Arrgghhh! What a mess! Perhaps the Mobi version number might be useful in determining which fields are present for each record type.

So right now I have to read in the record type and heading level to figure out which fields are stored, and even that seems to vary from older mobis to newer mobis.

So my mobiunpack_test.py will only work for very specific cases.



PPS:

I added a newer mobiunpack_test.py (mobiunpack_test3.zip) that seems to decipher the INDX1 information for a wider range of mobi ebooks.

Again feel free to pick and use as you see fit. Hope this helps!

KevinH

Last edited by KevinH; 09-15-2011 at 07:55 PM. Reason: added postscript

Old 09-13-2011, 12:03 PM   #169
kaizoku
Someone should merge these into one release rather than having so many.

Old 09-13-2011, 12:23 PM   #170
DiapDealer
Quote:
Someone should merge these into one release rather than having so many.
There is only one release: v0.31, found in this post (OK, the previous version 0.30 is there as well, but you get my point). The rest of the files posted on this page are strictly experimental at this point.

Old 09-13-2011, 12:35 PM   #171
KevinH
Hi,

Yes, none of my "test" versions, nor even the 31_fand versions, are close to being ready for prime time. They are only a concise way to pass information back and forth (via reading the code) and are for those programmers who might be interested in helping figure out the indx sections to generate an ncx.

If and when a stable working version exists, it will be clearly marked as such and posted as a version 32 or later.

If you want to play around with things, grab my very latest mobiunpack_test.py (see above) and try it on a non-DRM mobi ebook (preferably one with a large multi-level table of contents). When it barfs (and it will!), zip up a log of the program output together with the generated ctoc.dat, indx1.dat and indx0.dat files and post the zip here. We can use your error messages to work out where our interpretation of the indx1 record structure is broken and, hopefully, make it more robust.

KevinH

Old 09-13-2011, 04:17 PM   #172
kaizoku
Warning: Unknown metadata with id 125 found
Warning: Unknown metadata with id 405 found
Warning: Unknown metadata with id 406 found
Warning: Unknown metadata with id 407 found
Warning: Unknown metadata with id 403 found

Hopefully this helps. It only generated one .dat file and lots of .data files.
Attached Files
File Type: zip indx0.dat.zip (798 Bytes, 41 views)

Last edited by kaizoku; 09-13-2011 at 04:27 PM.

Old 09-13-2011, 04:28 PM   #173
KevinH
Quote:
Originally Posted by kaizoku
Is there any tutorial on how to run the test mobiunpack on a Mac without AppleScript support? I think the new Lion comes with Python installed.
Hi,

On a Mac you can use Terminal.app

In a new folder called "test" on your Desktop, copy in mobiunpack_test.py, and a non-drm .mobi ebook file of your choice. Then inside of that "test" folder create an output folder called "out".

Now double-click to run Terminal.app and enter the following commands
(replacing YOUR_MOBI_EBOOK.mobi with the name of the ebook you copied into the test folder):

cd ~/Desktop/test
python ./mobiunpack_test.py YOUR_MOBI_EBOOK.mobi out/ > debug.log


A few different files will be created:
indx0.dat
indx1.dat
ctoc.dat
debug.log

and inside of the out/ directory you should find the ncx, opf, html, etc

Old 09-13-2011, 05:26 PM   #174
KevinH
Hi,

Some more info. If you look in the calibre-0.8.18 source archive (tar.gz), in calibre/src/calibre/ebooks/mobi/writer2/indexer.py, you can see the code that creates the index entries.

Just as siebert suggested, we should be parsing the TAGX entry in INDX0 and deciphering it to find out the various fields that are actually present for each index type.

I am going to study the TAGX object and more specifically the BITMASKS and how they are used to encode values that represent the fields available for each record type for that particular ebook.
Code:
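# Imports this excerpt relies on (not shown in the original post)
from collections import defaultdict
from struct import pack

class TAGX(object): # {{{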

    BITMASKS = {11:0b1}
    BITMASKS.update({x:(1 << i) for i, x in enumerate([1, 2, 3, 4, 5, 21, 22, 23])})
    BITMASKS.update({x:(1 << i) for i, x in enumerate([69, 70, 71, 72, 73])})

    NUM_VALUES = defaultdict(lambda :1)
    NUM_VALUES[11] = 3
    NUM_VALUES[0] = 0

    def __init__(self):
        self.byts = bytearray()

    def add_tag(self, tag):
        buf = self.byts
        buf.append(tag)
        buf.append(self.NUM_VALUES[tag])
        # bitmask
        buf.append(self.BITMASKS[tag] if tag else 0)
        # eof
        buf.append(0 if tag else 1)

    def header(self, control_byte_count):
        header = b'TAGX'
        # table length, control byte count
        header += pack(b'>II', 12+len(self.byts), control_byte_count)
        return header

    @property
    def periodical(self):
        '''
        TAGX block for the Primary index header of a periodical
        '''
        list(map(self.add_tag, (1, 2, 3, 4, 5, 21, 22, 23, 0, 69, 70, 71, 72,
            73, 0)))
        return self.header(2) + bytes(self.byts)

    @property
    def secondary(self):
        '''
        TAGX block for the secondary index header of a periodical
        '''
        list(map(self.add_tag, (11, 0)))
        return self.header(1) + bytes(self.byts)

    @property
    def flat_book(self):
        '''
        TAGX block for the primary index header of a flat book
        '''
        list(map(self.add_tag, (1, 2, 3, 4, 0)))
        return self.header(1) + bytes(self.byts)

...
The class IndexEntry in that file lists many of the same things I was able to pull out in my example cases:

Code:
class IndexEntry(object):

    TAG_VALUES = {
            'offset': 1,
            'size': 2,
            'label_offset': 3,
            'depth': 4,
            'class_offset': 5,
            'secondary': 11,
            'parent_index': 21,
            'first_child_index': 22,
            'last_child_index': 23,
            'image_index': 69,
            'desc_offset': 70,
            'author_offset': 73,
    }
    RTAG_MAP = {v:k for k, v in TAG_VALUES.iteritems()}
And there are 3 routines that look particularly interesting in that file:

Code:
    @property
    def tag_nums(self):
        for i in range(1, 5):
            yield i
        for attr in ('class_offset', 'parent_index', 'first_child_index',
                'last_child_index'):
            if getattr(self, attr) is not None:
                yield self.TAG_VALUES[attr]

    @property
    def entry_type(self):
        ans = 0
        for tag in self.tag_nums:
            ans |= TAGX.BITMASKS[tag]
        return ans

    def attr_for_tag(self, tag):
        return self.RTAG_MAP[tag]
This is probably old hat to siebert but is all new to me so if anyone has any ideas how to properly decipher the "type" value to map it to the fields that are stored there, it would certainly help. Once we have that, it is relatively painless to write a recursive routine to process the parent/child relationships and convert them into a nicely leveled, sorted list for output as a multilevel ncx.
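
The recursive routine mentioned above could look roughly like the following. This is only a sketch under assumptions: each index entry is taken to be a dict with hypothetical keys 'label', 'parent', 'first_child' and 'last_child', the child fields are record numbers into the same list, and the children of an entry form the contiguous range first_child..last_child.

```python
def flatten_toc(entries):
    """Return (level, label) pairs in reading order.

    Assumed entry layout (hypothetical keys): 'label', plus optional
    'parent', 'first_child' and 'last_child' record numbers.
    """
    ordered = []

    def visit(index, level):
        entry = entries[index]
        ordered.append((level, entry["label"]))
        first = entry.get("first_child")
        last = entry.get("last_child")
        if first is not None and last is not None:
            # Children are assumed to be the contiguous range first..last.
            for child in range(first, last + 1):
                visit(child, level + 1)

    # Start from every entry that has no parent, in record order.
    for index, entry in enumerate(entries):
        if entry.get("parent") is None:
            visit(index, 0)
    return ordered


# Made-up index: two "parts" followed by their chapters.
entries = [
    {"label": "Book 1", "first_child": 2, "last_child": 3},
    {"label": "Book 2", "first_child": 4, "last_child": 4},
    {"label": "Chapter 1", "parent": 0},
    {"label": "Chapter 2", "parent": 0},
    {"label": "Chapter 3", "parent": 1},
]
for level, label in flatten_toc(entries):
    print("  " * level + label)
```

The level is derived from the recursion depth rather than read from the depth tag, so the two can be compared as a sanity check on real files.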

My 2 cents,

KevinH

Last edited by KevinH; 09-13-2011 at 05:33 PM.

Old 09-13-2011, 07:13 PM   #175
siebert
Quote:
Originally Posted by KevinH
This is probably old hat to siebert but is all new to me so if anyone has any ideas how to properly decipher the "type" value to map it to the fields that are stored there, it would certainly help.
First of all you have to decode the TAGX section for your index. I've documented that in the Wiki (http://wiki.mobileread.com/wiki/MOBI#TAGX_section).

Then you can decode the index entries with the tag table.

Each entry starts with the control byte(s) (the control byte count is defined in the meta index). Using the bit masks from the tag table you can decode which tags are in that index entry and how many entries of each tag.

A bit mask could theoretically contain more than two bits, but I've seen so far only one and two bit masks. If a two-bit mask is all set to 1, it doesn't mean 4 entries of that tag, but after the control byte(s) is another value defining how many entries of that tag are in the entry.

So for a two-bit mask, the control byte(s) encode 0, 1, 2 or many entries.

The tag table also defines how many values each tag has.

With that information you can get all values from an index entry. If you know the meaning of the tag, you can use the values to get the necessary information.
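
As a rough illustration of the first step (reading the tag table), here is a Python sketch. The 12-byte header layout (the 'TAGX' magic, then a big-endian total length and control byte count) is taken from the calibre TAGX.header code quoted earlier in the thread; the function name `parse_tagx` is made up, and the sample bytes are the three-entry table used in the example that follows.

```python
import struct


def parse_tagx(raw):
    """Split a TAGX block into (control_byte_count, tag_table).

    Each tag table row is (tag, values_per_entry, bitmask, end_flag).
    """
    if raw[:4] != b"TAGX":
        raise ValueError("not a TAGX block")
    # Total length includes the 12 header bytes, as in calibre's writer.
    total_length, control_byte_count = struct.unpack_from(">II", raw, 4)
    table = []
    for offset in range(12, total_length, 4):
        tag, values_per_entry, mask, end_flag = raw[offset:offset + 4]
        table.append((tag, values_per_entry, mask, end_flag))
    return control_byte_count, table


# Sample block: one control byte and three 4-byte rows.
rows = bytes([0x08, 0x01, 0x03, 0x00,
              0x0A, 0x02, 0x04, 0x00,
              0x00, 0x00, 0x00, 0x01])
sample = b"TAGX" + struct.pack(">II", 12 + len(rows), 1) + rows
print(parse_tagx(sample))
```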

Example:

Control byte count is 1. The tag table has three entries:
0x08, 0x01, 0x03, 0x00 (tag 0x08 has one value and the bitmask 0b11)
0x0a, 0x02, 0x04, 0x00 (tag 0x0a has two values and the bitmask 0b100)
0x00, 0x00, 0x00, 0x01 (end of control byte indicator)

If the first byte of an index entry is 0b00000111, we do an AND operation with the first bitmask and see that the result is 0b11, meaning we must read the next byte to get the actual count of tag 0x08 entries. Let this value be 0x05.

Now we do an AND operation with the next mask and get 0b100; shifted down to the mask's lowest bit this is 0b1, so we know that there is one 0x0a entry.

So we've already processed the first two bytes and must now read 5 variable length values for the 5 0x08 tags and 2 variable length values for the one 0x0a tag (as each 0x0a entry contains two values).

If the control byte is 0b00000010, we must read two variable length values for two 0x08 tags.
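
The worked example above can be sketched in Python as follows. Two things here are assumptions rather than facts from this post: the variable-length values are taken to be 7 bits per byte with the high bit set on the last byte, and the extra value read when a multi-bit mask is fully set is treated as an entry count, exactly as described above (worth checking against real files). The names `read_vwi` and `decode_entry` are made up and are not the mobiunpack routines.

```python
def read_vwi(data, pos):
    """Read one variable-length value (assumed: 7 bits per byte, high bit ends it)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            return value, pos


def decode_entry(tag_table, control_byte_count, data, pos=0):
    """Return ({tag: [values]}, next_pos) for one index entry."""
    control = data[pos:pos + control_byte_count]
    pos += control_byte_count
    counts = []
    ctrl_index = 0
    for tag, values_per_entry, mask, end_flag in tag_table:
        if end_flag == 1:
            ctrl_index += 1  # following rows use the next control byte
            continue
        masked = control[ctrl_index] & mask
        if masked == 0:
            continue  # tag not present in this entry
        if masked == mask and bin(mask).count("1") > 1:
            # All bits of a multi-bit mask set: the count follows the control bytes.
            count, pos = read_vwi(data, pos)
        else:
            shift = (mask & -mask).bit_length() - 1  # position of the lowest mask bit
            count = masked >> shift
        counts.append((tag, count, values_per_entry))
    result = {}
    for tag, count, values_per_entry in counts:
        values = []
        for _ in range(count * values_per_entry):
            value, pos = read_vwi(data, pos)
            values.append(value)
        result[tag] = values
    return result, pos


TAG_TABLE = [(0x08, 1, 0x03, 0), (0x0A, 2, 0x04, 0), (0x00, 0, 0x00, 1)]
# Control byte 0b00000111, count 5, five 0x08 values, then two values for one 0x0a entry.
entry = bytes([0x07, 0x85, 0x81, 0x82, 0x83, 0x84, 0x85, 0x90, 0x91])
print(decode_entry(TAG_TABLE, 1, entry))
```

All counts are resolved before any values are read, which matches the order described above: the control byte and the extra count come first, then the values for each tag in table order.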

That's all

I hope it's now clear how to decode an index entry and that I didn't make any mistakes in my description.

As I've said before, the code for this handling is already available in mobiunpack and should be reusable for the ncx index handling.

Ciao,
Steffen

Old 09-13-2011, 10:41 PM   #176
KevinH
Hi siebert,

Thanks! That helps. I can now decipher the TAGX and find the bit masks that are used to encode the record type information. I can guess at what each tag byte means, but that is only a guess. Is there any place that documents the meaning of each tag value, or did you have to reverse engineer them from the kindlegen program?



For the record, here is what we know/guess based on the work done so far:

Code:
Tag     Decimal   Meaning
0x01    01        position in the file for the link destination
0x02    02        length / size
0x03    03        title/label offset into CTOC
0x04    04        depth/level of heading (0 = toplevel, 1 = one level down, etc)
0x05    05        class/kind offset into CTOC
0x15    21        parent record number
0x16    22        first child record number
0x17    23        last child record number
which maps exactly to what calibre uses in its indexer.py:

Code:
class IndexEntry(object):

    TAG_VALUES = {
            'offset': 1,
            'size': 2,
            'label_offset': 3,
            'depth': 4,
            'class_offset': 5,
            'secondary': 11,
            'parent_index': 21,
            'first_child_index': 22,
            'last_child_index': 23,
            'image_index': 69,
            'desc_offset': 70,
            'author_offset': 73,
    }

So I guess we will have to work with that. We can try to modify the code to use your TAGX parsing routine to get the tag values and bit masks and then use those to decipher the "type" entry.
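
As a quick sanity check of that plan, here is a small Python sketch that turns a "type" byte into field names. It assumes the single-bit masks that calibre's TAGX.BITMASKS assigns to tags 1, 2, 3, 4, 5, 21, 22 and 23 (bit 0 upward); for a real book the masks should come from that book's own TAGX, so treat this table as an assumption. The helper name `fields_for_type` is made up.

```python
# Tag order as in calibre's BITMASKS for a book index: bit 0 upward.
TAG_ORDER = [1, 2, 3, 4, 5, 21, 22, 23]
BITMASKS = {tag: 1 << bit for bit, tag in enumerate(TAG_ORDER)}
TAG_NAMES = {
    1: "offset",
    2: "size",
    3: "label_offset",
    4: "depth",
    5: "class_offset",
    21: "parent_index",
    22: "first_child_index",
    23: "last_child_index",
}


def fields_for_type(entry_type):
    """List the fields whose mask bit is set in the entry type byte."""
    return [TAG_NAMES[tag] for tag in TAG_ORDER if entry_type & BITMASKS[tag]]


for entry_type in (0x0F, 0x3F, 0xDF):
    print(hex(entry_type), fields_for_type(entry_type))
```

Under these masks, 0x3f adds depth, class and parent to the basic fields, and 0xdf adds depth, class, first child and last child, which lines up with the "unknown" values worked out by hand earlier in the thread.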

Thanks,

Kevin

Last edited by KevinH; 09-13-2011 at 10:54 PM.

Old 09-13-2011, 10:48 PM   #177
DaleDe
This is great interaction and development. I wonder if the dev hub available here would be better for the purpose.

Old 09-14-2011, 04:50 AM   #178
pdurrant
An interesting idea. I haven't really explored the dev hub.

Old 09-14-2011, 05:41 AM   #179
siebert
Quote:
Originally Posted by KevinH
I can guess at what each tag byte means, but that is only a guess. Is there any place that documents the meaning of each tag value, or did you have to reverse engineer them from the kindlegen program?
The meaning has to be reverse engineered, but that should be easy compared to reverse engineering the index entry structure I described. It took me weeks to figure that out... though now that I know it, it no longer appears to be that difficult.

Ciao,
Steffen

Old 09-14-2011, 06:52 AM   #180
fandrieu
Hi all,

Lots of new discoveries... still a lot of reverse engineering to do...
As siebert pointed out, the TAGX data does indeed seem mandatory for correctly making sense of the INDX.

In the meantime I worked a little on the code, playing with multi-level TOC books
(in the first version I only tried flat TOCs, and I still haven't touched periodicals...).

* On the "making sense of the data" front, I started with the work done by KevinH in his test version, and the only meaningful thing I added is the handling of type 0x7f entries.

They appear to be like 0x1f entries, but of "intermediary" level.

* Other than that, I reformatted my ugly "TEST NCX" block.
It's now separated from the rest in a method called "parseINDX": it's more readable and easier to call elsewhere in unpackBook.
It also allows putting a bunch of "if error return false" checks in a row instead of nesting ifs within ifs...

* I also added a DEBUG_NCX global option: it prints a lot of debug output and does nothing more than parse the NCX.

* Finally, now that multi-level TOCs are more or less parsed, I rewrote the "write the ncx file" code to support that.
A new "sortINDX" method re-orders the raw data into the same "flow" as in the NCX, keeping the "hlvl" info instead of forcing 1 as before...
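
For readers following along, here is a rough Python sketch of how an already sorted list with heading levels can be turned into nested NCX navPoint elements. It is not the code in the attached zip: the `navpoints` function and its (hlvl, label, href) tuples are hypothetical, and only the navPoint, navLabel and content elements of the NCX format are used.

```python
from xml.sax.saxutils import escape, quoteattr


def navpoints(flat):
    """Build nested <navPoint> markup from (hlvl, label, href) tuples in reading order."""
    lines = []
    open_levels = []  # heading levels of the navPoints that are still open
    for order, (hlvl, label, href) in enumerate(flat, start=1):
        # Close every open navPoint that is not an ancestor of this entry.
        while open_levels and open_levels[-1] >= hlvl:
            open_levels.pop()
            lines.append("  " * len(open_levels) + "</navPoint>")
        indent = "  " * len(open_levels)
        lines.append('%s<navPoint id="np%d" playOrder="%d">' % (indent, order, order))
        lines.append("%s  <navLabel><text>%s</text></navLabel>" % (indent, escape(label)))
        lines.append("%s  <content src=%s/>" % (indent, quoteattr(href)))
        open_levels.append(hlvl)
    # Close whatever is still open at the end of the list.
    while open_levels:
        open_levels.pop()
        lines.append("  " * len(open_levels) + "</navPoint>")
    return "\n".join(lines)


# Made-up entries: two parts, the first with one chapter.
toc = [
    (0, "Part 1", "book.html#p1"),
    (1, "Chapter 1 & 2", "book.html#c1"),
    (0, "Part 2", "book.html#p2"),
]
print(navpoints(toc))
```

The stack of open levels is what lets the heading level survive into the output: an entry is nested under the most recent entry with a lower level, and siblings close each other.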



Here is a zip file containing the new file.

I also included the source and mobi of the test book I worked on to find out about the 0x7f entries. It's just a Python-generated dummy book (so no copyright problems) where I can set the TOC depth.
The actual file included has 4 levels with 2 entries in each one; it was compiled with kindlegen and then stripped...

Nice work, good luck for what's left to do...
Attached Files
File Type: zip mobiunpack_testncx.zip (116.8 KB, 46 views)