09-12-2011, 05:40 PM | #166 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi,
Okay, here is what the extra variable-length values mean in the INDX. The first unknown is actually the heading level, with 0 being top level, 1 indented one level, etc. The second unknown is actually an offset into the CTOC that describes the kind of entry. For my book this pointed to "cover", "other", "titlepage", "copyright", "part", "chapter", etc.

If type = 0x3f:
unknown 1 = heading level (this seems to be a 1)
unknown 2 = kind of entry (offset into CTOC)
unknown 3 = offset into index data which this entry should be listed under (i.e. what it is a sub-entry to)

If type = 0xdf:
unknown 1 = heading level
unknown 2 = kind of entry (offset into CTOC) (in this case a "part")
unknown 3 = first INDX entry included under this part
unknown 4 = last INDX entry included under this part

For my ebook each "part" was a "Book 1", "Book 2", etc., and under it were the individual "chapters" that belong to that "part". Again, hope this helps. I will examine some more books with complex TOCs and try to figure out more. KevinH |
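[Editor's note: for anyone following along, the fields above are stored as MOBI forward-encoded variable-width integers (7 bits per byte; the byte with the high bit set ends the value), the same scheme mobiunpack already uses. A minimal sketch of reading the fields described above — the field names and layout follow the guesses in this post, not any official spec:

```python
def get_var_width_value(data, pos):
    """Decode a MOBI forward-encoded variable-width integer.
    Each byte contributes 7 bits; the byte with bit 0x80 set is the last."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            break
    return value, pos

def parse_entry_values(data, pos, entry_type):
    """Interpret the trailing values of one INDX entry, per the
    guesses above (heading level, CTOC kind, then type-specific fields)."""
    fields = {}
    fields['heading_level'], pos = get_var_width_value(data, pos)
    fields['ctoc_kind_offset'], pos = get_var_width_value(data, pos)
    if entry_type == 0x3F:
        # what this entry is a sub-entry to
        fields['parent_offset'], pos = get_var_width_value(data, pos)
    elif entry_type == 0xDF:
        # range of entries included under this "part"
        fields['first_child'], pos = get_var_width_value(data, pos)
        fields['last_child'], pos = get_var_width_value(data, pos)
    return fields, pos
```

For example, the byte sequence `81 85 82` with type 0x3f would decode as heading level 1, CTOC kind offset 5, parent offset 2.]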
09-12-2011, 06:13 PM | #167 |
Member
Posts: 11
Karma: 10
Join Date: Sep 2011
Device: kindle 3
|
Thanks very much for the effort you put into testing this.
The code is indeed very rough and there's more chance of getting an exception than an NCX. About the "anchor" bit, it was really just a wild experiment that worked for the book I was working on... I haven't had much time to read your posts yet, but you are right: the first missing thing is to take the INDX entry type into account, and not just assume 0x0F = chapter as I did to try this out.

BTW I updated the zip I posted with some "essential" stuff that was missing:
* bail if the INDX entry type is not 0x0F (todo...)
* correctly handle the end of the CTOC (nul-terminated)
* basic checking of the indx_header (wrong in some test files)
* ... |
09-12-2011, 07:46 PM | #168 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi fandrieu,
I updated my mobiunpack_test.py to handle all of the mobi INDX entries in my test set of mobis (quite limited, actually!). It simply documents things and prints out everything while trying to decipher INDX1. I did nothing with generating the NCX, just some debug output to help with the multi-level NCX stuff when you get around to working on it. So hopefully you will find this useful when incorporating your fixes and things into a real version.

Take care,

Kevin

PS: I tried this on a few other mobis and it barfed. It seems the record format is not even fully determined by the record type. It appears that the heading level determines whether the parent field is there or not, only specific record types have "kind" information, and the order of the fields in each record seems to vary by type and heading level. Arrgghhh! What a mess! Perhaps the Mobi version number might be useful in determining which fields are present for each record type. So right now I have to read in the record type and heading level to figure out what fields are stored, and even that seems to vary from older mobis to newer mobis. So my mobiunpack_test.py will only work for very specific cases.

PPS: I added a newer mobiunpack_test.py (mobiunpack_test3.zip) that seems to work for more different mobi ebooks to decipher the INDX1 information. Again, feel free to pick and use as you see fit. Hope this helps!

KevinH

Last edited by KevinH; 09-15-2011 at 06:55 PM. Reason: added postscript |
09-13-2011, 11:03 AM | #169 |
Junior Member
Posts: 6
Karma: 10
Join Date: Sep 2011
Device: Mac
|
Someone should merge this into one release rather than having so many.
|
09-13-2011, 11:23 AM | #170 | |
Grand Sorcerer
Posts: 27,547
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
|
|
09-13-2011, 11:35 AM | #171 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi,
Yes, none of my "test" versions, nor even the 31_fand versions, are close to being ready for prime time. They are only a concise way to pass information back and forth (via reading the code) and are for those programmers who might be interested in helping figure out the INDX sections to generate an NCX. If and when a stable working version exists, it will be clearly marked as such and posted as version 32 or later.

If you want to play around with things, grab my very latest mobiunpack_test.py (see above) and try it on a non-DRM mobi ebook (preferably one with a large multi-level table of contents). When it barfs (and it will!), post back here a single zip with a log of the program output and the ctoc.dat, indx1.dat, and indx0.dat files it generates, and we can use your error messages to figure out how our interpretation of the INDX1 record structure is broken and hopefully make it more robust.

KevinH |
09-13-2011, 03:17 PM | #172 |
Junior Member
Posts: 6
Karma: 10
Join Date: Sep 2011
Device: Mac
|
Warning: Unknown metadata with id 125 found
Warning: Unknown metadata with id 405 found
Warning: Unknown metadata with id 406 found
Warning: Unknown metadata with id 407 found
Warning: Unknown metadata with id 403 found

Hopefully this helps... it only generated 1 .dat file and lots of .data files.

Last edited by kaizoku; 09-13-2011 at 03:27 PM. |
09-13-2011, 03:28 PM | #173 | |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Quote:
On a Mac you can use Terminal.app. In a new folder called "test" on your Desktop, copy in mobiunpack_test.py and a non-DRM .mobi ebook file of your choice. Then inside that "test" folder create an output folder called "out". Now double-click to run Terminal.app and enter the following commands (replacing YOUR_MOBI_EBOOK.mobi with the name of the ebook you copied into the test folder):

cd ~/Desktop/test
python ./mobiunpack_test.py YOUR_MOBI_EBOOK.mobi out/ > debug.log

A few different files will be created: indx0.dat, indx1.dat, ctoc.dat, and debug.log, and inside the out/ directory you should find the ncx, opf, html, etc. |
|
09-13-2011, 04:26 PM | #174 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi,
Some more info. If you look in the calibre-0.8.18 source tarball, inside calibre/src/calibre/ebooks/mobi/writer2/, at indexer.py, you can see the code that creates the index entries. Just as siebert suggested, we should be parsing the TAGX entry in INDX0 and deciphering it to find out the various fields that are actually present for each index type. I am going to study the TAGX object, and more specifically the BITMASKS and how they are used to encode values that represent the fields available for each record type in that particular ebook. Code:
class TAGX(object): # {{{

    BITMASKS = {11:0b1}
    BITMASKS.update({x:(1 << i) for i, x in enumerate([1, 2, 3, 4, 5, 21, 22, 23])})
    BITMASKS.update({x:(1 << i) for i, x in enumerate([69, 70, 71, 72, 73])})

    NUM_VALUES = defaultdict(lambda :1)
    NUM_VALUES[11] = 3
    NUM_VALUES[0] = 0

    def __init__(self):
        self.byts = bytearray()

    def add_tag(self, tag):
        buf = self.byts
        buf.append(tag)
        buf.append(self.NUM_VALUES[tag])
        # bitmask
        buf.append(self.BITMASKS[tag] if tag else 0)
        # eof
        buf.append(0 if tag else 1)

    def header(self, control_byte_count):
        header = b'TAGX'
        # table length, control byte count
        header += pack(b'>II', 12+len(self.byts), control_byte_count)
        return header

    @property
    def periodical(self):
        ''' TAGX block for the Primary index header of a periodical '''
        list(map(self.add_tag, (1, 2, 3, 4, 5, 21, 22, 23, 0, 69, 70, 71, 72, 73, 0)))
        return self.header(2) + bytes(self.byts)

    @property
    def secondary(self):
        ''' TAGX block for the secondary index header of a periodical '''
        list(map(self.add_tag, (11, 0)))
        return self.header(1) + bytes(self.byts)

    @property
    def flat_book(self):
        ''' TAGX block for the primary index header of a flat book '''
        list(map(self.add_tag, (1, 2, 3, 4, 0)))
        return self.header(1) + bytes(self.byts)

    ...
Code:
class IndexEntry(object):

    TAG_VALUES = {
        'offset': 1,
        'size': 2,
        'label_offset': 3,
        'depth': 4,
        'class_offset': 5,
        'secondary': 11,
        'parent_index': 21,
        'first_child_index': 22,
        'last_child_index': 23,
        'image_index': 69,
        'desc_offset': 70,
        'author_offset': 73,
    }
    RTAG_MAP = {v:k for k, v in TAG_VALUES.iteritems()}
Code:
    @property
    def tag_nums(self):
        for i in range(1, 5):
            yield i
        for attr in ('class_offset', 'parent_index', 'first_child_index', 'last_child_index'):
            if getattr(self, attr) is not None:
                yield self.TAG_VALUES[attr]

    @property
    def entry_type(self):
        ans = 0
        for tag in self.tag_nums:
            ans |= TAGX.BITMASKS[tag]
        return ans

    def attr_for_tag(self, tag):
        return self.RTAG_MAP[tag]

My 2 cents,

KevinH

Last edited by KevinH; 09-13-2011 at 04:33 PM. |
09-13-2011, 06:13 PM | #175 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
Then you can decode the index entries with the tag table. Each entry starts with the control byte(s) (the control byte count is defined in the meta index). Using the bit masks from the tag table you can decode which tags are in that index entry and how many entries there are of each tag.

A bit mask could theoretically contain more than two bits, but so far I've only seen one- and two-bit masks. If a two-bit mask is all set to 1, it doesn't mean 4 entries of that tag; instead, after the control byte(s) there is another value defining how many entries of that tag are in the entry. So the control byte encodes 0, 1, 2, 3 or "many" entries. The tag table also defines how many values each tag has. With that information you can get all values from an index entry. If you know the meaning of the tag, you can use the values to get the necessary information.

Example: The control byte count is 1, and the tag table has three entries:
0x08, 0x01, 0x03, 0x00 (tag 0x08 has one value and the bitmask 0b11)
0x0a, 0x02, 0x04, 0x00 (tag 0x0a has two values and the bitmask 0b100)
0x00, 0x00, 0x00, 0x01 (end of control byte indicator)

If the first byte of an index entry is 0b00000111, we do an AND operation with the first bitmask and see that the result is 0b11, meaning we must read the next byte to get the actual count of tag 0x08 entries. Let this value be 0x05. Now we do an AND operation with the next mask and get the result 0b1, so we know that there is one 0x0a entry. So we've already processed the first two bytes and must now read 5 variable-length values for the 5 tag 0x08 entries and 2 variable-length values for the one 0x0a entry (as each 0x0a entry contains two values).

If the control byte is 0b00000010, we must read two variable-length values for two 0x08 tags. That's all 

I hope it's now clear how to decode an index entry and that I didn't make any mistakes in my description. As I've said before, the code for this handling is already available in mobiunpack and should be reusable for the NCX index handling. 
Ciao, Steffen |
|
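[Editor's note: Steffen's worked example can be sketched in a few lines of Python. The tag-table layout here — a list of (tag, values-per-entry, bitmask) tuples — and the function names are illustrative, not an official API:

```python
def decode_control_byte(control_byte, tag_table):
    """Work out, per tag, how many entries an index entry holds.
    tag_table is a list of (tag, values_per_entry, bitmask) tuples.
    Returns a list of (tag, entry_count, total_values, need_count_byte):
    need_count_byte=True means the real count follows as an extra byte."""
    result = []
    for tag, values_per_entry, mask in tag_table:
        bits = control_byte & mask
        if bits == 0:
            continue  # this tag is absent from the entry
        # shift the masked bits down to the low end to get the raw count
        shift = (mask & -mask).bit_length() - 1
        raw = bits >> shift
        if bits == mask and bin(mask).count('1') > 1:
            # all bits of a multi-bit mask set: count byte follows
            result.append((tag, None, None, True))
        else:
            result.append((tag, raw, raw * values_per_entry, False))
    return result

# Steffen's example tag table
table = [(0x08, 1, 0b11), (0x0a, 2, 0b100)]
print(decode_control_byte(0b00000111, table))
# tag 0x08: count byte must follow; tag 0x0a: 1 entry, 2 values to read
print(decode_control_byte(0b00000010, table))
# tag 0x08: 2 entries, 2 values to read
```

]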
09-13-2011, 09:41 PM | #176 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi siebert,
Thanks! That helps. I can now decipher the TAGX and find the bitmasks that are used to encode the record type information. I can guess at what each tag byte means, but that is only a guess. Is there any place that documents the meaning of each tag value, or did you have to reverse engineer them from the kindlegen program? For the record, here is what we know/guess based on the work done so far: Code:
Tag   Decimal   Meaning
0x01  01        position in the file for the link destination
0x02  02        length / size
0x03  03        title/label offset into CTOC
0x04  04        depth/level of heading (0 = toplevel, 1 = one level down, etc)
0x05  05        class/kind offset into CTOC
0x15  21        parent record number
0x16  22        first child record number
0x17  23        last child record number
Code:
class IndexEntry(object):

    TAG_VALUES = {
        'offset': 1,
        'size': 2,
        'label_offset': 3,
        'depth': 4,
        'class_offset': 5,
        'secondary': 11,
        'parent_index': 21,
        'first_child_index': 22,
        'last_child_index': 23,
        'image_index': 69,
        'desc_offset': 70,
        'author_offset': 73,
    }

So I guess we will have to work with that. We can try to modify the code to use your TAGX parsing routine to get the tag values and bit masks and then use those to decipher the "type" entry.

Thanks,

Kevin

Last edited by KevinH; 09-13-2011 at 09:54 PM. |
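[Editor's note: one observation that ties this together — my own inference from the thread, not something confirmed in it: the mysterious entry "types" seen earlier (0x0f, 0x3f, 0xdf) look exactly like control bytes built by OR-ing calibre's BITMASKS for whichever tags are present in the entry. A quick sketch to check, reproducing the bit assignment from the TAGX class above:

```python
# calibre's TAGX bitmask assignment: tags 1-5, 21, 22, 23 -> bits 0-7
BITMASKS = {x: (1 << i) for i, x in enumerate([1, 2, 3, 4, 5, 21, 22, 23])}

def control_byte_for(tags):
    """OR together the bitmasks of the tags present in an index entry."""
    value = 0
    for tag in tags:
        value |= BITMASKS[tag]
    return value

# offset, size, label, depth -> 0x0f (the plain "chapter" type seen earlier)
print(hex(control_byte_for([1, 2, 3, 4])))             # 0xf
# add class (5) and parent (21) -> 0x3f, the type with a parent field
print(hex(control_byte_for([1, 2, 3, 4, 5, 21])))      # 0x3f
# class plus first/last child (22, 23) -> 0xdf, the "part" type
print(hex(control_byte_for([1, 2, 3, 4, 5, 22, 23])))  # 0xdf
```

This would explain why 0x3f entries carry a parent pointer while 0xdf entries carry first/last child numbers, matching KevinH's earlier observations.]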
09-13-2011, 09:48 PM | #177 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
This is great interaction and development. I wonder if the dev hub available here would be better for the purpose.
|
09-14-2011, 03:50 AM | #178 |
The Grand Mouse 高貴的老鼠
Posts: 71,504
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
An interesting idea. I haven't really explored the dev hub.
|
09-14-2011, 04:41 AM | #179 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
Ciao, Steffen |
|
09-14-2011, 05:52 AM | #180 |
Member
Posts: 11
Karma: 10
Join Date: Sep 2011
Device: kindle 3
|
Hi all,
Lots of new discoveries... still a lot of reverse engineering to do...

As siebert pointed out, the TAGX data indeed seems mandatory to correctly make sense of the INDX. In the meantime I worked a little on the code, playing with multi-level TOC books. (In the first version I only tried flat TOCs; I still haven't touched periodicals...)

* On the "making sense of the data" front, I started with the work done by KevinH in his test version, and the only meaningful thing I added is the handling of type 0x7f entries. They appear to be like 0x1f entries, but of "intermediary" level.
* Other than that I reformatted my ugly "TEST NCX" block. It's now separated from the rest in a method called "parseINDX": it's more readable and easier to call elsewhere in unpackBook. It also allows putting a bunch of "if error return false" in a row instead of nesting ifs and ifs...
* I also added a DEBUG_NCX global option: it prints a lot of debug output and does nothing more than parse the NCX.
* Finally, now that multi-level TOCs are somehow parsed, I rewrote the "write the ncx file" code to support that. A new "sortINDX" method re-orders the raw data in the same "flow" as in the NCX, keeping the "hlvl" info instead of forcing 1 as before...

Here is a zip file containing the new file. I also included the source & mobi of the test book I worked on to find out about the 0x7f entries. It's just a python-generated dummy book (so no copyright problems) where I can set the TOC depth. The included file has 4 levels with 2 entries in each one; it was compiled with kindlegen and then stripped...

Nice work, good luck with what's left to do... |
|
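[Editor's note: for readers curious what a "sortINDX"-style re-ordering could look like, here is an illustrative sketch from the description above — not fandrieu's actual code. A depth-first walk over parent links puts flat INDX entries into NCX document order while preserving their heading level:

```python
def sort_indx(entries):
    """Depth-first re-ordering of flat index entries into NCX flow.
    entries maps index -> {'label': str, 'hlvl': int, 'parent': index or None}.
    Returns the entry indices in nested document order."""
    children = {}
    roots = []
    for idx in sorted(entries):
        parent = entries[idx]['parent']
        if parent is None:
            roots.append(idx)
        else:
            children.setdefault(parent, []).append(idx)

    ordered = []
    def walk(idx):
        ordered.append(idx)
        for child in children.get(idx, []):
            walk(child)
    for root in roots:
        walk(root)
    return ordered

# Dummy two-level TOC: a "part" with two "chapters" under it
toc = {
    0: {'label': 'Book 1',    'hlvl': 0, 'parent': None},
    1: {'label': 'Chapter 1', 'hlvl': 1, 'parent': 0},
    2: {'label': 'Chapter 2', 'hlvl': 1, 'parent': 0},
    3: {'label': 'Book 2',    'hlvl': 0, 'parent': None},
}
print([toc[i]['label'] for i in sort_indx(toc)])
# ['Book 1', 'Chapter 1', 'Chapter 2', 'Book 2']
```

]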