KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 14

KevinH · 09-15-2011, 12:41 PM

Quote:

Originally Posted by siebert

Hi,

I've looked into the latest source provided by fandrieu and the handling seems to take some shortcuts. I assume that the ncx index also contains an IDXT section, so why don't you use it to find the start and end position of each entry, so you can verify that you've decoded all bytes?

The tag handling code will work only if all bitmasks are single bits. Is this always the case? I would then at least add an assertion which will fail for non-single bitmasks.

Ciao,
Steffen

Hi Steffen,

The test type & mask == mask should work whatever the bitmask is, assuming it is truly a mask (i.e. that all bits set in the mask are also set in the type value), since & is a bitwise operator.

If the tagx bitmask has more than one bit set, then that is captured by the mask.

What am I missing?

siebert · 09-15-2011, 12:50 PM

Quote:

Originally Posted by KevinH

The test type & mask == mask should work whatever the bitmask is, assuming it is truly a mask (i.e. that all bits set in the mask are also set in the type value), since & is a bitwise operator.

[...]

What am I missing?

You're only testing whether all bits are set or not. For a 1-bit mask this is ok, as it can only encode the values 0 and 1.

With a two-bit mask the encoded values are 0, 1, 2 and 3 (but 3 means more than 2 and the real value is stored in a separate byte). The current code doesn't decode these values.

Ciao,
Steffen

KevinH · 09-15-2011, 04:49 PM

Quote:

Originally Posted by siebert

You're only testing whether all bits are set or not. For a 1-bit mask this is ok, as it can only encode the values 0 and 1.

With a two-bit mask the encoded values are 0, 1, 2 and 3 (but 3 means more than 2 and the real value is stored in a separate byte). The current code doesn't decode these values.

Ciao,
Steffen

Hi Steffen,

I am not sure I understand.

If I assume the following (note tag1 has more than 1 bit set in its mask)

tag 1 has bitmask 0x03 and requires 1 value be read in as field 1
tag 2 has bitmask 0x01 and requires 1 value be read in as field 2
tag 3 has bitmask 0x02 and requires 1 value be read in as field 3
tag 4 has bitmask 0x08 and requires 1 value to be read in as field 4

And if type == 0x07:

I would read in the first value as field 1, next value as field 2, next value as field 3 and no further values would be read in for this particular entry since the bitmask & type != bitmask for tag 4.

I think you are saying this is wrong ...

I think you are saying that a bitmask with two bits set means something different from how I am interpreting it?

If so, via a concrete example, could you explain how a bitmask with more than 1 bit set should be interpreted if it is not as I had assumed above.

Thanks!

Kevin

siebert · 09-15-2011, 05:20 PM

Quote:

Originally Posted by KevinH

Hi Steffen,

I am not sure I understand.

If I assume the following (note tag1 has more than 1 bit set in its mask)

tag 1 has bitmask 0x03 and requires 1 value be read in as field 1
tag 2 has bitmask 0x01 and requires 1 value be read in as field 2
tag 3 has bitmask 0x02 and requires 1 value be read in as field 3
tag 4 has bitmask 0x08 and requires 1 value to be read in as field 4

And if type == 0x07:

I would read in the first value as field 1, next value as field 2, next value as field 3 and no further values would be read in for this particular entry since the bitmask & type != bitmask for tag 4.

0x07 = 0b00000111
Tag 1:
0x03 = 0b00000011
0x07 AND 0x03 = 0x03
This would mean that we have 3 values of tag 1. But as I've said, for multi-bit masks a result of all ones (as in this case) means the real number can be anything > 2, and you have to read one byte (or a multibyte value, I don't remember which) to get the real number of tag 1 values.

If type were 0x06 instead of 0x07:
0x06 AND 0x03 = 0x02
This would mean we have 2 values of tag 1

Tag2:
I'm confused. The mask 0x01 collides with the mask 0x03. It's not possible to have a tag with mask 0x01 and another with mask 0x03 in the same control byte.

The control byte works as follows. You have one byte (8 bits) and want to encode the number of tag values for several tags with these 8 bits.

If a tag can occur only once, you need one bit. If it can occur several times, you need more bits. All masks I've seen so far had a maximum of 2 bits.

Let's say you have 3 tags with one bit and one tag with two bits, then you should see the following masks:

0b00000001 = 0x01 for tag1
0b00000010 = 0x02 for tag2
0b00000100 = 0x04 for tag3
0b00011000 = 0x18 for tag4

A control byte of 0x15 would then decode as:
0b00010101

1 * tag 1
0 * tag 2
1 * tag 3
2 * tag 4
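
The decoding above can be sketched in a few lines of Python. This is only an illustration, not the code from mobiunpack: the function names are made up, and it assumes every mask is a contiguous run of bits. It returns the raw field value for each tag and does not handle the "all ones" case, where the real count has to be read from the following data.

```python
def count_trailing_zeros(mask):
    """Return how many zero bits sit below the lowest set bit of mask."""
    shift = 0
    while mask and not (mask & 1):
        mask >>= 1
        shift += 1
    return shift


def decode_control_byte(control, masks):
    """Return the raw field value for each tag mask, in the order given.

    For a multi-bit mask, a field of all ones is only a marker that the
    real count is stored separately; that lookup is NOT done here.
    """
    counts = []
    for mask in masks:
        # Isolate the tag's bits, then shift them down to bit 0.
        counts.append((control & mask) >> count_trailing_zeros(mask))
    return counts


masks = [0x01, 0x02, 0x04, 0x18]   # tag1..tag4 from the example above
counts = decode_control_byte(0x15, masks)
print(counts)
```

With the masks and control byte from the example, the four fields come out as 1, 0, 1 and 2.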

I hope it's now clear what I mean.

Ciao,
Steffen

KevinH · 09-15-2011, 06:52 PM

Hi Steffen,

Yes, thanks! That is much clearer.

An NCX entry can never have more than one parent, can only have one position, one class, one length, and although it could have many children, the children are actually indicated by two different values which provide a range - the first of which is the record number of the first child of this ncx entry and the second of which is the record number of the last child of this ncx entry (as a range).

So it appears that only 1 bit is ever needed. I would guess your inflection dictionaries are much, much more complicated than the ncx entries.

So because of the structure of the fields, I believe that multi-bit masks as you describe below are never used.

And I agree we should at least run a test and warn if multi-bit masks are ever found in the NCX code.
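
Such a check could look roughly like the sketch below. The helper names are hypothetical and the list of masks is assumed to have already been read from the TAGX section; it only relies on the usual trick that mask & (mask - 1) is non-zero exactly when more than one bit is set.

```python
import warnings


def has_multiple_bits(mask):
    """True if more than one bit is set in mask (0 and single bits give False)."""
    return (mask & (mask - 1)) != 0


def check_ncx_masks(masks):
    """Warn about every multi-bit mask and return the list of offenders."""
    offenders = [m for m in masks if has_multiple_bits(m)]
    for m in offenders:
        warnings.warn("multi-bit TAGX mask 0x%02x found in NCX index" % m)
    return offenders


# Hypothetical mask list: three single-bit masks and one two-bit mask.
print(check_ncx_masks([0x01, 0x02, 0x04, 0x18]))
```

A warning rather than an assertion keeps the unpacking going on unusual files while still making the case visible.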

Thanks!

Kevin

Quote:

Let's say you have 3 tags with one bit and one tag with two bits, then you should see the following masks:

0b00000001 = 0x01 for tag1
0b00000010 = 0x02 for tag2
0b00000100 = 0x04 for tag3
0b00011000 = 0x18 for tag4

A control byte of 0x15 would then decode as:
0b00010101

1 * tag 1
0 * tag 2
1 * tag 3
2 * tag 4

I hope it's now clear what I mean.

DiapDealer · 09-15-2011, 07:22 PM

Quote:

Originally Posted by fandrieu

Hehe, i didn't take the time to check your latest fixes (pretty late here), but you seem to have spotted the misplaced outncx=False line

I just wanted to add another bit that troubled me:
I merged the (hopefully fixed) sortINDX & buildNCX functions, removing an "evolutionary" clutch with the added bonus of correct indenting (but didn't take much time to test it though...)

Your latest (mobiunpack_testncx_onemore.zip) still has a bit of a bug if the mobi doesn't have an ncx. You're building the opf so that the spine always indicates the toc="ncx" bit. Like so (line 1540):

Code:

if outncx:
    outncxbasename = os.path.basename(outncx)
    data += '<item id="ncx" media-type="application/x-dtbncx+xml" href="'+outncxbasename+'"></item>\n'
data.append('</manifest>\n<spine toc="ncx">\n<itemref idref="item1"/>\n</spine>\n<tours>\n</tours>\n')

What it needs to be is:

Code:

if outncx:
    outncxbasename = os.path.basename(outncx)
    data += '<item id="ncx" media-type="application/x-dtbncx+xml" href="'+outncxbasename+'"></item>\n'
    data.append('</manifest>\n<spine toc="ncx">\n<itemref idref="item1"/>\n</spine>\n<tours>\n</tours>\n')
else:
    data.append('</manifest>\n<spine>\n<itemref idref="item1"/>\n</spine>\n<tours>\n</tours>\n')

KevinH · 09-15-2011, 08:37 PM

Hi All,

To prevent duplication of effort ...

Does anyone here want to take a shot at refactoring and/or adding classes? I would assume that class-wise we could have an NCX-related class, an OPF-related class, and a Dictionary-related class (the NCX and Dictionary classes could share any TAGX and INDX classes if need be), and try to clean up the code, shrink it wherever possible, and make sure the routines that obviously belong to a class get encapsulated by that class.

Any Takers?

KevinH

siebert · 09-16-2011, 01:32 AM

Quote:

Originally Posted by KevinH

Hi Steffen,

Yes, thanks! That is much clearer.

[...]

So it appears that only 1 bit is ever needed. I would guess your inflection dictionaries are much, much more complicated than the ncx entries.

I think we should follow the DRY (don't repeat yourself) principle.

The getTagMap() function should already do everything needed to decode an index entry (including multi-bit masks), so I suggest using it also for ncx index handling.

Ciao,
Steffen

fandrieu · 09-16-2011, 06:03 AM

Quote:

Originally Posted by pdurrant

Bear in mind that calibre-generated Mobipocket files might not be valid in all instances, since the code was written with reverse-engineered info, not with documentation of the format.

Absolutely. By the way, does anyone know a way to spot calibre-generated books / identify the book generator?

Quote:

Originally Posted by siebert

I assume that the ncx index also contains an IDXT section, so why don't you use it to find the start and end position of each entry, so you can verify that you've decoded all bytes?

I was thinking about what's left to parse, and so the IDXT section at the end of INDX0 came to mind.

I didn't know what was in there when I started this. If I understand you right, it contains the positions of the actual index entries in INDX1, is that it?

If someone hasn't already done it, I'll look into it...

Quote:

Originally Posted by DiapDealer

Your latest (mobiunpack_testncx_onemore.zip) still has a bit of a bug if the mobi doesn't have an ncx. You're building the opf so that the spine always indicates the toc="ncx" bit.

You're absolutely right, and the worst part is I knew about it but never fixed it

That's what happens when you release proof-of-concept code in haste

...

About the TAGX / mask stuff thanks for all your input and shaping this code into something correct, I knew too little about the mobi format to figure that out quickly...

About the refactor bit: go ahead KevinH if you feel like it. Perhaps you could share a skeleton of the class structure here, so that others can comment on / improve it?

KevinH · 09-16-2011, 08:48 AM

Hi,

Quote:

Originally Posted by fandrieu

About the refactor bit: go ahead KevinH if you feel like it. Perhaps you could share a skeleton of the class structure here, so that others can comment on / improve it?

Actually classes have restarted and my teaching/research takes up most of my free time now so I was actually fishing for someone else to take over that duty!!!!

KevinH

fandrieu · 09-16-2011, 09:11 AM

As siebert suggested, I modified the code to use the IDXT data in INDX1 to "find" the entries.

Before we relied exclusively on the TAGX data and assumed that after having parsed an entry we would be correctly positioned at the start of the next one.

Now the offsets found in the IDXT are used to (i hope) accurately find each entry: there should be no more "positioning" bugs.

(note that to do that I chose to pass the whole INDX section to parseINDX1 (before it was only the "navdata" part) so that the offsets in IDXT are used verbatim...)
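
To make the idea concrete, here is a rough Python sketch of locating entries through IDXT. It is not the code from the script. It assumes the IDXT block is the 4-byte marker 'IDXT' followed by one big-endian 16-bit offset per entry, that the offsets are counted from the start of the INDX section, that the IDXT position and entry count are already known from the INDX header, and that the last entry ends where IDXT begins (real files may have padding there). The function names and the test buffer are made up.

```python
import struct


def read_idxt_offsets(indx, idxt_start, entry_count):
    """Return the entry start offsets stored after the 'IDXT' marker."""
    if indx[idxt_start:idxt_start + 4] != b'IDXT':
        raise ValueError('IDXT marker not found at offset %d' % idxt_start)
    # One big-endian unsigned 16-bit offset per entry (assumed layout).
    return list(struct.unpack_from('>%dH' % entry_count, indx, idxt_start + 4))


def entry_spans(offsets, idxt_start):
    """Pair each start offset with its end: the next start, or IDXT itself."""
    ends = offsets[1:] + [idxt_start]
    return list(zip(offsets, ends))


# Synthetic section: 8 dummy header bytes, two fake entries, then IDXT.
header = b'\x00' * 8
entries = b'\x03abc' + b'\x02de'
idxt_start = len(header) + len(entries)
indx = header + entries + b'IDXT' + struct.pack('>2H', 8, 12)

offsets = read_idxt_offsets(indx, idxt_start, 2)
spans = entry_spans(offsets, idxt_start)
for start, end in spans:
    print(start, end, indx[start:end])
```

Knowing both ends of an entry makes it possible to check that the TAGX-driven decoding consumed exactly the bytes between them.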

....

I also fixed "buildNCX" to correctly (i guess) set "playOrder" and "dtb:depth".

....

Also I didn't mention it before, but in my previous version I introduced some code to include the "filepos" found in INDX in the "anchor" algorithm.
Without that, if a link is found only in the NCX and not in the html itself, the corresponding anchor would be missing.

...

@KevinH: I'll try to look into it over the week end if possible....

kovidgoyal · 09-16-2011, 12:36 PM

I just came across this thread. Some tips:

1) Do not use the INDX entries to build an NCX. INDX entries can have a maximum depth of two for books and 3 for periodicals. This is a limitation of the MOBI format. Instead parse the inline TOC, calculate the left indents and reconstruct the NCX from that. See code in mob/reader.py in calibre to do that.

2) If you still want to decompile indx entries and are looking at calibre code to figure out index entries:

a) note that currently indexer.py does not generate depth 2 indx entries for books, primarily because I got tired figuring out the TBS indexing for depth to book nodes.
b) you should look at the code in mobi/debug.py which is designed to decompile arbitrary MOBI files including the index and TBS information. You can run that code with calibre-debug --inspect-mobi filename.mobi

DiapDealer · 09-16-2011, 01:34 PM

Hi fandrieu,

I'm getting consistent results with many, many books with the latest script (from post #206), so thanks for your efforts!

I wanted to point out one potential problem area, though. Line 1036:

Code:

entry = re.sub('^', indent, entry, 0, re.M)

The above code will only work with python 2.7. If you want to include 2.6 and 2.5 users, consider replacing that line with this compatible code:

Code:

entry = re.sub(re.compile('^', re.M), indent, entry, 0)
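
As a small self-contained check of the compiled-pattern approach, the sketch below indents every line of a multi-line string. The sample values for indent and entry are made up; calling sub() on the compiled pattern is equivalent to the line above and avoids passing flags to re.sub altogether.

```python
import re

indent = '  '
entry = '<navPoint>\n<navLabel/>\n</navPoint>'   # sample text, not real output

# With re.M, '^' matches at the start of every line, not just the string.
line_start = re.compile('^', re.M)
indented = line_start.sub(indent, entry)
print(indented)
```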

KevinH · 09-16-2011, 02:17 PM

Hi,

And for my own personal sanity please change one other little thing:

print "Wite NCX"

to

print "Write NCX"

;-)

Thanks!

Kevin

Quote:

Originally Posted by DiapDealer

Hi fandrieu,

I'm getting consistent results with many, many books with the latest script (from post #206), so thanks for your efforts!

I wanted to point out one potential problem area, though. Line 1036:

Code:

entry = re.sub('^', indent, entry, 0, re.M)

The above code will only work with python 2.7. If you want to include 2.6 and 2.5 users, consider replacing that line with this compatible code:

Code:

entry = re.sub(re.compile('^', re.M), indent, entry, 0)

KevinH · 09-16-2011, 02:25 PM

Hi Kovid,

Quote:

Originally Posted by kovidgoyal

I just came across this thread. Some tips:
1) Do not use the INDX entries to build an NCX. INDX entries can have a maximum depth of two for books and 3 for periodicals. This is a limitation of the MOBI format. Instead parse the inline TOC, calculate the left indents and reconstruct the NCX from that. See code in mob/reader.py in calibre to do that.

I think the idea is to eventually use "mobiunpack.py" as a way for people to take mobis generated by KindleGen, unpack them making the fewest changes possible, allow the user to make whatever changes they want, and then pass the whole thing back through KindleGen to get back a mobi.

So I think the idea is to generate the NCX that is stored inside the mobi and pass it back in so that it gets regenerated in the exact same way.

Thus the idea to look at the internal ncx and not parse to create one of our own.

Quote:

2) If you still want to decompile indx entries and are looking at calibre code to figure out index entries:

a) note that currently indexer.py does not generate depth 2 indx entries for books, primarily because I got tired figuring out the TBS indexing for depth to book nodes.

We used your indexer.py code to verify what the tag values are and what they mean (parent, first_child, last_child, class, etc). Our code already handles reading in depth 2 for ebooks (tested with books from KindleGen, etc). But I have not tried it with a periodical at all.

Quote:

b) you should look at the code in mobi/debug.py which is designed to decompile arbitrary MOBI files including the index and TBS information. You can run that code with calibre-debug --inspect-mobi filename.mobi

Great tool. Will do.

Thanks,

KevinH
