Old 09-15-2011, 12:41 PM   #196
KevinH
Sigil Developer
Posts: 7,630
Karma: 5433388
Join Date: Nov 2009
Device: many
Quote:
Originally Posted by siebert View Post
Hi,

I've looked into the latest source provided by fandrieu, and the handling seems to take some shortcuts. I assume that the ncx index also contains an IDXT section; why don't you use it to find the start and end position of each entry, so you can verify that you've decoded all bytes?

The tag handling code will work only if all bitmasks are single bits. Is this always the case? I would then at least add an assertion which will fail for non-single bitmasks.

Ciao,
Steffen
Hi Steffen,

The test type & mask == mask should work whatever the bitmask is, assuming it is truly a mask (i.e. all bits set in the mask are also set in the type value), since & is a bitwise operator.

If the tagx bitmask has more than one bit set, then that is captured by the mask.

What am I missing?
Old 09-15-2011, 12:50 PM   #197
siebert
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
Quote:
Originally Posted by KevinH View Post
The test type & mask == mask should work whatever the bitmask is, assuming it is truly a mask (i.e. all bits set in the mask are also set in the type value), since & is a bitwise operator.

[...]

What am I missing?
You're only testing whether all bits are set or not. For a 1-bit mask this is ok, as it can only encode the values 0 and 1.

With a two-bit mask the encoded values are 0, 1, 2 and 3 (but 3 means more than 2 and the real value is stored in a separate byte). The current code doesn't decode these values.
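
To make the difference concrete, here is a minimal Python sketch (not code from mobiunpack; the helper names are made up for illustration) that contrasts the all-bits-set test with extracting the value stored under the mask:

```python
def lowest_set_bit_index(mask):
    """Index of the lowest set bit of a non-zero mask (0x18 -> 3)."""
    if mask == 0:
        raise ValueError("mask must be non-zero")
    index = 0
    while (mask & 1) == 0:
        mask >>= 1
        index += 1
    return index


def all_bits_set(control, mask):
    """The test discussed above: true only when every mask bit is set."""
    return (control & mask) == mask


def encoded_value(control, mask):
    """The number stored in the masked bits, shifted down to bit 0."""
    return (control & mask) >> lowest_set_bit_index(mask)


# With the two-bit mask 0x03 the all-bits-set test only recognises 3,
# while the stored value can be 0, 1, 2 or 3.
for control in (0x00, 0x01, 0x02, 0x03):
    print("%d %s %d" % (control, all_bits_set(control, 0x03),
                        encoded_value(control, 0x03)))
```

For a one-bit mask both functions carry the same information, which is why the test is fine for such masks.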

Ciao,
Steffen
Old 09-15-2011, 04:49 PM   #198
KevinH
Quote:
Originally Posted by siebert View Post
You're only testing whether all bits are set or not. For a 1-bit mask this is ok, as it can only encode the values 0 and 1.

With a two-bit mask the encoded values are 0, 1, 2 and 3 (but 3 means more than 2 and the real value is stored in a separate byte). The current code doesn't decode these values.

Ciao,
Steffen
Hi Steffen,

I am not sure I understand.

If I assume the following (note tag1 has more than 1 bit set in its mask)

tag 1 has bitmask 0x03 and requires 1 value be read in as field 1
tag 2 has bitmask 0x01 and requires 1 value be read in as field 2
tag 3 has bitmask 0x02 and requires 1 value be read in as field 3
tag 4 has bitmask 0x08 and requires 1 value to be read in as field 4

And if type == 0x07:

I would read in the first value as field 1, next value as field 2, next value as field 3 and no further values would be read in for this particular entry since the bitmask & type != bitmask for tag 4.


I think you are saying this is wrong ...

I think you are saying that a bitmask with two bits set means something different from how I am interpreting it?

If so, could you explain, with a concrete example, how a bitmask with more than 1 bit set should be interpreted, if not as I assumed above?

Thanks!

Kevin
Old 09-15-2011, 05:20 PM   #199
siebert
Quote:
Originally Posted by KevinH View Post
Hi Steffen,

I am not sure I understand.

If I assume the following (note tag1 has more than 1 bit set in its mask)

tag 1 has bitmask 0x03 and requires 1 value be read in as field 1
tag 2 has bitmask 0x01 and requires 1 value be read in as field 2
tag 3 has bitmask 0x02 and requires 1 value be read in as field 3
tag 4 has bitmask 0x08 and requires 1 value to be read in as field 4

And if type == 0x07:

I would read in the first value as field 1, next value as field 2, next value as field 3 and no further values would be read in for this particular entry since the bitmask & type != bitmask for tag 4.
0x07 = 0b00000111
Tag 1:
0x03 = 0b00000011
0x07 AND 0x03 = 0x03
This would mean that we have 3 values of tag 1. But as I've said, for multi-bit masks a result of all ones (as in this case) means the real number can be anything > 2, and you have to read one byte (or a multibyte value, I don't remember which) to get the real number of values for tag 1.

If type were 0x06 instead of 0x07:
0x06 AND 0x03 = 0x02
This would mean we have 2 values of tag 1.

Tag2:
I'm confused. The mask 0x01 collides with the mask 0x03. It's not possible to have a tag with mask 0x01 and another with mask 0x03 in the same control byte.

The control byte works as follows. You have one byte (8 bits) and want to encode the number of tag values for several tags with these 8 bits.

If a tag can occur only once, you need one bit. If it can occur several times, you need more bits. All masks I've seen so far had a maximum of 2 bits.

Let's say you have 3 tags with one bit and one tag with two bits; then you should see the following masks:

0b00000001 = 0x01 for tag1
0b00000010 = 0x02 for tag2
0b00000100 = 0x04 for tag3
0b00011000 = 0x18 for tag4

A control byte of 0x15 would then decode as:
0b00010101

1 * tag 1
0 * tag 2
1 * tag 3
2 * tag 4

I hope it's now clear what I mean.
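
The decoding above can be sketched in a few lines of Python. This is only an illustration of the example, not the code from the script: the function names and the tag table are made up, and it does not handle the case where all bits of a multi-bit mask are set and the real count has to be read separately.

```python
def shift_for(mask):
    """Number of zero bits below the lowest set bit of the mask."""
    shift = 0
    while mask and (mask & 1) == 0:
        mask >>= 1
        shift += 1
    return shift


def decode_control_byte(control, tag_masks):
    """Return (tag, count) pairs for each (tag, mask) in tag_masks."""
    counts = []
    for tag, mask in tag_masks:
        counts.append((tag, (control & mask) >> shift_for(mask)))
    return counts


# The masks from the example: three one-bit tags and one two-bit tag.
TAG_MASKS = [("tag1", 0x01), ("tag2", 0x02), ("tag3", 0x04), ("tag4", 0x18)]

print(decode_control_byte(0x15, TAG_MASKS))
# [('tag1', 1), ('tag2', 0), ('tag3', 1), ('tag4', 2)]
```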

Ciao,
Steffen
Old 09-15-2011, 06:52 PM   #200
KevinH
Hi Steffen,

Yes, thanks! That is much clearer.

An NCX entry can never have more than one parent, and it can have only one position, one class and one length. Although it could have many children, the children are indicated by two values that give a range: the record number of the first child of this ncx entry and the record number of the last child.

So it appears that only 1 bit is ever needed. I would guess your inflection dictionaries are much, much more complicated than the ncx entries.

So because of the structure of the fields, I believe that multi-bit masks as you describe below are never used.

And I agree we should at least run a test and warn if multi-bit masks are ever found in the NCX code.
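
Such a check could look roughly like the sketch below. The function names are hypothetical, and it assumes the TAGX masks have already been collected into a plain list of integers; how the script stores them is not shown here.

```python
def has_multiple_bits(mask):
    """True if more than one bit is set (0x18 -> True, 0x08 -> False)."""
    # Clearing the lowest set bit leaves something only if another bit is set.
    return (mask & (mask - 1)) != 0


def find_multi_bit_masks(masks):
    """Return the masks that the single-bit NCX code would mishandle."""
    return [mask for mask in masks if has_multiple_bits(mask)]


suspicious = find_multi_bit_masks([0x01, 0x02, 0x04, 0x18])
for mask in suspicious:
    print("Warning: multi-bit TAGX mask 0x%02x found in NCX index" % mask)
```

A mask of 0 is not reported, since it has no bits set at all.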

Thanks!

Kevin

Quote:
Let's say you have 3 tags with one bit and one tag with two bits; then you should see the following masks:

0b00000001 = 0x01 for tag1
0b00000010 = 0x02 for tag2
0b00000100 = 0x04 for tag3
0b00011000 = 0x18 for tag4

A control byte of 0x15 would then decode as:
0b00010101

1 * tag 1
0 * tag 2
1 * tag 3
2 * tag 4

I hope it's now clear what I mean.
Old 09-15-2011, 07:22 PM   #201
DiapDealer
Grand Sorcerer
Posts: 27,546
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
Originally Posted by fandrieu View Post
Hehe, i didn't take the time to check your latest fixes (pretty late here), but you seem to have spotted the misplaced outncx=False line

I just wanted to add another bit that troubled me:
I merged the (hopefully fixed) sortINDX & buildNCX functions, removing an "evolutionary" clutch with the added bonus of correct indenting (but didn't take much time to test it though...)
Your latest (mobiunpack_testncx_onemore.zip) still has a bit of a bug if the mobi doesn't have an ncx. You're building the opf so that the spine always indicates the toc="ncx" bit. Like so (line 1540):
Code:
if outncx:
    outncxbasename = os.path.basename(outncx)
    data += '<item id="ncx" media-type="application/x-dtbncx+xml" href="'+outncxbasename+'"></item>\n'
data.append('</manifest>\n<spine toc="ncx">\n<itemref idref="item1"/>\n</spine>\n<tours>\n</tours>\n')
What it needs to be is:
Code:
if outncx:
    outncxbasename = os.path.basename(outncx)
    data += '<item id="ncx" media-type="application/x-dtbncx+xml" href="'+outncxbasename+'"></item>\n'
    data.append('</manifest>\n<spine toc="ncx">\n<itemref idref="item1"/>\n</spine>\n<tours>\n</tours>\n')
else:
    data.append('</manifest>\n<spine>\n<itemref idref="item1"/>\n</spine>\n<tours>\n</tours>\n')
Old 09-15-2011, 08:37 PM   #202
KevinH
Hi All,

To prevent duplication of effort ...

Does anyone here want to take a shot at refactoring and/or adding classes? I would assume that class-wise we could have an NCX-related class, an OPF-related class, and Dictionary-related classes (the NCX and Dictionary classes could share TAGX and INDX classes if need be). The goal would be to clean up the code, shrink it wherever possible, and make sure the routines that obviously belong to a class get encapsulated by that class.

Any takers?

KevinH

Last edited by KevinH; 09-15-2011 at 10:52 PM.
Old 09-16-2011, 01:32 AM   #203
siebert
Quote:
Originally Posted by KevinH View Post
Hi Steffen,

Yes, thanks! That is much clearer.

[...]

So it appears that 1 bit is only ever needed. I would guess your inflection dictionaries are much much more complicated than the ncx entries.
I think we should follow the DRY (don't repeat yourself) principle.

The getTagMap() function should already do everything needed to decode an index entry (including multi-bit masks), so I suggest using it also for ncx index handling.

Ciao,
Steffen
Old 09-16-2011, 06:03 AM   #204
fandrieu
Member
Posts: 11
Karma: 10
Join Date: Sep 2011
Device: kindle 3
Quote:
Originally Posted by pdurrant View Post
Bear in mind that calibre-generated Mobipocket files might not be valid in all instances, since the code was written with reverse-engineered info, not with documentation of the format.
Absolutely. By the way, does anyone know a way to spot calibre-generated books or to identify the book generator?


Quote:
Originally Posted by siebert View Post
I assume that the ncx index also contains an IDXT section; why don't you use it to find the start and end position of each entry, so you can verify that you've decoded all bytes?
I was thinking about what's left to parse, and so the IDXT section at the end of INDX0 came to mind.

I didn't know what was in there when I started this. If I understand you correctly, it contains the positions of the actual index entries in INDX1, is that it?

If someone hasn't already done it, I'll look into it...


Quote:
Originally Posted by DiapDealer View Post
Your latest (mobiunpack_testncx_onemore.zip) still has a bit of a bug if the mobi doesn't have an ncx. You're building the opf so that the spine always indicates the toc="ncx" bit.
You're absolutely right, and the worst part is that I knew about it but never fixed it.
That's what happens when you release proof-of-concept code in haste.

...

About the TAGX / mask stuff: thanks for all your input and for shaping this code into something correct. I knew too little about the mobi format to figure that out quickly...

About the refactor: go ahead, KevinH, if you feel like it. Perhaps you could share a skeleton of the class structure here, so that others can comment on and improve it?
Old 09-16-2011, 08:48 AM   #205
KevinH
Hi,

Quote:
Originally Posted by fandrieu View Post
About the refactor: go ahead, KevinH, if you feel like it. Perhaps you could share a skeleton of the class structure here, so that others can comment on and improve it?
Actually, classes have restarted and my teaching/research takes up most of my free time now, so I was fishing for someone else to take over that duty!

KevinH
Old 09-16-2011, 09:11 AM   #206
fandrieu
As siebert suggested, I modified the code to use the IDXT data in INDX1 to "find" the entries.

Before, we relied exclusively on the TAGX data and assumed that after parsing an entry we would be positioned correctly at the start of the next one.

Now the offsets found in the IDXT are used to (I hope) accurately find each entry: there should be no more "positioning" bugs.

(Note that to do that I chose to pass the whole INDX section to parseINDX1, where before it was only the "navdata" part, so that the offsets in IDXT are used verbatim...)
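
For readers following along, the idea can be sketched like this. It is not the code from the attached zip: the function names are made up, and it assumes that the IDXT block starts with the 4-byte marker 'IDXT' followed by one 16-bit big-endian offset per entry, that offsets are relative to the start of the INDX section, and that the last entry ends where the IDXT block begins.

```python
import struct


def read_idxt_offsets(indx, idxt_start, entry_count):
    """Return the entry start offsets listed in the IDXT block."""
    if indx[idxt_start:idxt_start + 4] != b'IDXT':
        raise ValueError('no IDXT marker at offset %d' % idxt_start)
    # Assumption: entry_count unsigned 16-bit big-endian values follow.
    fmt = '>%dH' % entry_count
    return list(struct.unpack_from(fmt, indx, idxt_start + 4))


def entry_spans(indx, idxt_start, entry_count):
    """Pair each entry start with its end (the next start, or IDXT)."""
    starts = read_idxt_offsets(indx, idxt_start, entry_count)
    ends = starts[1:] + [idxt_start]
    return list(zip(starts, ends))


# Tiny synthetic section: 8 header bytes, two entries, then the IDXT block.
section = b'INDX' + b'\x00' * 4 + b'aaa' + b'bbbbb'
section += b'IDXT' + struct.pack('>2H', 8, 11)
print(entry_spans(section, 16, 2))
```

Knowing both the start and the end of every entry also makes it possible to check that the tag decoding consumed exactly the bytes of that entry.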

....

I also fixed "buildNCX" to set "playOrder" and "dtb:depth" correctly (I guess).

....

Also I didn't mention it before, but in my previous version I introduced some code to include the "filepos" found in INDX in the "anchor" algorithm.
Without that, if a link is found only in the NCX and not in the html itself, the corresponding anchor would be missing.

...

@KevinH: I'll try to look into it over the weekend if possible...
Attached Files
File Type: zip mobiunpack_ncx_idxt.zip (16.9 KB, 217 views)
Old 09-16-2011, 12:36 PM   #207
kovidgoyal
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I just came across this thread. Some tips:

1) Do not use the INDX entries to build an NCX. INDX entries can have a maximum depth of 2 for books and 3 for periodicals. This is a limitation of the MOBI format. Instead, parse the inline TOC, calculate the left indents and reconstruct the NCX from that. See the code in mobi/reader.py in calibre for how to do that.

2) If you still want to decompile INDX entries and are looking at calibre code to figure out index entries:

a) note that indexer.py currently does not generate depth 2 INDX entries for books, primarily because I got tired of figuring out the TBS indexing for depth 2 book nodes.
b) you should look at the code in mobi/debug.py, which is designed to decompile arbitrary MOBI files, including the index and TBS information. You can run that code with calibre-debug --inspect-mobi filename.mobi
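
As an illustration of tip 1, the step from left indents to a nested structure might look like the sketch below. This is not calibre's code: it assumes the inline TOC has already been parsed into (indent, title) pairs in document order, and the function name is made up.

```python
def build_toc_tree(items):
    """Nest (indent, title) pairs: a larger indent means a child entry."""
    root = {'title': None, 'children': []}
    stack = [(-1, root)]  # (indent, node); the root sits below any indent
    for indent, title in items:
        node = {'title': title, 'children': []}
        # Climb back up until the top of the stack is a shallower entry.
        while stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1]['children'].append(node)
        stack.append((indent, node))
    return root['children']


toc = build_toc_tree([
    (0, 'Part I'),
    (20, 'Chapter 1'),
    (20, 'Chapter 2'),
    (0, 'Part II'),
])
print([entry['title'] for entry in toc])
```

Each nested level would then become one level of navPoint elements in the NCX, with no limit on depth.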
Old 09-16-2011, 01:34 PM   #208
DiapDealer
Hi fandrieu,

I'm getting consistent results with many, many books with the latest script (from post #206), so thanks for your efforts!

I wanted to point out one potential problem area, though. Line 1036:
Code:
entry = re.sub('^', indent, entry, 0, re.M)
The above code will only work with Python 2.7 or later, because the flags argument of re.sub was added in 2.7. If you want to include 2.6 and 2.5 users, consider replacing that line with this compatible code:
Code:
entry = re.sub(re.compile('^', re.M), indent, entry, 0)
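
A small self-contained version of the same idea, for anyone who wants to try it outside the script (the function name and sample text are just for illustration):

```python
import re

# Compile once with re.M so that '^' matches at the start of every line.
# Passing flags to re.compile works on old and new Python versions alike.
LINE_START = re.compile('^', re.M)


def indent_lines(text, indent):
    """Prefix every line of text with indent."""
    return LINE_START.sub(indent, text)


print(indent_lines('<navPoint>\n</navPoint>', '  '))
```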
Old 09-16-2011, 02:17 PM   #209
KevinH
Hi,

And for my own personal sanity please change one other little thing:

print "Wite NCX"

to

print "Write NCX"

;-)

Thanks!

Kevin

Quote:
Originally Posted by DiapDealer View Post
Hi fandrieu,

I'm getting consistent results with many, many books with the latest script (from post #206), so thanks for your efforts!

I wanted to point out one potential problem area, though. Line 1036:
Code:
entry = re.sub('^', indent, entry, 0, re.M)
The above code will only work with Python 2.7 or later, because the flags argument of re.sub was added in 2.7. If you want to include 2.6 and 2.5 users, consider replacing that line with this compatible code:
Code:
entry = re.sub(re.compile('^', re.M), indent, entry, 0)
Old 09-16-2011, 02:25 PM   #210
KevinH
Hi Kovid,

Quote:
Originally Posted by kovidgoyal View Post
I just came across this thread. Some tips:
1) Do not use the INDX entries to build an NCX. INDX entries can have a maximum depth of 2 for books and 3 for periodicals. This is a limitation of the MOBI format. Instead, parse the inline TOC, calculate the left indents and reconstruct the NCX from that. See the code in mobi/reader.py in calibre for how to do that.
I think the idea is to eventually use "mobiunpack.py" as a way for people to take mobis generated by KindleGen, unpack them while making as few changes as possible, let the user make whatever changes they want, and then pass the whole thing back through KindleGen to get a mobi again.

So I think the idea is to recreate the NCX that is stored inside the mobi and pass it back in, so that it gets regenerated in exactly the same way.

Thus the idea to look at the internal ncx rather than parse the text to create one of our own.

Quote:
2) If you still want to decompile indx entries and are looking at calibre code to figure out index entries:

a) note that indexer.py currently does not generate depth 2 INDX entries for books, primarily because I got tired of figuring out the TBS indexing for depth 2 book nodes.
I used your indexer.py code to verify what the tag values are and what they mean (parent, first_child, last_child, class, etc.). Our code already handles reading in depth 2 for ebooks (tested with books from KindleGen, etc.), but I have not tried it with a periodical at all.

Quote:
b) you should look at the code in mobi/debug.py which is designed to decompile arbitrary MOBI files including the index and TBS information. You can run that code with calibre-debug --inspect-mobi filename.mobi
Great tool. Will do.

Thanks,

KevinH