09-15-2011, 12:41 PM | #196 | |
Sigil Developer
Posts: 7,630
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Quote:
The test "if type & mask == mask" should work for whatever the bitmask is, assuming it is truly a mask (i.e. that all bits set in the mask are also set in the type value), since & is a bitwise operator. If the tagx bitmask has more than one bit set, then that is captured by the mask. What am I missing? |
|
09-15-2011, 12:50 PM | #197 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
With a two-bit mask the encoded values are 0, 1, 2 and 3 (but 3 means more than 2 and the real value is stored in a separate byte). The current code doesn't decode these values. Ciao, Steffen |
|
09-15-2011, 04:49 PM | #198 | |
Sigil Developer
Posts: 7,630
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Quote:
I am not sure I understand.
If I assume the following (note that tag 1 has more than 1 bit set in its mask):
tag 1 has bitmask 0x03 and requires 1 value to be read in as field 1
tag 2 has bitmask 0x01 and requires 1 value to be read in as field 2
tag 3 has bitmask 0x02 and requires 1 value to be read in as field 3
tag 4 has bitmask 0x08 and requires 1 value to be read in as field 4
And if type == 0x07:
I would read in the first value as field 1, the next value as field 2, the next value as field 3, and no further values would be read in for this particular entry, since bitmask & type != bitmask for tag 4.
I think you are saying this is wrong ... that a bitmask with two bits set means something different from how I am interpreting it? If so, could you explain, via a concrete example, how a bitmask with more than 1 bit set should be interpreted, if not as I assumed above?
Thanks!
Kevin |
|
09-15-2011, 05:20 PM | #199 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
Tag 1: 0x03 = 0b00000011
0x07 AND 0x03 = 0x03
This would mean that we have 3 values of tag 1. But as I've said, for multi-bit masks a result of all ones (as in this case) means the real number can be anything > 2, and you have to read one byte (or a multibyte value, I don't remember which) to get the real number of tag 1 values.
If type were 0x06 instead of 0x07:
0x06 AND 0x03 = 0x02
This would mean we have 2 values of tag 1.
Tag 2: I'm confused. The mask 0x01 collides with the mask 0x03. It's not possible to have a tag with mask 0x01 and another with mask 0x03 in the same control byte.
The control byte works as follows. You have one byte (8 bits) and want to encode the number of tag values for several tags with these 8 bits. If a tag can occur only once, you need one bit. If it can occur several times, you need more bits. All masks I've seen so far had a maximum of 2 bits.
Let's say you have 3 tags with one bit and one tag with two bits; then you should see the following masks:
0b00000001 = 0x01 for tag 1
0b00000010 = 0x02 for tag 2
0b00000100 = 0x04 for tag 3
0b00011000 = 0x18 for tag 4
A control byte of 0x15 would then decode as:
0b00010101
1 * tag 1
0 * tag 2
1 * tag 3
2 * tag 4
I hope it's now clear what I mean.
Ciao, Steffen |
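For anyone who prefers code to bit patterns, the decoding described above can be sketched roughly like this. It is only an illustration, not the script's actual code: the function names are made up, and it assumes each mask is a contiguous run of bits whose value is shifted down to bit 0.

```python
def decode_control_byte(control, masks):
    """Return the raw count field for each tag mask in a control byte.

    Assumption: each mask is a contiguous run of bits, and the count is
    the masked value shifted down so its lowest bit sits at bit 0.
    """
    counts = []
    for mask in masks:
        if mask == 0:
            raise ValueError("empty mask")
        # Find the position of the lowest set bit in the mask.
        shift = 0
        while not ((mask >> shift) & 1):
            shift += 1
        counts.append((control & mask) >> shift)
    return counts


def needs_extra_count(control, mask):
    """True when a multi-bit field is all ones, i.e. the real count is
    stored separately (as described in the post above)."""
    multi_bit = (mask & (mask - 1)) != 0
    return multi_bit and (control & mask) == mask


# Masks and control byte taken from the example above.
masks = [0x01, 0x02, 0x04, 0x18]
print(decode_control_byte(0x15, masks))  # [1, 0, 1, 2]
```

The same needs_extra_count() check could also serve as the "warn if a multi-bit mask ever shows up" test, since it starts by detecting whether a mask has more than one bit set.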
|
09-15-2011, 06:52 PM | #200 | |
Sigil Developer
Posts: 7,630
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi Steffen,
Yes, thanks! That is much clearer.
An NCX entry can never have more than one parent, and can only have one position, one class, and one length. Although it could have many children, the children are actually indicated by two values that give a range: the first is the record number of the first child of this NCX entry, and the second is the record number of the last child. So it appears that only 1 bit is ever needed. I would guess your inflection dictionaries are much, much more complicated than the NCX entries.
So, because of the structure of the fields, I believe that multi-bit masks as you describe below are never used. And I agree we should at least run a test and warn if multi-bit masks are ever found in the NCX code.
Thanks!
Kevin
Quote:
|
|
09-15-2011, 07:22 PM | #201 | |
Grand Sorcerer
Posts: 27,546
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
Code:
if outncx:
    outncxbasename = os.path.basename(outncx)
    data += '<item id="ncx" media-type="application/x-dtbncx+xml" href="'+outncxbasename+'"></item>\n'
    data.append('</manifest>\n<spine toc="ncx">\n<itemref idref="item1"/>\n</spine>\n<tours>\n</tours>\n')
Code:
if outncx:
    outncxbasename = os.path.basename(outncx)
    data += '<item id="ncx" media-type="application/x-dtbncx+xml" href="'+outncxbasename+'"></item>\n'
    data.append('</manifest>\n<spine toc="ncx">\n<itemref idref="item1"/>\n</spine>\n<tours>\n</tours>\n')
else:
    data.append('</manifest>\n<spine>\n<itemref idref="item1"/>\n</spine>\n<tours>\n</tours>\n')
|
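As far as the two snippets show, the only difference is the else branch, which still closes the manifest and writes a plain spine when no NCX is produced. A self-contained sketch of that idea follows. The helper name opf_tail is hypothetical and this is not the script's actual function; it simply collects the strings in a list and joins them.

```python
import os


def opf_tail(outncx=None):
    """Return the closing part of the OPF, with or without an NCX item.

    Hypothetical helper for illustration only.
    """
    data = []
    if outncx:
        outncxbasename = os.path.basename(outncx)
        data.append('<item id="ncx" media-type="application/x-dtbncx+xml" href="'
                    + outncxbasename + '"></item>\n')
        data.append('</manifest>\n<spine toc="ncx">\n<itemref idref="item1"/>\n'
                    '</spine>\n<tours>\n</tours>\n')
    else:
        # No NCX: still close the manifest, but leave out the toc attribute.
        data.append('</manifest>\n<spine>\n<itemref idref="item1"/>\n'
                    '</spine>\n<tours>\n</tours>\n')
    return ''.join(data)
```

Either way the manifest gets closed, so the OPF should stay well-formed whether or not an NCX file was written.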
|
09-15-2011, 08:37 PM | #202 |
Sigil Developer
Posts: 7,630
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi All,
To prevent duplication of effort ... Does anyone here want to take a shot at refactoring and/or adding classes? Class-wise, I would assume we could have an NCX-related class, an OPF-related class, and Dictionary-related classes (and the NCX and Dictionary classes could share any TAGX and INDX classes if need be), then try to clean up the code, shrink it wherever possible, and make sure the routines that obviously belong to a class get encapsulated by that class. Any takers? KevinH Last edited by KevinH; 09-15-2011 at 10:52 PM. |
09-16-2011, 01:32 AM | #203 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
The getTagMap() function should already do everything needed to decode an index entry (including multi-bit masks), so I suggest using it also for ncx index handling. Ciao, Steffen |
|
09-16-2011, 06:03 AM | #204 | |||
Member
Posts: 11
Karma: 10
Join Date: Sep 2011
Device: kindle 3
|
Quote:
Quote:
I didn't know what was in there when I started this. If I understand you right, it contains the positions of the actual index entries in INDX1, is that it? If someone hasn't already done it, I'll look into it... Quote:
That's what happens when you release proof-of-concept code in haste ... About the TAGX / mask stuff: thanks for all your input and for shaping this code into something correct. I knew too little about the mobi format to figure that out quickly... About the refactoring: go ahead, KevinH, if you feel like it. Perhaps you could share a skeleton of the class structure here, so that others can comment on or improve it? |
|||
09-16-2011, 08:48 AM | #205 | |
Sigil Developer
Posts: 7,630
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi,
Quote:
KevinH |
|
09-16-2011, 09:11 AM | #206 |
Member
Posts: 11
Karma: 10
Join Date: Sep 2011
Device: kindle 3
|
As siebert suggested, I modified the code to use the IDXT data in INDX1 to "find" the entries.
Before, we relied exclusively on the TAGX data and assumed that after parsing an entry we would be correctly positioned at the start of the next one. Now the offsets found in the IDXT are used to (I hope) accurately find each entry, so there should be no more "positioning" bugs. (Note that to do that I chose to pass the whole INDX section to parseINDX1, where before it was only the "navdata" part, so that the offsets in IDXT are used verbatim.)
I also fixed "buildNCX" to correctly (I guess) set "playOrder" and "dtb:depth".
Also, I didn't mention it before, but in my previous version I introduced some code to include the "filepos" found in INDX in the "anchor" algorithm. Without that, if a link is found only in the NCX and not in the HTML itself, the corresponding anchor would be missing.
@KevinH: I'll try to look into it over the weekend if possible... |
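To make the idea concrete, here is a rough sketch of locating entries from the IDXT offsets. It is not the code from the attached script. It assumes (unconfirmed in this thread) that the IDXT block is the 4-byte magic "IDXT" followed by one 2-byte big-endian offset per entry, measured from the start of the INDX section, and that the caller already knows where the block starts and how many entries there are. The function name entry_spans is made up.

```python
import struct


def entry_spans(section, idxt_pos, count):
    """Return (start, end) byte ranges of the index entries in an INDX section.

    Assumptions: the IDXT block starts at idxt_pos with the magic b'IDXT',
    followed by `count` 2-byte big-endian offsets relative to the start
    of the whole section.
    """
    if section[idxt_pos:idxt_pos + 4] != b'IDXT':
        raise ValueError('IDXT block not found at offset %d' % idxt_pos)
    offsets = list(struct.unpack_from('>%dH' % count, section, idxt_pos + 4))
    # Each entry ends where the next one starts; the last one ends at the IDXT block.
    offsets.append(idxt_pos)
    return [(offsets[i], offsets[i + 1]) for i in range(count)]
```

Because every entry is sliced out by its own offset, a parsing mistake inside one entry can no longer shift the starting position of the entries after it.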
09-16-2011, 12:36 PM | #207 |
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I just came across this thread. Some tips:
1) Do not use the INDX entries to build an NCX. INDX entries can have a maximum depth of 2 for books and 3 for periodicals. This is a limitation of the MOBI format. Instead, parse the inline TOC, calculate the left indents and reconstruct the NCX from that. See the code in mobi/reader.py in calibre to do that.
2) If you still want to decompile INDX entries and are looking at calibre code to figure out index entries:
a) Note that currently indexer.py does not generate depth 2 INDX entries for books, primarily because I got tired of figuring out the TBS indexing for depth two book nodes.
b) You should look at the code in mobi/debug.py, which is designed to decompile arbitrary MOBI files, including the index and TBS information. You can run that code with
calibre-debug --inspect-mobi filename.mobi |
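The "rebuild the nesting from left indents" step in tip 1 could look roughly like the sketch below. This is not calibre's code. It assumes the inline TOC has already been parsed into (indent, title) pairs, and that a larger indent means "child of the nearest preceding item with a smaller indent". The function name nest_by_indent is made up.

```python
def nest_by_indent(items):
    """Turn (indent, title) pairs into a tree of {'title', 'children'} dicts.

    Assumption: indents are non-negative numbers, and an item belongs to
    the nearest preceding item whose indent is strictly smaller.
    """
    root = {'title': None, 'children': []}
    stack = [(-1, root)]  # (indent, node); -1 keeps the root below any real indent
    for indent, title in items:
        node = {'title': title, 'children': []}
        # Drop everything that is not a possible parent of this item.
        while stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1]['children'].append(node)
        stack.append((indent, node))
    return root['children']


toc = [(0, 'Part I'), (20, 'Chapter 1'), (20, 'Chapter 2'), (0, 'Part II')]
tree = nest_by_indent(toc)
```

Since the depth comes from the indents alone, this approach is not bound by the depth limit of the INDX entries.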
09-16-2011, 01:34 PM | #208 |
Grand Sorcerer
Posts: 27,546
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Hi fandrieu,
I'm getting consistent results with many, many books with the latest script (from post #206), so thanks for your efforts! I wanted to point out one potential problem area, though.
Line 1036:
Code:
entry = re.sub('^', indent, entry, 0, re.M)
Code:
entry = re.sub(re.compile('^', re.M), indent, entry, 0)
|
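Presumably the reason for the change is that the flags argument of re.sub() only exists from Python 2.7 onward, so passing re.M there fails on older interpreters, whereas a pattern compiled with re.M works everywhere. A small sketch of the same substitution follows. The helper name indent_entry is made up and is not in the script.

```python
import re

# Compile once with re.M so that '^' matches at the start of every line.
LINE_START = re.compile('^', re.M)


def indent_entry(entry, indent):
    """Prefix every line of entry with indent."""
    return LINE_START.sub(indent, entry)


print(indent_entry('<navPoint>\n<text>Chapter 1</text>\n</navPoint>', '  '))
```

Compiling the pattern once at module level also avoids building it again for every entry.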
09-16-2011, 02:17 PM | #209 | |
Sigil Developer
Posts: 7,630
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi,
And for my own personal sanity please change one other little thing: print "Wite NCX" to print "Write NCX" ;-) Thanks! Kevin Quote:
|
|
09-16-2011, 02:25 PM | #210 | |||
Sigil Developer
Posts: 7,630
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi Kovid,
Quote:
So I think the idea is to generate the NCX that is stored inside the mobi and pass it back in so that it gets regenerated in exactly the same way. Hence the idea to look at the internal NCX rather than parse to create one of our own. Quote:
Quote:
Thanks, KevinH |
|||