MobileRead Forums - View Single Post - Kindlestrip Python script and AppleScript wrapper

KevinH · 12-06-2013, 10:55 AM

Hi All,

I have spent time reading the calibre source on APNX, the pages in the wiki, usernone's work, and what dilo_sec discovered in this thread: https://www.mobileread.com/forums/sho...5&postcount=45

After spending some more time on it, I think I now understand the page-map information more fully, especially when a document uses more than one page numbering scheme.

Here is my analysis for the record, in case anyone else is interested:

Code:

Actual epub page-map.xml

<page-map xmlns="http://www.idpf.org/2007/opf">
<page name="i"  href="chapter_01.html#page_i"/>
<page name="ii"  href="chapter_01.html#page_ii"/>
<page name="1"  href="chapter_01.html#page_1"/>
<page name="2"  href="chapter_01.html#page_2"/>
<page name="3"  href="chapter_01.html#page_3"/>
<page name="4"  href="chapter_01.html#page_4"/>
<page name="5"  href="chapter_01.html#page_5"/>
<page name="A-1"  href="chapter_01.html#page_A1"/>
<page name="A-2"  href="chapter_01.html#page_A2"/>
<page name="I-1" href="chapter_01.html#page_I1"/>
</page-map>


Kindlegen stores the PAGE map info at the front of the SRCS section for both
the Mobi 7 and Mobi 8 parts. Below is the PAGE information from the Mobi 8 (KF8) part:

PAGE^@^@^@^H^@^A^@^A^@^@^@*^@^@^@^^{
   "fileRevisionId" : "1"
}
^@^A^@n^@
^@^P{
   "description" : "PageMap from source by kindlegen",
   "pageMap" : "(1,r,1),(3,a,1),(8,c,A-1|A-2|I-1)"
}
^C\236^F^L^H{
\257^L\304^Oi^Q\327^S]^V^_^X\201


Here is the hex representation of this PAGE info from the Mobi 8 part:

87654321  0011 2233 4455 6677 8899 aabb ccdd eeff  0123456789abcdef                                                             
00000000: 5041 4745 0000 0008 0001 0001 0000 002a  PAGE...........*
00000010: 0000 001e 7b0a 2020 2022 6669 6c65 5265  ....{.   "fileRe
00000020: 7669 7369 6f6e 4964 2220 3a20 2231 220a  visionId" : "1".
00000030: 7d0a 0001 006e 000a 0010 7b0a 2020 2022  }....n....{.   "
00000040: 6465 7363 7269 7074 696f 6e22 203a 2022  description" : "
00000050: 5061 6765 4d61 7020 6672 6f6d 2073 6f75  PageMap from sou
00000060: 7263 6520 6279 206b 696e 646c 6567 656e  rce by kindlegen
00000070: 222c 0a20 2020 2270 6167 654d 6170 2220  ",.   "pageMap"
00000080: 3a20 2228 312c 722c 3129 2c28 332c 612c  : "(1,r,1),(3,a,
00000090: 3129 2c28 382c 632c 412d 317c 412d 327c  1),(8,c,A-1|A-2|
000000a0: 492d 3129 220a 7d0a 039e 060c 087b 0aaf  I-1)".}......{..
000000b0: 0cc4 0f69 11d7 135d 161f 1881            ...i...]....

Analysis
---------
00000000 - 0000000f  Section header PAGE
00000010 - 00000011  0
00000012 - 00000013  30: Length of rev string in bytes (Big Endian Half Word)
{
   "fileRevisionId" : "1"
}
00000032 - 00000033  1:  Always 1?

00000034 - 00000035  110: Length of PageMap in bytes (Big Endian Half Word)

00000036 - 00000037  10: Number of Page names (Big Endian Half Word)

00000038 - 00000039  16: Number of bits used in each offset to a page href destination
					 - typically this is 32 (0x20), but my example was small
					 enough that Kindlegen used only 16-bit offsets
					 
0000003A - 000000A7  PageMap showing a tuple for each numbering scheme used in the document, with the following format:

		(entry_number, numbering_scheme, values) 

		where:
			- entry_number is the entry in page-map.xml (counting from 1) at which the scheme starts
			- numbering_scheme is c - character, r - roman, a - arabic
			- values is the starting page number for the "r" and "a" schemes; otherwise
				it is a pipe-separated ("|") list of page names
{
   "description" : "PageMap from source by kindlegen",
   "pageMap" : "(1,r,1),(3,a,1),(8,c,A-1|A-2|I-1)"
}


000000A8 - 000000BB  Table of offsets into the assembled text, one per page name. Here they are 16-bit Big Endian Half Words; when the bit width above is 32 they are Big Endian Words.

0x039e - offset in bytes to page i anchor
0x060c - offset in bytes to page ii anchor
0x087b - offset in bytes to page 1 anchor
0x0aaf - offset in bytes to page 2 anchor
0x0cc4 - offset in bytes to page 3 anchor
0x0f69 - offset in bytes to page 4 anchor
0x11d7 - offset in bytes to page 5 anchor
0x135d - offset in bytes to page A-1 anchor
0x161f - offset in bytes to page A-2 anchor
0x1881 - offset in bytes to page I-1 anchor
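
For anyone who wants to poke at this, the layout in the analysis above can be read with a few lines of Python. This is only a rough sketch based on the one sample dumped here: `parse_page_section` and its result keys are names I made up for illustration, treating the two text blocks as JSON is an assumption, and the fields in the first 16 bytes are not interpreted at all.

```python
import json
import struct

# The 188-byte sample PAGE data, copied from the hex dump above.
SAMPLE_HEX = """
5041 4745 0000 0008 0001 0001 0000 002a
0000 001e 7b0a 2020 2022 6669 6c65 5265
7669 7369 6f6e 4964 2220 3a20 2231 220a
7d0a 0001 006e 000a 0010 7b0a 2020 2022
6465 7363 7269 7074 696f 6e22 203a 2022
5061 6765 4d61 7020 6672 6f6d 2073 6f75
7263 6520 6279 206b 696e 646c 6567 656e
222c 0a20 2020 2270 6167 654d 6170 2220
3a20 2228 312c 722c 3129 2c28 332c 612c
3129 2c28 382c 632c 412d 317c 412d 327c
492d 3129 220a 7d0a 039e 060c 087b 0aaf
0cc4 0f69 11d7 135d 161f 1881
"""
SAMPLE = bytes.fromhex("".join(SAMPLE_HEX.split()))


def parse_page_section(data):
    """Hypothetical helper: decode one PAGE section per the analysis above."""
    if data[:4] != b"PAGE":
        raise ValueError("data does not start with PAGE")
    # 0x10-0x11 is 0 in the sample; 0x12-0x13 is the revision string length.
    (rev_len,) = struct.unpack_from(">H", data, 0x12)
    pos = 0x14
    revision = data[pos:pos + rev_len].decode("utf-8")
    pos += rev_len
    # Four big-endian half words: the "always 1?" field, the PageMap length,
    # the number of page names, and the bit width of each offset.
    flag, map_len, page_count, offset_bits = struct.unpack_from(">4H", data, pos)
    pos += 8
    page_map = data[pos:pos + map_len].decode("utf-8")
    pos += map_len
    if offset_bits not in (16, 32):
        raise ValueError("unexpected offset width: %d" % offset_bits)
    code = "H" if offset_bits == 16 else "L"
    offsets = struct.unpack_from(">%d%s" % (page_count, code), data, pos)
    return {
        # Assumption: both text blocks are valid JSON, as they are here.
        "revision": json.loads(revision),
        "page_map": json.loads(page_map)["pageMap"],
        "page_count": page_count,
        "offset_bits": offset_bits,
        "offsets": list(offsets),
    }


info = parse_page_section(SAMPLE)
print(info["page_map"])
print([hex(offset) for offset in info["offsets"]])
```

The offset table is read with `H` or `L` depending on the bit-width field, so the same sketch should cope with the more typical 32-bit case, though I have only the 16-bit sample to check it against.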

So I think I now understand it well enough to modify Kindlestrip to handle the PAGE sections inserted at the start of the SRCS section. Right now Kindlestrip will barf on these since they begin with PAGE and not SRCS (an easy fix).
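
To give an idea of how small that fix could be, here is a sketch of the kind of check involved. It assumes the change is simply to accept either magic value at the start of the source data; `starts_source_data` and `SOURCE_MAGICS` are hypothetical names, not Kindlestrip's actual code.

```python
# Hypothetical check, not Kindlestrip's actual code: treat a section as the
# start of the embedded source data if it begins with either magic value.
SOURCE_MAGICS = (b"SRCS", b"PAGE")


def starts_source_data(section):
    """Return True if the raw section bytes begin with SRCS or PAGE."""
    return section[:4] in SOURCE_MAGICS
```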

More importantly, I think we could modify KindleUnpack to recreate the page-map.xml from Kindlegen-generated joint mobis that have PAGE sections,

or alternatively,

possibly create the page-map.xml from the APNX file and the AZW3, if someone had access to both.
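
For the first option, the starting point would be turning the pageMap string back into the original list of page names. Below is a rough sketch of that step only. It assumes each tuple's run continues until the next tuple's entry number, that roman numerals are lowercase as in my example, and that character-scheme names never contain ")". `expand_page_names` and `to_roman` are made-up helper names, and mapping each offset back to a file and anchor for the href is not covered.

```python
import re

ROMAN = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
         (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
         (5, "v"), (4, "iv"), (1, "i")]


def to_roman(number):
    """Convert a positive integer to a lowercase roman numeral."""
    parts = []
    for value, symbol in ROMAN:
        while number >= value:
            parts.append(symbol)
            number -= value
    return "".join(parts)


def expand_page_names(page_map, page_count):
    """Expand "(entry,scheme,values),..." into one name per page entry."""
    runs = [(int(entry), scheme, values) for entry, scheme, values
            in re.findall(r"\((\d+),([a-z]),([^)]*)\)", page_map)]
    names = []
    for index, (entry, scheme, values) in enumerate(runs):
        # A run ends where the next tuple starts, or at the last page entry.
        if index + 1 < len(runs):
            end = runs[index + 1][0]
        else:
            end = page_count + 1
        length = end - entry
        if scheme == "c":
            run = values.split("|")
            if len(run) != length:
                raise ValueError("character run does not match its length")
        elif scheme == "a":
            run = [str(int(values) + k) for k in range(length)]
        elif scheme == "r":
            run = [to_roman(int(values) + k) for k in range(length)]
        else:
            raise ValueError("unknown numbering scheme: %r" % scheme)
        names.extend(run)
    return names


names = expand_page_names("(1,r,1),(3,a,1),(8,c,A-1|A-2|I-1)", 10)
print(names)
```

Each name lines up with the offset at the same position in the offset table, so pairing the two lists gives a (page name, text offset) pair for every entry in the original page-map.xml.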

Would any of this functionality be of interest to anyone in Kindlestrip or KindleUnpack?

Thanks,
KevinH