On repairing defective apnx files.

j.p.s · 10-11-2019, 05:57 PM

(Please don't use this thread to vent that ebooks shouldn't have page numbers, what the numbering scheme should be, or other off topic posts.)

Three of the books that I've read recently with amazon supplied page numbers started out OK, but the page numbers started getting screwy near the end. My best guess is that books with extensive notes, bibliographies, etc. are prone to having link targets that look like page anchors to whatever tools publishers use to generate <pageList> sections in toc.ncx, and that this corrupts the page list.

Details:

I thought it might be possible to repair the apnx files, but that it would be difficult to figure out how, and tedious and time consuming to do. Then I saw post #2 by Doitsu in this thread: https://www.mobileread.com/forums/sh...d.php?t=255926

I don't think kindleunpack has an option to make an epub whose toc.ncx has a <pageList> section, but it turned out to be relatively easy to use kindleunpack -> (some regex and scripting) -> kindlegen -> kindleunpack to get repaired apnx files.

The first step is to look at some of the Text/part0*.xhtml files to learn the form of page anchors used in the book. Next, make a list of (file name, anchor id) pairs, remove anything fishy from the list, and use it to generate a <pageList> section to insert just ahead of the closing </ncx> in toc.ncx. Then run kindlegen on the augmented EPUB. The only thing you need from the fat mobi is the apnx file, which should be renamed to match the one supplied by amazon for the book.

The really good news is that the new apnx file can be copied straight to the sdr directory for the book, overwriting the existing file. (I did this with the book closed on the kindle.) This doesn't seem to faze the kindle at all. The next time the book is opened, the page numbers are correct.

Attached is a perl script to generate a <pageList> section from a list of file name page id pairs.
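For readers who prefer Python, here is a minimal sketch of the same idea. It is not the attached perl script. The function names, the "pt_" id prefix, the choice of type="front" for non-numeric labels, and the simple 1..N playOrder numbering are all assumptions for illustration; check the result against your own toc.ncx before feeding it to kindlegen.

```python
from xml.sax.saxutils import escape, quoteattr


def build_page_list(pairs):
    """Build an NCX <pageList> from (file_name, anchor_id, page_label) tuples.

    Hypothetical helper: attribute choices below are assumptions, not
    something confirmed in this thread.
    """
    lines = ["<pageList>"]
    for order, (file_name, anchor_id, label) in enumerate(pairs, start=1):
        # Assumption: numeric labels are "normal" pages, anything else "front".
        kind = "normal" if label.isdigit() else "front"
        value = f" value={quoteattr(label)}" if label.isdigit() else ""
        src = f"{file_name}#{anchor_id}"
        lines.append(
            f"  <pageTarget id={quoteattr('pt_' + anchor_id)} "
            f'type="{kind}"{value} playOrder="{order}">'
        )
        lines.append(f"    <navLabel><text>{escape(label)}</text></navLabel>")
        lines.append(f"    <content src={quoteattr(src)}/>")
        lines.append("  </pageTarget>")
    lines.append("</pageList>")
    return "\n".join(lines)


def insert_page_list(ncx_text, page_list):
    """Insert the page list just ahead of the last closing </ncx> tag."""
    head, closing, tail = ncx_text.rpartition("</ncx>")
    if not closing:
        raise ValueError("no closing </ncx> tag found")
    return f"{head}{page_list}\n{closing}{tail}"
```

Usage would be something like reading toc.ncx as text, calling insert_page_list(ncx_text, build_page_list(pairs)), and writing the result back before running kindlegen.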

j.p.s · 10-26-2019, 01:29 PM

It was suggested to me privately that Doitsu's pagelist sigil plugin, https://www.mobileread.com/forums/sh...d.php?t=265237, might be useful in repairing books with page number problems.

I tried it on Bad Blood: Secrets and Lies in a Silicon Valley Startup, which uses code like

Code:

<span id="page_3" epub:type="pagebreak" title="3"></span>

for pages and that worked.

Unfortunately, A Brief History of Everyone Who Ever Lived uses

Code:

<a id="page_1"></a>

and Utopia for Realists uses

Code:

<a id="page-1"></a>

and the pagelist plugin does not work for either.

When compatible page number markup was used in the ebook production, the sigil pagelist plugin significantly simplifies the apnx repair workflow to kindleunpack -> sigil pagelist -> kindlegen -> kindleunpack. Note that all that is required from all this is the replacement apnx file, which can be copied to the sdr directory on the kindle; the original azw3 file does not need to be changed.
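For books whose anchors the plugin does not handle, the (file name, anchor id) pairs can be collected with a regular expression. The sketch below covers only the three anchor forms quoted in this post (page_3 on a span, page_1 and page-1 on an empty a element); the pattern and the function name are assumptions, and other books will need their own pattern. The results still need a manual look for anything fishy.

```python
import re

# Matches <a ...> or <span ...> whose id is "page_" or "page-" followed by
# an arabic number or a run of roman numeral letters.
PAGE_ID = re.compile(
    r'<(?:a|span)\b[^>]*\sid="(page[_-]([0-9]+|[ivxlcdm]+))"',
    re.IGNORECASE,
)


def find_page_anchors(file_name, markup):
    """Return (file_name, anchor_id, page_label) for each page anchor found."""
    return [
        (file_name, match.group(1), match.group(2))
        for match in PAGE_ID.finditer(markup)
    ]
```

Ids such as note_12 are skipped because they do not start with "page". A pattern this loose would still accept an id like page_mix, which is one reason to review the list by hand.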

KevinH · 10-26-2019, 08:29 PM

You could use kindleunpack to generate the Adobe pagemap.xml and use search on the custom form of id you want to verify and/or fix the pagemap.xml file. Then add it to the unpacked epub and pass it through kindlegen and strip out the apnx info into a separate file.

j.p.s · 10-27-2019, 01:02 PM

Quote:

Originally Posted by KevinH

You could use kindleunpack to generate the Adobe pagemap.xml and use search on the custom form of id you want to verify and/or fix the pagemap.xml file. Then add it to the unpacked epub and pass it through kindlegen and strip out the apnx info into a separate file.

Thanks, that might lead to an optimal workflow for apnx repair. I'm currently having trouble with kindleunpack that is best dealt with in the kindleunpack thread, but I'm tied up on a bunch of other things at the moment.

But it does seem a bit ironic that the original apnx file isn't even required to make a good one. This implies that for books with page id targets transferred over USB, a "real page number" apnx file can be automatically generated without having to leave airplane mode to get one from amazon.

j.p.s · 11-02-2019, 04:45 PM

Quote:

Originally Posted by KevinH

You could use kindleunpack to generate the Adobe pagemap.xml and use search on the custom form of id you want to verify and/or fix the pagemap.xml file. Then add it to the unpacked epub and pass it through kindlegen and strip out the apnx info into a separate file.

I've used kindleunpack on each of the azw3 files, once with the original apnx file and once with the repaired apnx file that I generated with kindlegen on epubs with a pagelist appended to the toc.ncx. I think the resulting page-map.xml files support my speculation that the page information supplied by the publishers had multiple errors due to HTML link targets being misidentified as page IDs.

The 3 pairs of page-maps are attached as pagemaps.zip

KevinH · 11-05-2019, 08:32 AM

Yes, the "bad" ones definitely look bad and the "fix" ones look much better. I am surprised that this happens, as in older mobi 6 and mobi 7 internal links are filepos info (file offsets) and in newer mobi8 they encode a base 32 file offset into a character based "id-like" equivalent. Both file offsets should be quite precise and not lead to what you are seeing.

Is it just moving in the wrong direction to get the exact link text? Are the "bad" and "fix" targets in any way close together?

That is very strange.

KevinH

j.p.s · 11-05-2019, 11:11 AM

Quote:

Originally Posted by KevinH

Yes, the "bad" ones definitely look bad and the "fix" ones look much better. I am surprised that this happens, as in older mobi 6 and mobi 7 internal links are filepos info (file offsets) and in newer mobi8 they encode a base 32 file offset into a character based "id-like" equivalent. Both file offsets should be quite precise and not lead to what you are seeing.

Is it just moving in the wrong direction to get the exact link text? Are the "bad" and "fix" targets in any way close together?

That is very strange.

KevinH

Hence my theory that the publisher-generated EPUBs have a faulty page-map or pageList. Whatever automated tools they use must have been developed on plain books without much in the way of internal links and didn't get tested on books that have them. Maybe they should contract with Doitsu to supply a generalized version of his sigil plugin.

I don't think the problem is with the conversion to KF. The bad apnx is from garbage in, garbage out.

I don't have a way to get the commercial EPUBs, so I can't investigate further. (No account at EPUB retailer, library, etc. and unwillingness to have anything to do with Adobe.)

j.p.s · 11-09-2019, 09:35 AM

Quote:

Originally Posted by KevinH

Yes, the "bad" ones definitely look bad and the "fix" ones look much better. I am surprised that this happens, as in older mobi 6 and mobi 7 internal links are filepos info (file offsets) and in newer mobi8 they encode a base 32 file offset into a character based "id-like" equivalent. Both file offsets should be quite precise and not lead to what you are seeing.

Is it just moving in the wrong direction to get the exact link text? Are the "bad" and "fix" targets in any way close together?

That is very strange.

KevinH

I've spent some time comparing the "bad" and "fix" page-map.xml files with each other, the part0*.xhtml files, and the assembled_text.dat files. I thought it would be easy to check a few of the references that don't match the pattern for the page number ids and see where they are compared to the actual page references. It turned out that I couldn't find any of them, so I guess that the bogus apnx files as delivered by amazon somehow cause kindleunpack to synthesize them.

I still think this is triggered by books with extensive footnotes but have lost some confidence in that. Still pretty sure bogus pagelist or page-map in the publisher supplied epub is the cause, but have no way to check.

j.p.s · 12-07-2019, 02:43 PM

Quote:

Originally Posted by KevinH

Yes, the "bad" ones definitely look bad and the "fix" ones look much better. I am surprised that this happens, as in older mobi 6 and mobi 7 internal links are filepos info (file offsets) and in newer mobi8 they encode a base 32 file offset into a character based "id-like" equivalent. Both file offsets should be quite precise and not lead to what you are seeing.

Is it just moving in the wrong direction to get the exact link text? Are the "bad" and "fix" targets in any way close together?

That is very strange.

KevinH

I've played with this some more as I get bits of time and better understanding of apnx.

I paginated an EPUB by hand by inserting anchors based on a PDF scan of the book and generated a pagelist from a list of the anchors. The apnx files generated by running kindlegen on the EPUB and kindleunpack on the kindlegen output point to the opening "<" of the anchor for both the mobi7 and mobi8 raw markup. (The mobi7 markup has empty <a ></a>.)

I had not previously looked into the pagination of books in which I did not notice any problem while reading. I wrote a script to dump the page table of offsets at the end of an apnx file and optionally 16 characters from the raw markup (assembled_text.dat) beginning at each offset. No commercial book perfectly matched at every page, but a few came close, with a couple actually matching on almost every page. Some had small offsets, others larger. Sometimes the offset was not a fixed amount. A few did not have anchors or spans that indicated page boundaries, so I have no idea how accurate the apnx offsets are for those.

I'm attaching apnx_dump.pl
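For anyone who wants to try the same kind of dump in Python, here is a rough sketch. It is not a port of apnx_dump.pl. The byte layout is an assumption: three big-endian 32-bit fields, a JSON content header, four big-endian 16-bit fields (the second being the page header length and the third the page count), a JSON page header, then one 32-bit offset per page. That matches my understanding of what calibre's APNX generator writes; amazon delivered files may differ, so treat the parser as a starting point only.

```python
import json
import struct


def parse_apnx(data):
    """Split APNX bytes into (content_header, page_header, offsets).

    Assumed layout (not confirmed in this thread): see the notes above.
    """
    _, page_section_start, content_len = struct.unpack_from(">III", data, 0)
    content_header = json.loads(data[12:12 + content_len].decode("utf-8"))

    pos = page_section_start
    _, page_header_len, page_count, _ = struct.unpack_from(">HHHH", data, pos)
    pos += 8
    page_header = json.loads(data[pos:pos + page_header_len].decode("utf-8"))
    pos += page_header_len

    # The Page List table: one big-endian 32-bit offset per entry.
    offsets = list(struct.unpack_from(f">{page_count}I", data, pos))
    return content_header, page_header, offsets


def dump_pages(offsets, raw_markup, width=16):
    """Yield (entry number, offset, first `width` bytes of markup as text)."""
    for number, offset in enumerate(offsets, start=1):
        snippet = raw_markup[offset:offset + width]
        yield number, offset, snippet.decode("utf-8", errors="replace")
```

With the apnx bytes and the contents of assembled_text.dat read in binary mode, looping over dump_pages(offsets, raw_markup) shows whether each offset lands on the opening "<" of a page anchor.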

j.p.s · 01-26-2020, 04:30 PM

Quote:

Originally Posted by KevinH

Yes, the "bad" ones definitely look bad and the "fix" ones look much better. I am surprised that this happens, as in older mobi 6 and mobi 7 internal links are filepos info (file offsets) and in newer mobi8 they encode a base 32 file offset into a character based "id-like" equivalent. Both file offsets should be quite precise and not lead to what you are seeing.

Is it just moving in the wrong direction to get the exact link text? Are the "bad" and "fix" targets in any way close together?

That is very strange.

KevinH

I'm gradually learning more as I look at more books and get more experience.

I'm starting to think publishers are using custom apnx generators or modifying apnx files for some reason. Sometimes there are no page targets present at all. When there are targets, I haven't been able to figure out whether the errors are systematic.

The simplest apnx pageMap is something like "pageMap":"(1,a,1)". Slightly more complicated would be a leading number greater than 1 designating the number of 0 values at the beginning of the page list. The pageMap for Ready Player One and quite a few other books is like "pageMap":"(11,a,1),(386,c,|)". I think the last tuple is used to designate the end of page numbers.

The pageMap in the apnx file for a book with several chapters that end on odd page numbers, or with other blank pages, might look like "pageMap":"(1,a,1),(11,a,12),(12,a,14),(167,a,170),(266,a,270),(268,a,273)", which is from the repaired Ready Player One apnx file (the book has some blank odd pages). The empty pages are accommodated in the simple apnx files by using the same offset as for the previous page in the Page List table at the end of the apnx file.

The amazon (or publisher?) supplied apnx file for A Brief History of Everyone Who Ever Lived has the simple "pageMap":"(17,a,1),(377,c,|)", even though the book has blank pages and roman numeral labeled pages, which do have roman numeral labeled link targets. The repaired apnx for A Brief History of Everyone Who Ever Lived has "pageMap":"(1,r,5),(11,a,1),(23,a,14),(222,a,214),(394,r,1),(396,a,402),(397,r,4)". In many of the books with simple pageMaps, roman numeral pages that do not show as such seem to account for the empty Page List table entries at the top of the table.
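To make the tuples easier to check, here is a small sketch that expands a pageMap string into one label per Page List entry. It rests on my reading of the examples above, which is an assumption: each tuple is (first entry index, type, first label), "a" counts up in arabic numbers, "r" counts up in lowercase roman numerals, "c" carries custom labels separated by "|", and entries before the first tuple have no label. The function names are made up for the example.

```python
import re

TUPLE = re.compile(r"\((\d+),([arc]),([^)]*)\)")

ROMAN = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
         (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
         (5, "v"), (4, "iv"), (1, "i")]


def to_roman(number):
    """Convert a positive integer to a lowercase roman numeral."""
    out = []
    for value, letters in ROMAN:
        while number >= value:
            out.append(letters)
            number -= value
    return "".join(out)


def expand_page_map(page_map, entry_count):
    """Return a list of labels, one per Page List entry ("" = unlabeled)."""
    runs = [(int(i), kind, value) for i, kind, value in TUPLE.findall(page_map)]
    labels = [""] * entry_count
    for k, (start, kind, value) in enumerate(runs):
        # A run lasts until the next tuple starts, or to the end of the table.
        end = runs[k + 1][0] if k + 1 < len(runs) else entry_count + 1
        end = min(end, entry_count + 1)
        custom = value.split("|") if kind == "c" else []
        for pos in range(start, end):
            step = pos - start
            if kind == "a":
                labels[pos - 1] = str(int(value) + step)
            elif kind == "r":
                labels[pos - 1] = to_roman(int(value) + step)
            elif step < len(custom):
                labels[pos - 1] = custom[step]
    return labels
```

Under these assumptions, "(11,a,1),(386,c,|)" leaves the first ten entries unlabeled, labels entry 11 as page 1, and gives entries from 386 on empty labels, which fits the idea that the last tuple marks the end of page numbers.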

jhowell · 01-29-2020, 01:25 PM

Quote:

Originally Posted by j.p.s

I'm starting to think publishers are using custom apnx generators or modifying apnx files for some reason. Sometimes there are no page targets present at all. When there are targets, I haven't been able to figure out whether the errors are systematic.

I doubt the publishers have anything to do with the APNX other than providing the NCX pageList or NAV page-list used by kindlegen.

As far as I can tell kindlegen builds PAGE records for MOBI7 and KF8 that accurately map the offset of the starting '<' of the element containing the ID in the raw HTML content for each page in the EPUB pageList/page-list. It appears that Amazon is manipulating that data in the publishing workflow when processing PAGE records to produce APNX files.

Some Kindle reading apps/devices use the APNX position array, when an APNX is present, to calculate the percent complete shown to the user. The APNX position array often has a number of unlabeled zero entries added to the beginning. I suspect that Amazon adds these entries to account for the amount of front matter before the first assigned page number in order to have the percent shown come out more accurately.

One or more extra entries are often added to the end of the APNX position array, mapped to empty labels. This may also have to do with percentage, but more importantly it prevents the final page number from being shown for the entire remainder of books that contain unnumbered back matter after the last numbered page.

Why the positions are sometimes off by varying small amounts (tens of bytes) is harder to explain. Values appear to be adjusted to correspond to an explicit page break (such as <body> in KF8 or <mbp:pagebreak/> in MOBI7) or to a character of text that will be visible to the reader. My best guess is that Amazon is making adjustments that produce offsets that work better with the mapping that is done between equivalent positions of visible text in the MOBI7 and KF8 formats. This mapping is needed to make notes and highlights match exactly between formats. (Also, associating a printed page with a set of visible characters, rather than particular HTML markup, makes some logical sense.)

If you have other ideas about what is going on during APNX generation I would be interested to hear them.

j.p.s · 01-29-2020, 06:29 PM

Thanks for the helpful insights.

I'll try to write more this weekend.

j.p.s · 02-01-2020, 01:41 PM

Quote:

Originally Posted by jhowell

I doubt the publishers have anything to do with the APNX other than providing the NCX pageList or NAV page-list used by kindlegen.

Fair enough. I know nothing about publisher or amazon workflow, where the boundaries are, or whether the boundaries can shift.

Quote:

As far as I can tell kindlegen builds PAGE records for MOBI7 and KF8 that accurately map the offset of the starting '<' of the element containing the ID in the raw HTML content for each page in the EPUB pageList/page-list. It appears that Amazon is manipulating that data in the publishing workflow when processing PAGE records to produce APNX files.

I am also completely ignorant of the details of kindle format internals or even palmdb internals. I guess I should start picking that up. I've been assuming that kindlegen embeds apnx files into the mobi it produces, but I guess that is just as false as thinking that it embeds an EPUB (neglecting "append source").

Based on the APNX produced by kindlegen, I agree that those maps point to the starting '<' and that the APNX files delivered by amazon are often offset by a few (or many) bytes. Sometimes an incorrect element seems to have been targeted.

When I generated an EPUB from asciidoc marked up text which included page markers that I inserted myself, I noticed that page number entries in kindle GOTO dialogs showing the TOC were 1 too low and that the same would happen for the first page number in a chapter when reading the book. I had assumed that had to do with the asciidoc to epub conversion process, but I saw that it sometimes happened with amazon supplied APNX files. So I understand the motivation to manipulate the mapping. Since it was inconsistent, I assumed it was the publishers doing the manipulating.

Quote:

Some Kindle reading apps/devices use the APNX position array, when an APNX is present, to calculate the percent complete shown to the user. The APNX position array often has a number of unlabeled zero entries added to the beginning. I suspect that Amazon adds these entries to account for the amount of front matter before the first assigned page number in order to have the percent shown come out more accurately.

When I first started paying attention to page numbers on kindles, I wondered whether page numbers would be used for % in book, but as near as I can tell on my kindles, "location" is still used. I've posted elsewhere about books with extensive heavily formatted end matter where 50% to 70% is shown at the end of the last chapter but where the page number is a much higher percentage of the total pages.

I agree that one reason for zeroed entries at the beginning of the map is because page 1 is well into the book. Frequently the pbook has roman numerals for those pages and the ebook shows no label for those pages. But some amazon delivered APNX files do have roman numeral map entries. That is part of why I suspected it varied with publisher as opposed to amazon shenanigans, but it would not be surprising if amazon is inconsistent in its process. (Often the roman numeral page id elements are in the raw book HTML, but not in the amazon delivered APNX.)

Quote:

One or more extra entries are often added to the end of the APNX position array, mapped to empty labels. This may also have to do with percentage, but more importantly it prevents the final page number from being shown for the entire remainder of books that contain unnumbered back matter after the last numbered page.

One of the first problems I noticed with amazon page numbers was a "final" page number that showed for the rest of the book. It stuck out like a sore thumb in the "GOTO" TOC where there were quite a few TOC entries with that "final" page number. I assumed it was a glitch in the (presumably epub) book source, but it turned out that putting a proper pagelist or page-map in the kindleunpack generated epub, feeding that to kindlegen, extracting the apnx, and using that with the original amazon supplied azw3 fixed page number display for that book.

Quote:

Why the positions are sometimes off by varying small amounts (tens of bytes) is harder to explain. Values appear to be adjusted to correspond to an explicit page break (such as <body> in KF8 or <mbp:pagebreak/> in MOBI7) or to a character of text that will be visible to the reader. My best guess is that Amazon is making adjustments that produce offsets that work better with the mapping that is done between equivalent positions of visible text in the MOBI7 and KF8 formats. This mapping is needed to make notes and highlights match exactly between formats. (Also, associating a printed page with a set of visible characters, rather than particular HTML markup, makes some logical sense.)

Could be, and I am fine with that if true.

But in my very tiny sampling the incidence of glitches and outright SNAFU is very high. That's annoying, but the good news is that it is possible for individuals to make the fix themselves.

Quote:

If you have other ideas about what is going on during APNX generation I would be interested to hear them.

I seem to be better at detecting the problems and coming up with strategies to fix than determining why they happened in the first place, but I certainly welcome the discussion and will try to contribute when I can.

jhowell · 02-02-2020, 09:11 AM

Quote:

Originally Posted by j.p.s

I know nothing about publisher or amazon workflow, where the boundaries are, or whether the boundaries can shift.

Besides manipulating page numbers, Amazon also makes changes to the location where books open for the first time and often removes fonts embedded by the publisher. Conversion to KFX format makes extensive formatting changes. There may be other changes made that I am unaware of.

When something goes wrong it is often hard to tell whether it is the publisher's fault or Amazon's.

Quote:

Originally Posted by j.p.s

(Often the roman numeral page id elements are in the raw book HTML, but not in the amazon delivered APNX.)

It may be that the publisher labeled those pages in the HTML content but failed to include them in the pagelist. Or Amazon may have removed them for some unknown reason. I don't know of any way to tell.

Quote:

Originally Posted by j.p.s

But in my very tiny sampling the incidence of glitches and outright SNAFU is very high. That's annoying, but the good news is that it is possible for individuals to make the fix themselves.

Amazon's processing of books has some strange anomalies. I suspect that bugs in their handling of page numbers have gone unnoticed and unfixed because no one has been paying close attention to those details and it works well enough for them.
