10-11-2019, 05:57 PM | #1 |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
On repairing defective apnx files.
(Please don't use this thread to vent that ebooks shouldn't have page numbers, what the numbering scheme should be, or other off topic posts.)
Three of the books that I've read recently with amazon supplied page numbers started out OK, but the page numbers started getting screwy near the end. My best guess is that books with extensive notes, bibliographies, etc. are prone to having HREFs that look like page anchors to whatever tools publishers use to generate <pageList> sections in toc.ncx, and that this is what screws the file up.
I thought it might be possible to repair the apnx files, but that it would be difficult to figure out how, and tedious and time consuming to do. Then I saw post #2 by Doitsu in this thread: https://www.mobileread.com/forums/sh...d.php?t=255926

I don't think kindleunpack has an option to make an epub whose toc.ncx has a <pageList> section, but it turned out to be relatively easy to use kindleunpack -> (some regex and scripting) -> kindlegen -> kindleunpack to get repaired apnx files.

The first step is to look at some of the Text/part0*.xhtml files to learn the form of page anchors used in the book. Next, make a list of file name / anchor id pairs, remove anything fishy from the list, and use it to generate a <pageList> section to insert ahead of the closing </ncx> in toc.ncx. Then use kindlegen on the augmented EPUB. The only thing you need from the fat mobi is the apnx file, which should be renamed to match the one supplied by amazon for the book.

The really good news is that the new apnx file can be copied straight to the sdr directory for the book, overwriting the existing file. (I did this with the book closed on the kindle.) This doesn't seem to faze the kindle at all. The next time the book is opened, the page numbers are correct.

Attached is a perl script to generate a <pageList> section from a list of file name / page id pairs.
Last edited by j.p.s; 01-26-2020 at 11:49 AM. |
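For readers who prefer Python to perl, the pageList step can be sketched roughly as below. This is not the attached script. The function names, the "pt1", "pt2", ... ids, and the rule that a non-numeric label gets type "front" are assumptions made for illustration. The sketch also leaves out the value and playOrder attributes, which strict NCX validation may expect; the thread does not say whether kindlegen needs them.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def build_page_list(pairs):
    """Build an NCX <pageList> from (file name, anchor id) pairs."""
    lines = ["<pageList>"]
    for number, (filename, anchor) in enumerate(pairs, start=1):
        # Assumption: the visible label is the id minus a "page_" or "page-" prefix.
        label = re.sub(r"^page[_-]?", "", anchor)
        # Assumption: non-numeric labels (e.g. roman numerals) are front matter.
        kind = "normal" if label.isdigit() else "front"
        target = quoteattr(filename + "#" + anchor)
        lines.append(f'  <pageTarget id="pt{number}" type="{kind}">')
        lines.append(f"    <navLabel><text>{escape(label)}</text></navLabel>")
        lines.append(f"    <content src={target}/>")
        lines.append("  </pageTarget>")
    lines.append("</pageList>")
    return "\n".join(lines)


def insert_page_list(ncx_text, page_list):
    """Insert the page list just ahead of the closing </ncx> tag."""
    head, closing, tail = ncx_text.rpartition("</ncx>")
    if not closing:
        raise ValueError("no closing </ncx> found")
    return head + page_list + "\n" + closing + tail
```

The file names passed in should be relative to toc.ncx, for example "Text/part0001.xhtml", so that the generated src values resolve inside the unpacked EPUB.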
10-26-2019, 01:29 PM | #2 |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
It was suggested to me privately that Doitsu's pagelist sigil plugin, https://www.mobileread.com/forums/sh...d.php?t=265237, might be useful in repairing books with page number problems.
I tried it on Bad Blood: Secrets and Lies in a Silicon Valley Startup, which uses code like
Code:
<span id="page_3" epub:type="pagebreak" title="3"></span>
Unfortunately, A Brief History of Everyone Who Ever Lived uses
Code:
<a id="page_1"></a>
Code:
<a id="page-1"></a>
When compatible page number markup was used in the ebook production, the sigil pagelist plugin significantly simplifies the apnx repair workflow to kindleunpack -> sigil pagelist -> kindlegen -> kindleunpack. Note that all that is required from all this is the replacement apnx file, which can be copied to the sdr directory on the kindle. The original azw3 file does not need to be changed. |
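When the markup is one of the forms shown above, collecting the file name / anchor id pairs can be sketched with a regular expression. This is a minimal sketch, not the plugin's logic: the function name, the dictionary input, and the assumption that ids look like page_N or page-N (with N either digits or a roman numeral) are all illustrative.

```python
import re

# Matches <a ... id="page_12"> or <span ... id="page-xii"> style page anchors.
PAGE_ID = re.compile(
    r'<(?:a|span)\b[^>]*\sid="(page[_-][0-9ivxlcdm]+)"',
    re.IGNORECASE,
)


def find_page_anchors(files):
    """files maps file name -> XHTML text; returns (file name, id) pairs.

    Files are visited in sorted name order, which matches reading order
    only if the names sort that way (as part0001.xhtml, part0002.xhtml do).
    """
    pairs = []
    for name in sorted(files):
        for match in PAGE_ID.finditer(files[name]):
            pairs.append((name, match.group(1)))
    return pairs
```

The resulting list still needs a look by eye, since link targets in notes or bibliographies can happen to match the same pattern.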
10-26-2019, 08:29 PM | #3 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
You could use kindleunpack to generate the Adobe pagemap.xml, then search for the custom form of id you want in order to verify and/or fix the pagemap.xml file. Then add it to the unpacked epub, pass it through kindlegen, and strip out the apnx info into a separate file.
|
10-27-2019, 01:02 PM | #4 | |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Quote:
But it does seem a bit ironic that the original apnx file isn't even required to make a good one. This implies that for books with page id targets transferred over USB, a "real page number" apnx file can be automatically generated without having to leave airplane mode to get one from amazon. |
|
11-02-2019, 04:45 PM | #5 | |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Quote:
The 3 pairs of page-maps are attached as pagemaps.zip |
|
11-05-2019, 08:32 AM | #6 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Yes, the "bad" ones definitely look bad and the "fix" ones look much better. I am surprised that this happens, since in older mobi 6 and mobi 7 internal links are filepos info (file offsets), and in newer mobi8 they encode a base 32 file offset into a character based "id-like" equivalent. Both file offsets should be quite precise and not lead to what you are seeing.
Is it just moving in the wrong direction to get the exact link text? Are the "bad" and "fix" targets in any way close together? That is very strange. KevinH |
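To illustrate what a base 32 file offset looks like as text, here is a small sketch. It assumes the digit alphabet 0-9 followed by A-V, which is what Python's int(text, 32) accepts, and a 10 character zero padded field. Both the function names and the field width are assumptions for illustration, not something stated in this thread.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def decode_base32(text):
    """Turn a base 32 digit string into an integer offset."""
    return int(text, 32)


def encode_base32(value, width=10):
    """Turn a non-negative integer offset into a zero padded base 32 string."""
    out = ""
    while value:
        value, remainder = divmod(value, 32)
        out = DIGITS[remainder] + out
    return out.rjust(width, "0")
```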
11-05-2019, 11:11 AM | #7 | |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Quote:
I don't think the problem is with the conversion to KF. The bad apnx is a case of garbage in, garbage out. I don't have a way to get the commercial EPUBs, so I can't investigate further. (No account at EPUB retailer, library, etc. and unwillingness to have anything to do with Adobe.) |
|
11-09-2019, 09:35 AM | #8 | |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Quote:
I still think this is triggered by books with extensive footnotes but have lost some confidence in that. Still pretty sure bogus pagelist or page-map in the publisher supplied epub is the cause, but have no way to check. |
|
12-07-2019, 02:43 PM | #9 | |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Quote:
I paginated an EPUB by hand by inserting anchors based on a PDF scan of the book and generated a pagelist from a list of the anchors. The apnx files generated by running kindlegen on the EPUB and kindleunpack on the kindlegen output point to the opening "<" of the anchor for both the mobi7 and mobi8 raw markup. (The mobi7 markup has empty <a ></a>.)

I had not previously looked into the pagination of books in which I did not notice any problem when reading. I wrote a script to dump the page table of offsets at the end of an apnx file and optionally 16 characters from the raw markup (assembled_text.dat) beginning at each offset.

No commercial book perfectly matched at every page, but a few came close, with a couple actually matching on almost every page. Some had small offsets, others larger. Sometimes the offset was not a fixed amount. A few did not have anchors or spans that indicated page boundaries, so I have no idea how accurate the apnx offsets are.

I'm attaching apnx_dump.pl |
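The comparison step can be sketched in Python as follows. This is not apnx_dump.pl. It assumes the page offsets have already been extracted from the apnx file (the binary parsing is not shown) and that raw holds the bytes of assembled_text.dat. The function name and the returned tuple layout are illustrative.

```python
def check_offsets(raw, offsets, width=16):
    """Show what the raw markup looks like at each page offset.

    Returns one (page index, offset, snippet, starts_with_tag) tuple per
    offset. starts_with_tag is True when the offset lands on a "<", which
    is where kindlegen-built page maps are reported to point.
    """
    rows = []
    for page, offset in enumerate(offsets, start=1):
        snippet = raw[offset:offset + width]
        rows.append(
            (page, offset, snippet.decode("utf-8", "replace"), snippet.startswith(b"<"))
        )
    return rows
```

Printing the rows for a commercial apnx next to those for a kindlegen-built one makes the small shifts described above easy to spot.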
|
01-26-2020, 04:30 PM | #10 | |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Quote:
I'm starting to think publishers are using custom apnx generators or modify apnx files for some reason. Sometimes there are no page targets present at all. When there are targets, I haven't been able to figure out whether the errors are systematic.

The simplest apnx pageMap is something like "pageMap":"(1,a,1)". Slightly more complicated would be a leading number greater than 1 designating the number of 0 values at the beginning of the page list. The pageMap for Ready Player One and quite a few other books is like "pageMap":"(11,a,1),(386,c,|)". I think the last tuple is used to designate the end of page numbers.

The pageMap in the apnx file for a book with several chapters with odd last page numbers or other blank pages might look like "pageMap":"(1,a,1),(11,a,12),(12,a,14),(167,a,170),(266,a,270),(268,a,273)", which is from the repaired Ready Player One apnx file (which has some blank odd pages). The empty pages are accommodated in the simple apnx files by using the same offset as for the previous page in the Page List table at the end of the apnx file.

The amazon (or publisher?) supplied apnx file for A Brief History of Everyone Who Ever Lived has the simple "pageMap":"(17,a,1),(377,c,|)", even though the book has blank pages and roman numeral labeled pages, which do have roman numeral labeled link targets. The repaired apnx for A Brief History of Everyone Who Ever Lived is "pageMap":"(1,r,5),(11,a,1),(23,a,14),(222,a,214),(394,r,1),(396,a,402),(397,r,4)".

In many books with simple pageMaps, roman numeral pages that do not show as such seem to be responsible for the empty Page List table entries at the top of the table. |
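One way to read the pageMap strings above is as a list of (first Page List entry, label type, starting label) tuples. The sketch below expands a pageMap into one label per entry under that reading. It is an interpretation based on the examples in this post, not a documented format: "a" is taken as arabic, "r" as lowercase roman, "c" entries are simply left unlabeled, and entries before the first tuple are left unlabeled too. The function names are made up for illustration.

```python
import re

ROMAN = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
         (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
         (5, "v"), (4, "iv"), (1, "i")]


def to_roman(number):
    """Lowercase roman numeral for a positive integer."""
    out = []
    for value, symbol in ROMAN:
        while number >= value:
            out.append(symbol)
            number -= value
    return "".join(out)


def expand_page_map(page_map, entry_count):
    """Return one label per Page List entry (entries are numbered from 1)."""
    tuples = re.findall(r"\((\d+),([arc]),([^)]*)\)", page_map)
    labels = [""] * entry_count
    for i, (start, kind, value) in enumerate(tuples):
        first = int(start)
        # A range runs until the entry before the next tuple, or to the end.
        last = int(tuples[i + 1][0]) - 1 if i + 1 < len(tuples) else entry_count
        for step, entry in enumerate(range(first, min(last, entry_count) + 1)):
            if kind == "a":
                labels[entry - 1] = str(int(value) + step)
            elif kind == "r":
                labels[entry - 1] = to_roman(int(value) + step)
            # kind "c": left unlabeled (assumption, see lead-in)
    return labels
```

Under this reading, "(1,r,5),(11,a,1),(23,a,14)" labels the first ten entries v through xiv, entries 11 to 22 as 1 through 12, and entry 23 as 14, so page 13 is skipped as a blank page.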
|
01-29-2020, 01:25 PM | #11 | |
Grand Sorcerer
Posts: 6,496
Karma: 84420419
Join Date: Nov 2011
Location: Tampa Bay, Florida
Device: Kindles
|
Quote:
As far as I can tell kindlegen builds PAGE records for MOBI7 and KF8 that accurately map the offset of the starting '<' of the element containing the ID in the raw HTML content for each page in the EPUB pageList/page-list. It appears that Amazon is manipulating that data in the publishing workflow when processing PAGE records to produce APNX files.

Some Kindle reading apps/devices use the APNX position array, when an APNX is present, to calculate the percent complete shown to the user. The APNX position array often has a number of unlabeled zero entries added to the beginning. I suspect that Amazon adds these entries to account for the amount of front matter before the first assigned page number in order to have the percent shown come out more accurately.

One or more extra entries are often added to the end of the APNX position array, mapped to empty labels. This may also have to do with percentage, but more importantly it prevents the final page number from being shown for the entire remainder of books that contain unnumbered back matter after the last numbered page.

Why the positions are sometimes off by varying small amounts (tens of bytes) is harder to explain. Values appear to be adjusted to correspond to an explicit page break (such as <body> in KF8 or <mbp:pagebreak/> in MOBI7) or to a character of text that will be visible to the reader. My best guess is that Amazon is making adjustments that produce offsets that work better with the mapping that is done between equivalent positions of visible text in the MOBI7 and KF8 formats. This mapping is needed to make notes and highlights match exactly between formats. (Also, associating a printed page with a set of visible characters, rather than particular HTML markup, makes some logical sense.)

If you have other ideas about what is going on during APNX generation I would be interested to hear them.
Last edited by jhowell; 01-30-2020 at 05:39 AM. |
|
01-29-2020, 06:29 PM | #12 |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Thanks for the helpful insights.
I'll try to write more this weekend. |
02-01-2020, 01:41 PM | #13 | ||||||
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Quote:
Quote:
Based on the APNX produced by kindlegen, I agree that those maps point to the starting '<' and that the APNX delivered by amazon are often offset by a few (or many) bytes. Sometimes an incorrect element seems to have been targeted.

When I generated an EPUB from asciidoc marked up text which included page markers that I inserted myself, I noticed that page number entries in kindle GOTO dialogs showing the TOC were 1 too low and that the same would happen for the first page number in a chapter when reading the book. I had assumed that had to do with the asciidoc to epub conversion process, but I saw that it sometimes happened with amazon supplied APNX files. So I understand the motivation to manipulate the mapping. Since it was inconsistent, I assumed it was the publishers doing the manipulating.
Quote:
I agree that one reason for zeroed entries at the beginning of the map is that page 1 is well into the book. Frequently the pbook has roman numerals for those pages and the ebook shows no label for those pages. But some amazon delivered APNX files do have roman numeral map entries. That is part of why I suspected it varied with publisher as opposed to amazon shenanigans, but it would not be surprising if amazon is inconsistent in its process. (Often the roman numeral page id elements are in the raw book HTML, but not in the amazon delivered APNX.)
Quote:
Quote:
But in my very tiny sampling the incidence of glitches and outright SNAFU is very high. That's annoying, but the good news is that it is possible for individuals to make the fix themselves.
Quote:
|
||||||
02-02-2020, 09:11 AM | #14 | ||
Grand Sorcerer
Posts: 6,496
Karma: 84420419
Join Date: Nov 2011
Location: Tampa Bay, Florida
Device: Kindles
|
Quote:
When something goes wrong it is often hard to tell whether it is the publisher's fault or Amazon's. Quote:
Amazon's processing of books has some strange anomalies. I suspect that bugs in their handling of page numbers have gone unnoticed and unfixed because no one has been paying close attention to those details and it works well enough for them. |
||