02-02-2014, 04:06 PM | #1 |
Enthusiast
Posts: 40
Karma: 10
Join Date: Feb 2014
Device: Kindle 4
|
Real Page Numbers
I've been exploring the apnx generator and really like how I can now get page numbers in my Kindle instead of just location numbers.
However, while the estimated page numbers are fine most of the time, as an academic it's sometimes important that I know exactly which page I'm on when constructing a citation reference. To do this would require some manual editing of the ebook to mark where pages start. That's a lot of work, but I only need to do it for a limited number of books, so I consider it a reasonable trade-off in some circumstances. To that end, I'd like some feedback on how to make this work. My thoughts are thus:
On a related note, does anyone know how the apnx files handle pages in the front matter which are numbered with Roman numerals, and then the page count resetting when the main matter of the book starts? I should note that I can program in Python, and thus could make the necessary code modifications to apnx.py myself. However, I don't know how to integrate those changes into the user interface of calibre. My coding work has all been for people who can read and manipulate source code. I've never worried about a user interface before (beyond simple raw_input/input prompts). Thus, while I'm perfectly willing to do the under-the-hood work, I'll need some help getting it integrated. |
02-02-2014, 04:29 PM | #2 |
Grand Sorcerer
Posts: 6,205
Karma: 16228558
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
|
I know nothing about Kindle apnx files, but I often see html markup something like
<a id="p100" /> or <a id="pviii"></a> in retail epubs. The markup is not visible when reading. There is also occasionally an xml page-map file in epubs. I know very little about them other than I had to remove them when I was reading on an old Sony device because they caused me problems. Hopefully someone with better knowledge can add more. Edit: An epub page-map looks something like this Code:
<?xml version='1.0' encoding='utf-8'?>
<page-map xmlns="http://www.idpf.org/2007/opf">
  <page href="OEBPS/copyright.html#piii" name="piii"/>
  <page href="OEBPS/copyright.html#piv" name="piv"/>
  <page href="OEBPS/preface001.html#pi" name="pi"/>
  <page href="OEBPS/ad-card.html#pii" name="pii"/>
  <page href="OEBPS/dedication.html#pv" name="pv"/>
  <page href="OEBPS/acknowledgements.html#pvi" name="pvi"/>
  <page href="OEBPS/acknowledgements.html#pvii" name="pvii"/>
  <page href="OEBPS/part001.html#pviii" name="pviii"/>
  <page href="OEBPS/part001.html#p1" name="p1"/>
  <page href="OEBPS/chapter003.html#p2" name="p2"/>
  <page href="OEBPS/chapter003.html#p3" name="p3"/>
  <page href="OEBPS/chapter003.html#p4" name="p4"/>
  etc ... etc ...
</page-map>
Last edited by jackie_w; 02-02-2014 at 04:33 PM. Reason: more info |
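For anyone wanting to read such a page-map from a script, here is a minimal sketch using only the Python standard library. The function name `read_page_map` and the shortened `SAMPLE` document are illustrative assumptions, not part of calibre or of any epub tooling; only the element names and the namespace are taken from the example above.

```python
import xml.etree.ElementTree as ET

# Namespace as it appears in the page-map example quoted in this post.
PAGE_MAP_NS = "http://www.idpf.org/2007/opf"

# Shortened, illustrative copy of the example above.
SAMPLE = """<?xml version='1.0' encoding='utf-8'?>
<page-map xmlns="http://www.idpf.org/2007/opf">
  <page href="OEBPS/copyright.html#piii" name="piii"/>
  <page href="OEBPS/part001.html#p1" name="p1"/>
  <page href="OEBPS/chapter003.html#p2" name="p2"/>
</page-map>"""


def read_page_map(xml_text):
    """Return a list of (name, file, anchor) tuples in document order."""
    # Encode to bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    pages = []
    for page in root.iter("{%s}page" % PAGE_MAP_NS):
        path, _, anchor = page.get("href", "").partition("#")
        pages.append((page.get("name"), path, anchor))
    return pages


for entry in read_page_map(SAMPLE):
    print(entry)
```

Each entry pairs a page label with the file and anchor id where that page starts, which is the same information a page-number sidecar would ultimately need.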
02-02-2014, 09:18 PM | #3 |
creator of calibre
Posts: 43,835
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You don't need to use tags for this; you can use data- prefixed attributes. They are ignored by renderers.
The proper solution is of course to reverse engineer whatever facility Amazon uses for real page numbers in apnx by buying a few azw3 books with real page numbers, but that will likely be a lot of effort. Once the reverse engineering is done you can use the pagelist technology from epub and map it into the equivalent structure in azw3. This is assuming the azw3 version is in the file and not in a sidecar file. |
02-07-2014, 10:06 AM | #4 |
Enthusiast
Posts: 40
Karma: 10
Join Date: Feb 2014
Device: Kindle 4
|
Correct me if I'm wrong, but isn't an attribute a property of a tag? I.e. I can't just put 'data-page="1"' in the text of the file (it would be treated as text if I did) but must put something like '<wbr data-page="1">'. Now, I'll grant you that if every page break occurred at the start of a new element (heading, paragraph, etc.), one could simply add that attribute to the appropriate opening tag, but page breaks often occur in the middle of a paragraph, where there is no existing tag to attach the attribute to. I would thus need to introduce a tag in those locations. Further, I would argue that for consistency's sake it would be better if all page break locations, not just those in the middle of a paragraph, were marked by the same element. This makes them easier for a human reader to find.
In researching the data- attribute (which I hadn't heard of before) I discovered the word break opportunity (wbr) tag, which I think is a good candidate for marking page locations (hence my use of it above). It's a void element, and thus doesn't require a companion closing tag (unlike an anchor (a) tag). It is new in HTML5 and is intended for marking line break opportunities in really long words. For both reasons, it should be unlikely to appear in most books. My quick testing shows that it is preserved in azw3 and doesn't affect the viewing of the document. Of course, that's only needed if the reverse engineering process doesn't pan out. A quick search on Amazon found that they do have at least some free books with real page numbers. Not anything I would normally want to read, but then that isn't the purpose here. I haven't had the chance to "buy" them yet to discover what their file format is (Amazon doesn't list the file format in the item description), but hopefully there are enough to find some in azw3 format. I'll start looking for that this weekend, hopefully. |
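To make the wbr idea concrete, here is a minimal sketch of how a script might locate such markers. It assumes the hypothetical convention from the post above (`<wbr data-page="N">`, optionally self-closed); the function name `find_page_markers` is invented for illustration and is not part of apnx.py.

```python
import re

# Hypothetical marker convention from the post: <wbr data-page="N">
# (the optional "/" also accepts the XHTML self-closing form).
WBR_PAGE = re.compile(r'<wbr\s+data-page="([^"]+)"\s*/?>')


def find_page_markers(html):
    """Return (page_label, character_offset) for every marker in html."""
    return [(m.group(1), m.start()) for m in WBR_PAGE.finditer(html)]


sample = '<p>End of page one <wbr data-page="2">and the start of page two.</p>'
print(find_page_markers(sample))
```

The offsets here are positions in the markup as a string. How those would translate into the locations an .apnx file expects is not established in this thread, so treat that mapping as an open question.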
02-07-2014, 05:25 PM | #5 |
Enthusiast
Posts: 40
Karma: 10
Join Date: Feb 2014
Device: Kindle 4
|
So, I got around to checking on those files a bit sooner than I thought I would and found that like the .mobi format (which isn't editable) the .azw3 format (which is) uses the side-along .apnx file to mark page numbers. Further, if you open an .azw3 file to edit it, there is nothing in the file that marks the pagination. Amazon must have some other way of producing the .apnx file. Obviously at some point someone has to match places in the text with the beginnings of pages to produce the .apnx file, but that work is not done in the .azw3 file (or if it is, it's stripped out by amazon before the book ships or by calibre when it opens the book for editing).
So, I think I'm back to my original plan (manually mark the page breaks in the text and then use a modified apnx.py to create the side-along file). On a related note, I've noticed that apnx.py only works on .mobi formats, not .azw3. So making this work will also involve modifying it to accept a new format. |
02-07-2014, 05:48 PM | #6 | |
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
According to APNX#Kindle_publishing:
Quote:
|
|
02-07-2014, 10:12 PM | #7 |
creator of calibre
Posts: 43,835
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You cannot use calibre to check if an azw3 file has page information: since calibre knows nothing about page information in azw3, it will just discard any such data present in the azw3. You would need to dump all records in the azw3 using
calibre-debug file.azw3 or the kindleunpack program, then use a hex editor to examine any records that look like they contain page information and reverse engineer them. I'd start with the PAGE record. However, since Amazon appears to strip the PAGE record from books it delivers to devices, it seems likely that the actual Kindles won't use them. So even if you figure out how to create them, you would then need to modify the apnx generator code in calibre to strip them and convert them to apnx when sending the azw3 files to the Kindle. |
02-08-2014, 03:21 PM | #8 |
Enthusiast
Posts: 40
Karma: 10
Join Date: Feb 2014
Device: Kindle 4
|
Hmm... Reverse engineering from a hex representation is beyond my immediate abilities and I don't have the time at the moment to learn.
However, looking at the information in the wiki (and pages it links to), I might have a close substitute (based, it seems, largely on how ePub does it). Within the document, pages are marked as follows: <span epub:type="pagebreak" id="page_ii" title="ii"/>
Problem: When an .azw3 file is saved and then reopened by calibre, the ":" character in the first attribute is converted to "U0003A". Not sure what, if any, effect this will have. Fixing it probably involves modifying the .azw3 encoder/decoder to recognize the ":" character as valid within the span tag (at least within this context).
For compatibility with KindleGen, a page_map.xml file would have to be added to the book (just like an ePub does). Given the above markers in the text, a script could easily be written that would generate this file automatically.
Problem: When an .azw3 file is saved and then reopened by calibre, this file is currently lost. Again, fixing this probably involves modifying the .azw3 encoder/decoder to recognize it as a valid part of the file.
Also, an ePub would normally add a dc:source element to the document metadata to indicate the print source. Presumably KindleGen needs something similar, but I cannot find anything specifically about this. In any case, calibre currently will not retain such an element in the metadata.opf resource of an .azw3 file.
However, since I'm not looking to push my documents through Amazon publishing (and thus not using KindleGen), I don't think I actually need to deal with these problems. All I need is some way to mark the page breaks that apnx.py can be programmed to recognize. If I use the ePub page break marker, I can write a script that looks for it in the document as it is actually stored and uses those locations to generate the .apnx file.
The ePub marker meets the criteria that I was looking for earlier. It's invisible, and thus doesn't affect rendering of the book; the presence of "pagebreak" in it makes it easily identifiable and otherwise unlikely to appear in a book; and the id (and the title) are human readable for editing purposes. Further, should someone else be motivated to actually address the problems I identified above (making the process compatible with KindleGen and thus enabling publishing with real page numbers through Amazon), a simple find and replace can correct the tag, somewhat future-proofing this modification. Comments on this idea? |
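The "script that generates the page-map from the markers" mentioned above could look roughly like the following sketch. Everything here is an assumption for illustration: the function names, the fixed attribute order in the regular expression, and the guess that the colon substitution produces `epubU0003Atype` (upper or lower case). It is not taken from apnx.py or KindleGen.

```python
import re
from xml.sax.saxutils import quoteattr

# Matches the marker from the post, and also the form assumed to be left
# behind after a save/reopen cycle, where ":" has become "U0003A"/"u0003a".
# Attribute order is assumed to be exactly type, id, title.
PAGEBREAK = re.compile(
    r'<span\s+epub(?::|[Uu]0003[Aa])type="pagebreak"\s+'
    r'id="(?P<id>[^"]+)"\s+title="(?P<title>[^"]+)"\s*/>'
)


def collect_pagebreaks(files):
    """files: list of (href, html_text) in reading order.

    Returns a list of (href, id, title, offset) tuples."""
    found = []
    for href, html in files:
        for m in PAGEBREAK.finditer(html):
            found.append((href, m.group("id"), m.group("title"), m.start()))
    return found


def build_page_map(pagebreaks):
    """Render the markers as a page-map like the one quoted earlier."""
    lines = ["<?xml version='1.0' encoding='utf-8'?>",
             '<page-map xmlns="http://www.idpf.org/2007/opf">']
    for href, id_, title, _offset in pagebreaks:
        target = quoteattr("%s#%s" % (href, id_))  # adds the quotes itself
        lines.append("  <page href=%s name=%s/>" % (target, quoteattr(title)))
    lines.append("</page-map>")
    return "\n".join(lines)


files = [
    ("text/part1.html",
     '<p>a<span epub:type="pagebreak" id="page_ii" title="ii"/>b</p>'),
    ("text/part2.html",
     '<p><span epubU0003Atype="pagebreak" id="page_1" title="1"/>c</p>'),
]
print(build_page_map(collect_pagebreaks(files)))
```

The same `collect_pagebreaks` output (file plus offset) is what a modified page-number generator would start from; the regular expression is deliberately strict so that ordinary spans in a book are not mistaken for markers.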
02-08-2014, 10:17 PM | #9 |
creator of calibre
Posts: 43,835
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
IIRC epub:type="pagebreak" is an epub3 specific extension. Currently, almost nothing supports it.
The colon refers to an XML namespace. If you want to use it, you have to declare the epub namespace and make sure the document you are modifying is valid XML. The IDPF just likes to make everyone's life harder by using XHTML instead of plain HTML 5. Other than that, it's fine, although note that inserting an empty span tag into a document can have side effects, since the document can use CSS selectors based on tag counts. As I said before, the only sure way of modifying the document with no side effects is to use data- attributes. But that has the limitation of restricting page markers to existing tag locations. |
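The point about declaring the namespace can be seen with a strict XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on two small made-up documents: one that declares the epub prefix (bound to the EPUB ops namespace URI) and one that does not. The helper name `pagebreak_titles` is illustrative only.

```python
import xml.etree.ElementTree as ET

EPUB_NS = "http://www.idpf.org/2007/ops"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Same markup twice: once with the epub prefix declared, once without.
declared = (
    '<html xmlns="%s" xmlns:epub="%s"><body><p>text'
    '<span epub:type="pagebreak" id="page_ii" title="ii"/>more</p>'
    '</body></html>' % (XHTML_NS, EPUB_NS)
)
undeclared = declared.replace(' xmlns:epub="%s"' % EPUB_NS, "")


def pagebreak_titles(xhtml):
    """Return the title of every span marked epub:type="pagebreak"."""
    root = ET.fromstring(xhtml)
    key = "{%s}type" % EPUB_NS  # namespaced attribute name
    return [el.get("title")
            for el in root.iter("{%s}span" % XHTML_NS)
            if el.get(key) == "pagebreak"]


print(pagebreak_titles(declared))
try:
    pagebreak_titles(undeclared)
except ET.ParseError as err:
    # An unbound prefix makes the document not well-formed XML.
    print("rejected:", err)
```

With the declaration in place the marker is found by its namespaced attribute; without it, the parser refuses the document outright, which is why the declaration has to survive whatever tool rewrites the file.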
02-09-2014, 08:56 PM | #10 |
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Moderator Notice
I failed to comprehend what I read and previously moved this thread out of the development forum. Upon review I was wrong. The thread has been restored to its original location. |
02-14-2014, 12:01 PM | #11 | |||
Enthusiast
Posts: 40
Karma: 10
Join Date: Feb 2014
Device: Kindle 4
|
Quote:
Quote:
Unfortunately, the editor currently doesn't know what to do with this declaration. If it is added to the metadata tag in the metadata file (where several other namespaces are declared), the declaration is lost in a save/close/reopen cycle. Uses of the namespace in the text files are unaffected (":" still gets converted to u0003a). If I try to use one of the other namespaces that are declared in the same place (dc, opf, calibre), the character swap still happens. If I declare the epub namespace within the html tag of a text document, then the declaration is removed and the namespace is stripped from the tags where it is used (i.e. epub:type="pagebreak" becomes type="pagebreak"). This behavior is all specific to editing azw3 files; editing ePubs exhibits none of these behaviors (ePubs even retain the ":" when the namespace hasn't been declared). As for the file being valid XML, isn't that a given? I understood azw3 to be an Amazon-specific compilation of ePub. Since ePub files have to be valid XML (or more specifically XHTML), shouldn't an azw3 file be valid XML? Am I missing something? Quote:
|
|||
02-14-2014, 11:03 PM | #12 | ||||
creator of calibre
Posts: 43,835
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Quote:
Quote:
Quote:
Quote:
|
||||
02-21-2014, 04:12 PM | #13 | |||
Enthusiast
Posts: 40
Karma: 10
Join Date: Feb 2014
Device: Kindle 4
|
Quote:
Quote:
Quote:
Anyway, it looks like I've got enough information to get to work now. |
|||
05-08-2014, 05:50 PM | #14 |
Enthusiast
Posts: 40
Karma: 10
Join Date: Feb 2014
Device: Kindle 4
|
Okay, so I finally got around to doing something here and have come up with something that appears to work for me.
What I ended up doing was as follows:
Attached are my version of apnx.py (zipped up) and a book in azw3 format with the pages already marked. Edit: I've now attached a new book which I believe to be out of copyright. Published in 1926, the original author died in 1868 and the translator died in 1902. It has 93 pages of about 27 lines of ~47 characters. The book also has 14 pages of front matter which are not included in the page count. I've only marked the pages in the main body, so you should get 93 pages using my code, with page 1 occurring after the table of contents. I'd appreciate it if others could test it out and provide advice on the implementation.
*I should note that the "<mbp:pagebreak/>" tag suffers from the same problem with the colon being replaced by u0003a that I described earlier.
Copyrighted material may not be posted on Mobileread. Removed.
Last edited by rpspringuel; 05-13-2014 at 01:32 PM. Reason: New book upload. |
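For testers, a rough sanity check on the figures quoted above (93 pages, about 27 lines of ~47 characters each) might look like the sketch below. The function `page_stats` and the idea of feeding it character offsets of the page markers are assumptions for illustration; the attached apnx.py may count positions differently, and markup characters will inflate the averages somewhat.

```python
# Figures quoted in the post above.
EXPECTED_PAGES = 93
EXPECTED_CHARS_PER_PAGE = 27 * 47  # about 1269 characters of text per page


def page_stats(offsets, text_length):
    """offsets: sorted character offsets where each marked page starts.

    Returns (number_of_pages, mean_characters_per_page)."""
    bounds = list(offsets) + [text_length]
    sizes = [end - start for start, end in zip(bounds, bounds[1:])]
    return len(sizes), sum(sizes) / len(sizes)


# Toy input: three evenly sized pages.
count, mean_size = page_stats([0, 1269, 2538], 3807)
print(count, mean_size, EXPECTED_CHARS_PER_PAGE)
```

If the marker count is not 93, or the mean page size is far from roughly 1269 characters, that would suggest markers were lost or duplicated somewhere in the save/reopen cycle.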
05-08-2014, 08:30 PM | #15 |
Enthusiast
Posts: 40
Karma: 10
Join Date: Feb 2014
Device: Kindle 4
|
So, it seems my understanding of copyright law was flawed and my test book was still in copyright. Sorry about that. I'll go looking for something that's out of copyright and create a new test book for those interested in testing.
|