Real Page Numbers

rpspringuel · 02-02-2014, 04:06 PM

I've been exploring the apnx generator and really like how I can now get page numbers in my Kindle instead of just location numbers.

However, while the estimated page numbers are fine most of the time, as an academic it's sometimes important that I know exactly which page I'm on when constructing a citation reference. Obviously, doing this would require some manual editing of the ebook to mark where pages start. That's a lot of work, but I only need to do it for a limited number of books, so I consider it a reasonable trade-off in some circumstances.

To that end, I'd like some feedback on how to make this work. My thoughts are thus:

Use a tag to mark pages. This tag should be unique and unlikely to appear within a book normally. It also should not print anything to the screen of the reader so as to not interfere with the reading of the book. Normally this would lead me to use a special comment like "<!--page-->", but it appears that comments are not retained in an edited azw3 book. Is there anyone familiar enough with the azw3 format to know what sort of tag could be used to fulfill this requirement?
In apnx.py define a new function, get_pages_real, which scans the text like get_pages_accurate does, except instead of trying to count lines and marking a page every 30 lines, it simply marks a page when it encounters the above mentioned tag.
Modify write_apnx so that the parameter "accurate" isn't boolean, but rather accepts three options: real, accurate, fast. If real is called for and fails due to there being no page markers in the text, the algorithm should spit out a warning and then try accurate. If it fails due to DRM, it should spit out a warning and then try fast (as the algorithm currently does for accurate).
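The fallback behavior proposed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not calibre's code: `choose_pages`, `get_pages_estimated`, `NoPageMarkers`, and the `data-page` marker are hypothetical names, and the estimator is a stand-in for the existing accurate/fast methods (the DRM fallback is left out).

```python
import re
import warnings

# Hypothetical marker: any tag that carries a data-page attribute.
PAGE_MARKER = re.compile(rb'<[^<>]*\bdata-page\s*=')


class NoPageMarkers(Exception):
    """Raised when a book contains no manual page markers."""


def get_pages_real(text):
    """Return the byte offset of each page marker in the book text."""
    pages = [m.start() for m in PAGE_MARKER.finditer(text)]
    if not pages:
        raise NoPageMarkers('no page markers found')
    return pages


def get_pages_estimated(text, bytes_per_page=2000):
    """Stand-in for the existing estimators: one page every N bytes."""
    return list(range(0, len(text), bytes_per_page))


def choose_pages(text, method='real'):
    """Try the requested method, then warn and fall back if it cannot be used."""
    if method == 'real':
        try:
            return 'real', get_pages_real(text)
        except NoPageMarkers as err:
            warnings.warn(f'{err}; falling back to an estimate')
    return 'estimated', get_pages_estimated(text)
```

Returning the name of the method that actually ran, alongside the offsets, lets the caller report when a fallback happened.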

On a related note, does anyone know how the apnx files handle pages in the front matter which are numbered with roman numerals and then the page count resetting when the main matter of the book starts?

I should note that I can program in python, and thus could make the necessary code modifications myself to apnx.py. However, I don't know how to integrate those changes into the user interface of calibre. My coding work has all been for people who can read and manipulate source code. I've never worried about a user interface before (beyond simple raw_input/input prompts). Thus while I'm perfectly willing to do the under the hood work I'll need some help getting it integrated.

jackie_w · 02-02-2014, 04:29 PM

I know nothing about Kindle apnx files, but I often see html markup something like
<a id="p100" /> or <a id="pviii"></a> in retail epubs. The markup is not visible when reading.

There is also occasionally an xml page-map file in epubs. I know very little about them other than I had to remove them when I was reading on an old Sony device because they caused me problems.

Hopefully someone with better knowledge can add more.

Edit: An epub page-map looks something like this

Code:

<?xml version='1.0' encoding='utf-8'?>
<page-map xmlns="http://www.idpf.org/2007/opf">
  <page href="OEBPS/copyright.html#piii" name="piii"/>
  <page href="OEBPS/copyright.html#piv" name="piv"/>
  <page href="OEBPS/preface001.html#pi" name="pi"/>
  <page href="OEBPS/ad-card.html#pii" name="pii"/>
  <page href="OEBPS/dedication.html#pv" name="pv"/>
  <page href="OEBPS/acknowledgements.html#pvi" name="pvi"/>
  <page href="OEBPS/acknowledgements.html#pvii" name="pvii"/>
  <page href="OEBPS/part001.html#pviii" name="pviii"/>
  <page href="OEBPS/part001.html#p1" name="p1"/>
  <page href="OEBPS/chapter003.html#p2" name="p2"/>
  <page href="OEBPS/chapter003.html#p3" name="p3"/>
  <page href="OEBPS/chapter003.html#p4" name="p4"/>
etc ... etc ...
</page-map>

and is referenced in the <spine> tag of the opf file, e.g. <spine toc="ncx" page-map="page-map">
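For anyone who wants to read such a page-map programmatically, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function name `read_page_map` and the shortened sample are illustrative only; the element and attribute names are the ones in the example above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the page-map shown above (bytes, since it declares an encoding).
PAGE_MAP = b"""<?xml version='1.0' encoding='utf-8'?>
<page-map xmlns="http://www.idpf.org/2007/opf">
  <page href="OEBPS/copyright.html#piii" name="piii"/>
  <page href="OEBPS/part001.html#p1" name="p1"/>
</page-map>"""

NS = {'opf': 'http://www.idpf.org/2007/opf'}


def read_page_map(xml_bytes):
    """Return a list of (page name, file path, fragment id) tuples."""
    root = ET.fromstring(xml_bytes)
    pages = []
    for page in root.findall('opf:page', NS):
        # Split "file.html#anchor" into the file and the anchor inside it.
        path, _, fragment = page.get('href').partition('#')
        pages.append((page.get('name'), path, fragment))
    return pages


for name, path, fragment in read_page_map(PAGE_MAP):
    print(name, path, fragment)
```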

kovidgoyal · 02-02-2014, 09:18 PM

You don't need to use tags for this; you can use data- prefixed attributes. They are ignored by renderers.

The proper solution is of course to reverse engineer whatever facility amazon uses for real page numbers in apnx by buying a few azw3 books with real page numbers, but that will likely be a lot of effort. Once the reverse engineering is done you can use the pagelist technology from epub and map it into the equivalent structure in azw3. This is assuming the azw3 version is in the file and not in a sidecar file.

rpspringuel · 02-07-2014, 10:06 AM

Correct me if I'm wrong, but isn't an attribute a property of a tag? I.e. I can't just put 'data-page="1"' in the text of the file (it would be treated as text if I did) but must put something like '<wbr data-page="1">'. Now, I'll grant you that if every page break occurred at the start of a new element (heading, paragraph, etc.), one could simply add that attribute to the appropriate opening element tag, but page breaks often occur in the middle of a paragraph element where there is no existing tag to attach the attribute to. I would thus need to introduce a tag in those locations. Further, I would argue that for consistency's sake it would be better if all page break locations, not just those in the middle of a paragraph, were marked by the same element. This makes them easier to find in a human-readable fashion.

In researching the data- attribute (which I hadn't heard of before) I discovered the wordbreak (wbr) tag, which I think is a good candidate for marking page locations (hence my use of it above). It's a void element, and thus doesn't require a companion closing tag (unlike an anchor (a) tag). It is a new tag to HTML5 and is intended for marking line break opportunities in really long words. For both reasons, it should be unlikely to appear in most books. My quick testing shows that it is a tag which is preserved in azw3 and it doesn't affect the viewing of the document.

Of course, that's if the reverse engineering process doesn't pan out. A quick search on amazon found that they do have at least some books for free with real page numbers. Not anything I would normally want to read, but then that isn't the purpose here. I haven't had the chance to "buy" them yet to discover what their file format is (amazon doesn't list the file format in the item description), but hopefully there's enough to find some in azw3 format. I'll start looking for that this weekend, hopefully.

rpspringuel · 02-07-2014, 05:25 PM

So, I got around to checking on those files a bit sooner than I thought I would and found that like the .mobi format (which isn't editable) the .azw3 format (which is) uses the side-along .apnx file to mark page numbers. Further, if you open an .azw3 file to edit it, there is nothing in the file that marks the pagination. Amazon must have some other way of producing the .apnx file. Obviously at some point someone has to match places in the text with the beginnings of pages to produce the .apnx file, but that work is not done in the .azw3 file (or if it is, it's stripped out by amazon before the book ships or by calibre when it opens the book for editing).

So, I think I'm back to my original plan (manually mark the page breaks in the text and then use a modified apnx.py to create the side-along file).

On a related note, I've noticed that apnx.py only works on .mobi formats, not .azw3. So making this work will also involve modifying it to accept a new format.

eschwartz · 02-07-2014, 05:48 PM

According to APNX#Kindle_publishing:

Quote:

KindleGen version 1.2 does not generate an APNX file directly; it creates a PAGE section in the MobiPocket file which is then stripped and converted to an APNX file by Amazon's publishing service. The KindleGen input can use either a NCX pageList or page-map xml.

kovidgoyal · 02-07-2014, 10:12 PM

You cannot use calibre to check if an azw3 file has page information, since calibre knows nothing about page information in azw3, it will just discard any such data present in the azw3. You would need to dump all records in the azw3 using

calibre-debug file.azw3

or the kindleunpack program then use a hex editor to examine any records that look like they contain page information and reverse engineer them. I'd start with the PAGE record.

However, since amazon appears to strip the PAGE record from books it delivers to devices, it seems likely that the actual Kindles won't use them. So even if you figure out how to create them, you would then need to modify the apnx generator code in calibre to strip them and convert them to apnx when sending the azw3 files to the kindle.

rpspringuel · 02-08-2014, 03:21 PM

Hmm... Reverse engineering from a hex representation is beyond my immediate abilities and I don't have the time at the moment to learn.

However, looking at the information in the wiki (and pages it links to), I might have a close substitute (based, it seems, largely on how ePub does it).

Within the document, pages are marked as follows:
<span epub:type="pagebreak" id="page_ii" title="ii"/>
Problem: When an .azw3 file is saved and then reopened by calibre the ":" character in the first attribute is converted to "U0003A". Not sure what, if any, effect this will have. Fixing it probably involves modifying the .azw3 encoder/decoder to recognize the ":" character as valid within the span tag (at least within this context).

For compatibility with KindleGen, a page_map.xml file would have to be added to the book (just like an ePub does). Given the above markers in the text, a script could easily be written that would generate this file automatically.
Problem: When an .azw3 file is saved and then reopened by calibre currently this file is lost. Again, fixing this probably involves modifying the .azw3 encoder/decoder to recognize this as a valid element to the file.
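A script of the kind mentioned above might look like the following sketch. It assumes the markers are written exactly as in the span example (double-quoted `id` and `title` attributes), that the page name should come from `title`, and that the output should follow the page-map layout posted earlier in the thread; `build_page_map` is a hypothetical name and nothing here has been checked against KindleGen.

```python
import re
import xml.etree.ElementTree as ET

# Matches the opening (or self-closing) span tag used as a page marker.
MARKER = re.compile(r'<span\b[^>]*\bepub:type="pagebreak"[^>]*>')
ATTR = re.compile(r'\b(id|title)="([^"]*)"')


def build_page_map(files):
    """Build page-map XML from a dict of {href inside the book: markup}."""
    # Writing xmlns as a plain attribute puts the children in that namespace too.
    root = ET.Element('page-map', xmlns='http://www.idpf.org/2007/opf')
    for href, markup in files.items():
        for tag in MARKER.findall(markup):
            attrs = dict(ATTR.findall(tag))
            if 'id' not in attrs:
                continue  # nothing to link to, so skip this marker
            ET.SubElement(
                root, 'page',
                href=f"{href}#{attrs['id']}",
                name=attrs.get('title', attrs['id']),
            )
    return ET.tostring(root, encoding='unicode')


print(build_page_map({
    'text/part1.html':
        '<p>x<span epub:type="pagebreak" id="page_ii" title="ii"/>y</p>',
}))
```

A regular expression is used instead of an XML parser so that the scan still works on files that are not well-formed.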

Also, an ePub would normally add a dc:source element to the document metadata to indicate the print source. Presumably KindleGen needs something similar, but I cannot find anything specifically about this. In any case, calibre currently will not retain such an element in the metadata.opf resource of an .azw3 file.

However, since I'm not looking to push my documents through amazon publishing (and thus not using KindleGen), I don't think I actually need to deal with these problems. All I need is some way to mark the page breaks that apnx.py can be programmed to recognize. If I use the ePub page break marker, I can write a script that looks for it in the document as it is actually stored and uses those locations to generate the .apnx file.

The ePub marker meets the criterion that I was looking for earlier. It's invisible, and thus doesn't affect rendering of the book; the presence of "pagebreak" in it makes it easily identifiable and otherwise unlikely to appear in a book; and the id (and the title) are human readable for editing purposes. Further, should someone else be motivated to actually address the problems I identified above (making the process compatible with KindleGen and thus enabling publishing with real page numbers through amazon) a simple find and replace can correct the tag, somewhat future proofing this modification.

Comments on this idea?

kovidgoyal · 02-08-2014, 10:17 PM

IIRC epub:type="pagebreak" is an epub3 specific extension. Currently, almost nothing supports it.

The colon refers to a XML namespace. If you want to use it, you have to declare the epub namespace and make sure the document you are modifying is valid XML. The IDPF just likes to make everyone's life harder by using XHTML instead of plain HTML 5.

Other than that, it's fine, although note that inserting an empty span tag into a document can have side effects, since the document can use CSS selectors based on tag counts.

As I said before, the only sure way of modifying the document with no side effects is to use data- attributes. But that has the limitation of restricting page markers to existing tag locations.

DoctorOhh · 02-09-2014, 08:56 PM

Moderator Notice
I failed to comprehend what I read and previously moved this thread out of the development forum. Upon review I was wrong. The thread has been restored to its original location.

rpspringuel · 02-14-2014, 12:01 PM

Quote:

Originally Posted by kovidgoyal

IIRC epub:type="pagebreak" is an epub3 specific extension. Currently, almost nothing supports it.

Well, my reason for using it isn't because it's currently supported but rather to try and future proof my work and make it easier on the next guy who wants to expand on this. Since I'm making an admitted hack on how azw3 works and thus could do any number of things, I might as well use something that will be nicer on the next guy.

Quote:

Originally Posted by kovidgoyal

The colon refers to a XML namespace. If you want to use it, you have to declare the epub namespace and make sure the document you are modifying is valid XML. The IDPF just likes to make everyone's life harder by using XHTML instead of plain HTML 5.

Declaring the namespace isn't that hard. I just need to add xmlns:epub="http://www.idpf.org/2007/ops" in the right place.

Unfortunately the editor currently doesn't know what to do with this declaration though. If added to the metadata tag in the metadata file (where several other namespaces are declared) then the declaration is lost in a save/close/reopen cycle. Uses of the namespace in the text files are unaffected (: gets converted to u0003a). If I try to use one of the other namespaces that are declared in the same place (dc, opf, calibre) the character swap still happens. If I declare the epub namespace within the html tag of a text document, then the declaration is removed and the namespace is stripped from the tags where it is used (i.e. epub:type="pagebreak" becomes type="pagebreak"). This behavior is all specific to editing azw3 files; editing ePubs exhibits none of these behaviors (ePubs even retain the : when the namespace hasn't been declared).

As for the file being valid XML, isn't that a given? I understood azw3 to be an amazon specific compilation of ePub. Since ePub files have to be valid XML (or more specifically XHTML) shouldn't an azw3 file be valid XML? Am I missing something?
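To illustrate why the namespace declaration matters to an XML parser, here is a small sketch using Python's standard `xml.etree.ElementTree` (this is not what calibre's editor does internally, and `find_pagebreak_ids` is a hypothetical helper). With the declaration present the marker can be found by its qualified attribute name; without it the document is not well-formed and parsing fails.

```python
import xml.etree.ElementTree as ET

EPUB_NS = 'http://www.idpf.org/2007/ops'
# ElementTree spells a namespaced attribute as "{namespace-uri}localname".
EPUB_TYPE = f'{{{EPUB_NS}}}type'


def find_pagebreak_ids(xhtml):
    """Return the id of every element marked epub:type="pagebreak"."""
    root = ET.fromstring(xhtml)
    return [el.get('id') for el in root.iter()
            if el.get(EPUB_TYPE) == 'pagebreak']


declared = (
    '<html xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:epub="http://www.idpf.org/2007/ops"><body>'
    '<p>End of page i.<span epub:type="pagebreak" id="page_ii" title="ii"/>'
    ' Start of page ii.</p></body></html>'
)
# Same document with the epub namespace declaration removed.
undeclared = declared.replace(' xmlns:epub="http://www.idpf.org/2007/ops"', '')

print(find_pagebreak_ids(declared))
try:
    find_pagebreak_ids(undeclared)
except ET.ParseError as err:
    print('not well-formed:', err)
```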

Quote:

Originally Posted by kovidgoyal

Other than that, it's fine, although note that inserting an empty span tag into a document can have side effects, since the document can use CSS selectors based on tag counts.

As I said before, the only sure way of modifying the document with no side effects is to use data- attributes. But that has the limitation of restricting page markers to existing tag locations.

Unfortunately I think this is a chance that I'll have to live with. If I force the pagebreak markers to use existing tags, I'll have to move them from where they actually occur, which kind of defeats the purpose of what I want to do in the first place. Since this is based on the ePub standard (which also uses span tags) that would imply that using CSS selectors based on tag counts would not be recommended in this instance anyway.

kovidgoyal · 02-14-2014, 11:03 PM

Quote:

Originally Posted by rpspringuel

Declaring the namespace isn't that hard. I just need to add xmlns:epub="http://www.idpf.org/2007/ops" in the right place.

That's not the hard part, see below.

Quote:

Unfortunately the editor currently doesn't know what to do with this declaration though. If added to the metadata tag in the metadata file (where several other namespaces are declared) then the declaration is lost in a save/close/reopen cycle. Uses of the namespace in the text files are unaffected (: gets converted to u0003a). If I try to use one of the other namespaces that are declared in the same place (dc, opf, calibre) the character swap still happens. If I declare the epub namespace within the html tag of a text document, then the declaration is removed and the namespace is stripped from the tags where it is used (i.e. epub:type="pagebreak" becomes type="pagebreak"). This behavior is all specific to editing azw3 files; editing ePubs exhibits none of these behaviors (ePubs even retain the : when the namespace hasn't been declared).

Since azw3 does not support the epub namespace or indeed the epub spec in any form, that's hardly surprising.

Quote:

As for the file being valid XML, isn't that a given? I understood azw3 to be an amazon specific compilation of ePub. Since ePub files have to be valid XML (or more specifically XHTML) shouldn't an azw3 file be valid XML? Am I missing something?

Ah the innocence of youth. The chances of encountering a valid XML file, let alone a valid XHTML file in the wild are about as high as the chances of encountering valid XHTML on the web. So if you want to use epub:type the onus is on you to make sure that the file you are outputting somehow magically becomes valid XML from tag soup. calibre has nearly 10,000 lines of code dedicated to the task of taking tag soup and outputting valid XML. So unless your book has been previously processed by calibre and then not further processed by anything else, you are likely to be out of luck.

Quote:

Since this is based on the ePub standard (which also uses span tags) that would imply that using CSS selectors based on tag counts would not be recommended in this instance anyway.

The epub standard does not use span tags. The standard simply declares an attribute that can be placed on (almost) any tag. And the selectors I am referring to are selectors like first-child, last-child, nth-child which will break if you change the number of children by inserting a new tag.
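The side effect on child-count selectors can be seen in a small sketch with Python's standard `xml.etree.ElementTree` (the markup and the `last_child_tag` helper are made up for illustration). A rule such as `p > b:last-child` would apply to the first paragraph but stop applying once a marker span is appended after the `b` element.

```python
import xml.etree.ElementTree as ET


def last_child_tag(xhtml):
    """Return the tag name of the last child element, or None if there is none."""
    children = list(ET.fromstring(xhtml))
    return children[-1].tag if children else None


before = '<p><em>one</em> and <b>two</b></p>'
# Same paragraph with an empty page marker inserted at the end.
after = '<p><em>one</em> and <b>two</b><span data-page="2"/></p>'

print(last_child_tag(before))
print(last_child_tag(after))
```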

rpspringuel · 02-21-2014, 04:12 PM

Quote:

Originally Posted by kovidgoyal

Since azw3 does not support the epub namespace or indeed the epub spec in any form, that's hardly surprising.

Didn't mean to imply it was surprising. I was more just creating a record of the information I'd found. I'm not very good at remembering all these sort of details otherwise.

Quote:

Originally Posted by kovidgoyal

Ah the innocence of youth. The chances of encountering a valid XML file, let alone a valid XHTML file in the wild are about as high as the chances of encountering valid XHTML on the web. So if you want to use epub:type the onus is on you to make sure that the file you are outputting somehow magically becomes valid XML from tag soup. calibre has nearly 10,000 lines of code dedicated to the task of taking tag soup and outputting valid XML. So unless your book has been previously processed by calibre and then not further processed by anything else, you are likely to be out of luck.

Well, this is for books where I've manually edited in the page breaks so I can actually expect that calibre has been the processor. Indeed, since amazon uses a side-along file I don't expect them to be there otherwise.

Quote:

Originally Posted by kovidgoyal

The epub standard does not use span tags. The standard simply declares an attribute that can be placed on (almost) any tag. And the selectors I am referring to are selectors like first-child, last-child, nth-child which will break if you change the number of children by inserting a new tag.

I'm referring to this part of the standard, which does specify that page breaks should use span or div tags (I missed that div tags could also be used).

Anyway, it looks like I've got enough information to get to work now.

rpspringuel · 05-08-2014, 05:50 PM

Okay, so I finally got around to doing something here and have come up with something that appears to work for me.

What I ended up doing was as follows:

Instead of writing a new function (and thus having to figure out how to get calibre to call it) I hijacked the get_pages_exact function that already existed in apnx.py. I did not eliminate the code that was already there, but rather modified it so that if the incoming page_count is negative, my new code is run. If the page_count was positive, the original code is run. I'm open to changing this, but would need help figuring out how to tell calibre what the new function is and under what circumstances to use it.
My code scans the file looking for tags which contain "pagebreak". I chose to only look for that string because it was common to both the ePub standard I was thinking of using above and with the mobi tag "<mbp:pagebreak/>"* which is used to force manual page breaks in mobi files. However, since I look at every tag and only look for "pagebreak" and not anything else, it's also possible to use things like 'data-pagebreak="2"' in an existing tag to mark a page break (indeed, that's what I ended up doing for my test book).
While I was originally going to try to differentiate between page numbering sequences (like i, ii, iii, etc. for front matter and 1, 2, 3, etc. for main matter) I determined that doing so would require a more significant rewrite of John's work and thus decided to forgo that. As a result, the apnx files produced when my code is run start at 1 and run sequentially just like John's do.
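The scanning approach described above can be sketched as follows. This is not the attached apnx.py, just a minimal illustration of the idea: `find_marked_pages` is a hypothetical name, the sample markup is made up, and the simple tag pattern assumes attribute values never contain `<` or `>`.

```python
import re

# Any complete tag; the marker test below only looks at its source text.
TAG = re.compile(rb'<[^<>]+>')


def find_marked_pages(text):
    """Return (page number, byte offset) for each tag containing 'pagebreak'."""
    offsets = [m.start() for m in TAG.finditer(text)
               if b'pagebreak' in m.group()]
    # Pages are numbered sequentially from 1; roman numerals are not handled.
    return list(enumerate(offsets, start=1))


sample = (
    b'<p data-pagebreak="1">First page.</p>'
    b'<p>Some text <mbp:pagebreak/> more text.</p>'
    b'<p>The word pagebreak in running text is not a marker.</p>'
)

for number, offset in find_marked_pages(sample):
    print(number, offset)
```

Because only the substring is tested, the match should not depend on whether the colon in a namespaced tag survives a save.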

Attached are my version of apnx.py (zipped up) and a book in azw3 format with the pages already marked.

Edit: I've now attached a new book which I believe to be out of copyright. Published in 1926, the original author died in 1868 and the translator died in 1902. It has 93 pages of about 27 lines of ~47 characters. The book also has 14 pages of front matter which are not included in the page count. I've only marked the pages in the main body, so you should get 93 pages using my code with page 1 occurring after the table of contents.

I'd appreciate it if others could test it out and provide advice on the implementation.

*I should note that the "<mbp:pagebreak/>" tag suffers from the same problem with the colon being replaced by u0003a that I described earlier.

Copyrighted material may not be posted on Mobileread. Removed.

rpspringuel · 05-08-2014, 08:30 PM

So, it seems my understanding of copyright law was flawed and my test book was still in copyright. Sorry about that.

I'll go looking for something that's out of copyright and create a new test book for those interested in testing.
