MobileRead Forums - View Single Post

rpspringuel · 02-08-2014, 04:21 PM

Hmm... Reverse engineering from a hex representation is beyond my immediate abilities and I don't have the time at the moment to learn.

However, looking at the information in the wiki (and pages it links to), I might have a close substitute (based, it seems, largely on how ePub does it).

Within the document, pages are marked as follows:
<span epub:type="pagebreak" id="page_ii" title="ii"/>
Problem: When an .azw3 file is saved and then reopened by calibre, the ":" character in the first attribute is converted to "U0003A". Not sure what effect, if any, this will have. Fixing it probably involves modifying the .azw3 encoder/decoder to recognize the ":" character as valid within the span tag (at least within this context).

For compatibility with KindleGen, a page_map.xml file would have to be added to the book (just as in an ePub). Given the above markers in the text, a script could easily be written to generate this file automatically.
Problem: When an .azw3 file is saved and then reopened by calibre, this file is currently lost. Again, fixing this probably involves modifying the .azw3 encoder/decoder to recognize it as a valid part of the file.
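
A rough sketch of the kind of script meant here, in Python. It assumes the Adobe-style page-map layout (a page-map root holding page elements with name and href attributes), which may not be exactly what KindleGen expects. The function names and the file name in the usage line are made up for illustration.

```python
import re
from xml.sax.saxutils import quoteattr

# Matches the marker shown above. It also accepts the "U0003A" form that
# appears after a calibre save/reopen cycle.
PAGEBREAK_RE = re.compile(
    r'<span\b(?=[^>]*\bepub(?::|U0003A)type="pagebreak")[^>]*>'
)
ID_RE = re.compile(r'\bid="([^"]*)"')
TITLE_RE = re.compile(r'\btitle="([^"]*)"')


def find_pages(html):
    """Return (id, title) pairs for each page-break marker, in document order."""
    pages = []
    for match in PAGEBREAK_RE.finditer(html):
        tag = match.group(0)
        id_match = ID_RE.search(tag)
        title_match = TITLE_RE.search(tag)
        if id_match and title_match:
            pages.append((id_match.group(1), title_match.group(1)))
    return pages


def build_page_map(files):
    """Build page-map XML from a dict of {href: html text} in reading order.

    ASSUMPTION: the Adobe-style page-map layout; not verified against KindleGen.
    """
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        '<page-map xmlns="http://www.idpf.org/2007/opf">',
    ]
    for href, html in files.items():
        for page_id, title in find_pages(html):
            target = quoteattr(href + "#" + page_id)
            lines.append("  <page name=%s href=%s/>" % (quoteattr(title), target))
    lines.append("</page-map>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file name and content, for illustration only.
    book = {
        "part0001.html": '<p>a<span epub:type="pagebreak" id="page_ii" title="ii"/>b</p>'
    }
    print(build_page_map(book))
```

A regular expression is used instead of an XML parser on purpose: the mangled "epubU0003Atype" attribute and the missing namespace declaration could trip up a strict parser.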

Also, an ePub would normally add a dc:source element to the document metadata to indicate the print source. Presumably KindleGen needs something similar, but I cannot find anything specifically about this. In any case, calibre currently will not retain such an element in the metadata.opf resource of an .azw3 file.

However, since I'm not looking to push my documents through Amazon publishing (and thus not using KindleGen), I don't think I actually need to deal with these problems. All I need is some way to mark the page breaks that apnx.py can be programmed to recognize. If I use the ePub page break marker, I can write a script that looks for it in the document as calibre actually stores it and uses those locations to generate the .apnx file.
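
As a first step toward that script, here is a minimal Python sketch that only locates the markers and reports their byte offsets in the raw markup. It assumes byte offsets are the right unit and that the markup is UTF-8. The function name is hypothetical, and nothing here touches calibre's real apnx.py or writes the .apnx format.

```python
import re

# Accept both the original attribute and the "U0003A" form left behind
# after a calibre save/reopen cycle.
PAGEBREAK_RE = re.compile(
    rb'<span\b(?=[^>]*\bepub(?::|U0003A)type="pagebreak")[^>]*>'
)
TITLE_RE = re.compile(rb'\btitle="([^"]*)"')


def pagebreak_offsets(raw):
    """Return [(byte_offset, page_label), ...] for each marker in raw bytes."""
    found = []
    for match in PAGEBREAK_RE.finditer(raw):
        title = TITLE_RE.search(match.group(0))
        # ASSUMPTION: markup is UTF-8; a marker without a title gets "".
        label = title.group(1).decode("utf-8") if title else ""
        found.append((match.start(), label))
    return found


if __name__ == "__main__":
    sample = (
        b'<p>Preface<span epub:type="pagebreak" id="page_ii" title="ii"/> text</p>'
        b'<p>More<span epubU0003Atype="pagebreak" id="page_iii" title="iii"/></p>'
    )
    for offset, label in pagebreak_offsets(sample):
        print(offset, label)
```

Whether these offsets can be used directly or need adjusting (for example, to count only text bytes) depends on how the .apnx page table is defined, which this sketch does not address.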

The ePub marker meets the criteria I was looking for earlier. It's invisible, and thus doesn't affect rendering of the book; the presence of "pagebreak" in it makes it easily identifiable and otherwise unlikely to appear in a book; and the id (and the title) are human-readable for editing purposes. Further, should someone else be motivated to actually address the problems I identified above (making the process compatible with KindleGen and thus enabling publishing with real page numbers through Amazon), a simple find and replace can correct the tag, somewhat future-proofing this modification.
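
The find and replace could be as small as the following Python sketch. It assumes the only damage is ":" becoming "U0003A" inside the page-break span, as described above. The function name is made up.

```python
import re

# Only touch the attribute name inside a page-break span, so that a literal
# "U0003A" elsewhere in the book is left alone.
MANGLED_RE = re.compile(r'(<span\b[^>]*\bepub)U0003A(type="pagebreak")')


def restore_epub_type(html):
    """Turn epubU0003Atype="pagebreak" back into epub:type="pagebreak"."""
    return MANGLED_RE.sub(r"\1:\2", html)


if __name__ == "__main__":
    mangled = '<span epubU0003Atype="pagebreak" id="page_ii" title="ii"/>'
    print(restore_epub_type(mangled))
```

Markers that are already correct pass through unchanged, so the replacement can safely be run more than once.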

Comments on this idea?
