MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

KevinH · 05-19-2015, 10:24 AM

Hi,

Thanks for the detailed bug report.

Quote:

Originally Posted by elmimmo

[*]XHTML documents in my source have the size of their canvas declared in their head block, as required by the EPUB 3 spec, with:

Code:

<meta name="viewport" content="width=794, height=1122"/>

with same values as those written by KindleUnpack in the output OPF's metadata:

Code:

<meta name="original-resolution" content="794x1122"/>

The latter is what Kindle requires but does nothing in EPUB 3 ereaders. The former is EPUB 3's way but does nothing in Kindle. Both are needed for the EPUB 3 to render properly in EPUB 3 ereaders and be safely convertible (back) to mobi.

I assume that the internal XHTML of most mobi files will lack EPUB 3's viewport's size declaration since Kindle makes no use of it (so no reason for the original author to have added it).

Understood. This was your bug report, correct? I wonder why kindlegen removes the meta values in the xhtml page head tag if they do no damage. If kindlegen does not remove them, then how we handle things is correct. Please understand, there is no guarantee that if an invalid epub3 is input into kindlegen, that you will unpack to a valid one. In fact in most cases, you will unpack to an invalid epub that will need to be fixed. If those meta viewport tags are actually removed by kindlegen, I will figure out a way to add them back, but if kindlegen leaves them untouched, the code is correct as it stands.
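
If the viewport tags do turn out to be stripped, one way to add them back might look like the sketch below. It assumes the "WIDTHxHEIGHT" string from original-resolution is available and that the page has a plain </head> tag; both function names are hypothetical and not part of KindleUnpack.

```python
import re


def viewport_meta_from_resolution(original_resolution):
    """Build an EPUB 3 viewport meta tag from a Kindle 'WIDTHxHEIGHT' string.

    Hypothetical helper for illustration only.
    Returns None when the value cannot be parsed.
    """
    match = re.fullmatch(r"\s*(\d+)\s*[xX]\s*(\d+)\s*", original_resolution)
    if match is None:
        return None
    width, height = match.groups()
    return '<meta name="viewport" content="width=%s, height=%s"/>' % (width, height)


def add_viewport_if_missing(xhtml, original_resolution):
    """Insert the viewport meta before </head> unless one is already present."""
    if re.search(r'<meta[^>]+name=["\']viewport["\']', xhtml, re.IGNORECASE):
        return xhtml  # already declared in the source, leave it untouched
    meta = viewport_meta_from_resolution(original_resolution)
    if meta is None or "</head>" not in xhtml:
        return xhtml  # nothing sensible to add, return the page unchanged
    return xhtml.replace("</head>", meta + "\n</head>", 1)
```

Pages that already carry a viewport declaration pass through unchanged, so nothing kindlegen left in place would be overwritten.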

Quote:

KindleUnpack should add it if missing when outputting EPUB 3 FXL so that ereaders display them properly.

Please understand KindleUnpack just unpacks what is present and is recapturable from the AZW3, it does not guarantee the output is valid if the input is not valid. It is not going to try and fix things that were errors in the input. It is not a conversion program in and of itself.

Quote:

[*]KindleUnpack is being unnecessarily redundant by adding to EPUB 3 FXL's metadata both <meta name="fixed-layout" content="true" /> and <meta property="rendition:layout">pre-paginated</meta>. Both accomplish the same thing in Kindle, but only the latter does so too for EPUB 3 ereaders. KindleUnpack should therefore only add the latter.

The same thing goes for <meta name="orientation-lock" content="portrait" /> and <meta property="rendition:orientation">portrait</meta>, only the latter being EPUB 3 proper syntax.

Actually, according to the epub3 spec, old style metadata is allowed and should be ignored by an epub3 device. So these will stay, as they do not hurt things and help to document exactly what was present in the source.

Quote:

[*]My source OPF has its ISBN specified in EPUB 3 syntax:

Code:

<dc:identifier id="uid">urn:isbn:9781234567890</dc:identifier>
<meta refines="#uid" property="identifier-type" scheme="onix:codelist5">15</meta>

but KindleUnpack's output EPUB 3 has it like:

Code:

<dc:identifier opf:scheme="ISBN">9781234567890</dc:identifier>

I can't recapture the refines in all cases as they are stripped away in the conversion process, but I can try to correct it so that the opf: prefix is not used in an epub3 in dc tags.

Quote:

Code:

<dc:date opf:event="publication">2011</dc:date>

KindleUnpack should just drop that attribute when outputting EPUB 3. Note that adding it, is, however, valid and actually the right way to do it when outputting EPUB 2.

Will handle as above.
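
A minimal sketch of that cleanup, assuming the OPF metadata is available as a string when an epub3 is being written. The function name is hypothetical; it only drops opf:-prefixed attributes from dc: elements and does not try to recreate the refines metadata.

```python
import re

# Matches an opf:-prefixed attribute such as opf:scheme="ISBN" or opf:event="publication".
_OPF_ATTR = re.compile(r'\s+opf:[\w.-]+\s*=\s*(?:"[^"]*"|\'[^\']*\')')
# Matches the opening tag of any dc: element (closing tags start with "</" and are skipped).
_DC_OPEN_TAG = re.compile(r'<dc:[\w.-]+\b[^>]*>')


def strip_opf_attributes(opf_metadata):
    """Remove opf:* attributes from dc: elements, leaving everything else as is."""
    def clean(match):
        return _OPF_ATTR.sub("", match.group(0))
    return _DC_OPEN_TAG.sub(clean, opf_metadata)


sample = (
    '<dc:identifier opf:scheme="ISBN">9781234567890</dc:identifier>\n'
    '<dc:date opf:event="publication">2011</dc:date>'
)
cleaned = strip_opf_attributes(sample)
```

For epub2 output the attributes would be kept, since they are valid there.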

Quote:

[*]My source NCX (document which is not part of the EPUB 3 spec, but some consider good practice to add for backwards compatibility) is simpler than KindleUnpack's output. The extra stuff that KindleUnpack adds is cruft (irrespective of whether it is exporting to EPUB 2 or 3). Some parts of it:

Code:

<meta content="1" name="dtb:depth"/>
<meta content="mobiunpack.py" name="dtb:generator"/>
<meta content="0" name="dtb:totalPageCount"/>
<meta content="0" name="dtb:maxPageNumber"/>

are innocuous but still pointless, since no EPUB ereader (nor Kindle) needs or supports them in any way; some other parts, particularly the DOCTYPE and each playOrder attribute, are actually harmful IMHO, as they complicate post-editing by hand.

The DOCTYPE on the ncx is correct as it stands, and epubcheck 4 has fixed the bug that was in epubcheck 3. I will look at the DAISY spec to see about the extra metadata. If the spec needs it, it stays; otherwise I will remove it.

Quote:

Besides, if KindleUnpack is adding the NCX to EPUB 3 FXL for backwards compatibility purposes, then it should also generate the file com.apple.ibooks.display-options.xml that my source contains, as that file was Apple's method of tagging an EPUB 2 as a FXL book before EPUB officially embraced FXL in EPUB 3.0.1 (which uses another method to do so, but including that legacy file does not make the EPUB 3 file non-valid).

Sorry, nothing ibooks specific will be added/supported in any way. None of it is spec.

Quote:

Still, I, for one, do not see value in generating EPUB 2 FXL backwards compatibility cruft in EPUB 3 FXL (the NCX, the spine's toc attribute, the OPF's guide and the file com.apple.ibooks.display-options.xml). FXL was originally a non-standard Apple extension to EPUB 2 which Apple itself now considers obsolete in favor of proper EPUB 3 FXL, and all major platforms that once supported Apple's EPUB 2 FXL now support EPUB 3 FXL.

Again, the guide is allowed by the epub3 spec, as is all of the old style metadata. And no ibooks anything will be supported, as none of it is spec.

In the future, when I have more time, I may add an option for keeping or removing all of that but right now I am more worried about correctness, not what any one person considers "cruft". Sorry about that. The real purpose of the KindleUnpack tool was to help reverse engineer current and future mobi changes. That is its primary role. It is not a "converter" per se.

Quote:

[*]The language of the source ebook is Spanish, as specified in the OPF's metadata:

Code:

<dc:language>es</dc:language>

yet KindleUnpack's output's is:

Code:

<dc:language>en</dc:language>

That is definitely a bug. It should be exactly what the Mobi language code says as generated by kindlegen. That is what we (should be) outputting, and we should only default to "en" if no language code is found. I will look into this.
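
The intended behaviour can be sketched roughly as follows, assuming the Mobi header carries a Windows-style primary language ID. The table is a small illustrative subset, the function name is hypothetical, and returning "und" for an unmapped code is a choice of this sketch rather than anything KindleUnpack does.

```python
# Illustrative subset of Windows-style primary language IDs; a real table is much larger.
PRIMARY_LANGUAGE_TAGS = {
    0x07: "de",
    0x09: "en",
    0x0A: "es",
    0x0C: "fr",
    0x10: "it",
}


def dc_language(primary_lang_id, default="en"):
    """Return the dc:language value for a primary language ID.

    The default is used only when no language code was found at all
    (None or 0); a code that is present but not in the table is reported
    as "und" (undetermined) instead of silently becoming English.
    """
    if not primary_lang_id:
        return default
    return PRIMARY_LANGUAGE_TAGS.get(primary_lang_id, "und")
```

With this shape, a Spanish book can only come out as "en" if the language code is genuinely missing from the header.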

Quote:

[*]In nav.xhtml, the namespace URI for the epub namespace is wrong. KindleUnpack's current output is:

Code:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2011/epub" …>

while it should be:

Code:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" …>

I will verify it against the spec and change it if needed.
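
For reference, the epub prefix in EPUB 3 content documents is bound to http://www.idpf.org/2007/ops. If the generated nav needs correcting, a sketch of the fix could be as small as this; the function name is hypothetical and the nav is assumed to be held as a string.

```python
import re

EPUB_OPS_NS = "http://www.idpf.org/2007/ops"

# Captures the xmlns:epub declaration with either quote style.
_EPUB_XMLNS = re.compile(r'(xmlns:epub\s*=\s*)(["\'])(.*?)\2')


def fix_epub_namespace(xhtml):
    """Rewrite whatever URI is bound to xmlns:epub to the EPUB ops namespace."""
    def rebind(match):
        prefix, quote = match.group(1), match.group(2)
        return prefix + quote + EPUB_OPS_NS + quote
    return _EPUB_XMLNS.sub(rebind, xhtml)


wrong = ('<html xmlns="http://www.w3.org/1999/xhtml" '
         'xmlns:epub="http://www.idpf.org/2011/epub">')
fixed = fix_epub_namespace(wrong)
```

The default XHTML namespace is left alone, and a nav that already has the right URI comes back unchanged.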

Quote:

[*]The source nav.xhtml has a meta tag declaring the text encoding:

Code:

<meta charset="UTF-8"/>

but KindleUnpack's output nav.xhtml does not. That might produce incorrect rendering of non-ASCII characters when accessing this book's TOC. The same goes for cover_page.xhtml. While the latter contains no text, it's still good practice to declare the encoding if anything to account for potential post-editing.

Again, we only are going to reproduce what we can from the actual metadata provided by the azw3. So if the nav had it correct on input to kindlegen, since kindlegen always specifies the charset in the header, this should be addable/fixable.

Quote:

[*]In spite of what the Kindle Publishing Guidelines (4.1, 5.1, 5.6) document claims, this does nothing whatsoever:

Code:

<meta name="RegionMagnification" content="true" />

AFAIK never has, and is just bloat. Kindlegen and Kindle Previewer will instead parse all HTML documents upon conversion in search for explicit region magnification markup, and will label the ebook accordingly irrespective of what that metadata value says.

Again sorry but, if it exists in the metadata in the azw3, it will be output. That is the whole point of KindleUnpack. You may consider it "cruft", I consider it documenting the metadata that is provided by the azw3 to the extent it can.

Please understand, KindleUnpack is not an epub2 or 3 converter. It is an unpacker that takes the compiled format of the azw3 and tries to create an epub-like structure to document what it finds and for people to later edit and fix.

Quote:

[*]I am not aware that the OPF metadata <meta name="output encoding" content="utf-8" /> in KindleUnpack's output accomplishes anything whatsoever, resulting in mere bloat, but it might be that I am not well informed.

It is documenting the charset provided in the azw3 header.

Quote:

[*]If the source has no spine item with the attributes properties="page-spread-right" or properties="page-spread-left", it might be considered appropriate that KindleUnpack adds <meta property="rendition:spread">none</meta> to the output OPF metadata of an EPUB 3 FXL, so that EPUB 3 ereaders like iBooks do not group pages in 2-page spreads just like Kindle doesn't.

None is the default, is it not?

I'll pick back up commenting on the remainder when I get more time.

Take care,

KevinH