KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 61

tkeo · 07-12-2014, 10:01 AM

Hi,

Quote:

Originally Posted by AcidWeb

orientation-lock have three valid values: portrait, landscape and none

I think you are right, but kindlegen.exe v2.9 does not generate an EXTH orientation-lock of none from

Code:

<meta property="rendition:orientation">auto</meta>

If rendition:orientation is auto, no EXTH orientation-lock is added.
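
To make the reported behaviour concrete, here is a minimal Python sketch of the mapping. The function name is hypothetical, and this is neither kindlegen's nor KindleUnpack's code; it only encodes what was observed above for kindlegen v2.9, and it assumes that portrait and landscape pass through unchanged.

```python
# Hypothetical helper mirroring the behaviour reported for kindlegen v2.9.
def orientation_lock_for(rendition_orientation):
    """Return the EXTH orientation-lock value, or None if no EXTH is written."""
    value = rendition_orientation.strip().lower()
    if value in ("portrait", "landscape"):
        # Assumption: these two values are copied through as-is.
        return value
    # 'auto' was observed to produce no orientation-lock entry at all;
    # treating any other value the same way is an assumption.
    return None
```

With this sketch, an OPF that says auto yields None, meaning no orientation-lock EXTH would be expected in the output.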

Thanks,

KevinH · 07-12-2014, 10:03 AM

Hi Paul and DiapDealer ( and anyone else with Kindle ebook collections ),

Do you own an Amazon ebook that uses EXTH 508, 517 or 522? Do you know who reverse engineered those EXTH values?

One quick thing to try is to cd to your My Kindle Content directory and run DumpMobiHeader_v016 (or later) on the *.azw ebook files. It will dump the EXTH even if the ebook is DRM'd since the headers themselves are not encrypted.

I then redirect all the output to a big text file and then use grep to find those tag values.
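
As an alternative to grepping a text dump, the check can be sketched directly in Python. This assumes you already have the raw EXTH block as bytes (locating it inside the file is not shown) and that it uses the same layout as the parseMetaData excerpt later in this thread: 'EXTH', a length, an item count, then records of id, size and content. The function name is hypothetical.

```python
import struct

def exth_ids_present(exth, wanted=(508, 517, 522)):
    """Return the sorted subset of `wanted` ids that occur in a raw EXTH block."""
    if exth[:4] != b"EXTH":
        raise ValueError("not an EXTH block")
    _length, num_items = struct.unpack(">LL", exth[4:12])
    pos = 12  # records start right after the 12-byte EXTH header
    found = set()
    for _ in range(num_items):
        rec_id, size = struct.unpack(">LL", exth[pos:pos + 8])
        if size < 8:
            raise ValueError("corrupt EXTH record size")
        if rec_id in wanted:
            found.add(rec_id)
        pos += size  # size includes the 8-byte record header
    return sorted(found)
```

Only the record headers are read, so the sketch never needs to decode the content of a record.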

I would love to know if any of those supposed file-as EXTH values are ever set. If so, I will try to grab a sample of that book to see if I can figure out how and why they were set.

Thanks,

KevinH

tkeo · 07-12-2014, 10:11 AM

Quote:

Originally Posted by KevinH

I don't know where those tag meanings came from or who reverse engineered them. I have searched and none of my Amazon ebooks have any of those EXTH tags set.

I have books that have those EXTH tags. They correspond to yomigana (pronunciations) of kanji characters in Japanese.

tkeo · 07-12-2014, 10:48 AM

Hi,

The sample of the following book has the Creator file-as and Title file-as EXTH entries. The sample is DRM-free.

Zenyaku Genji-Monogatari (Japanese Edition) ASIN: B00BHHKABO

http://www.amazon.co.jp/%E5%85%A8%E8...89%A9%E8%AA%9E
http://www.amazon.com/Zenyaku-Genji-...s=genji+kindle

Thanks,

tkeo · 07-12-2014, 11:17 AM

Hi,

I have modified KindleUnpack v0.72z to fix bugs and to simplify the code.

Except for the 'refines' tags and for excluding epub3 tags from epub2 output, I think I have done everything I had in mind.

Thanks,
tkeo

tkeo · 07-12-2014, 08:45 PM

Hi Kevin,

Quote:

Originally Posted by KevinH

Do you own an Amazon ebook that uses EXTH 508, 517 or 522? Do you know who reverse engineered those EXTH values?

I added those EXTH values in KindleUnpack v0.63.
I had seen somewhere on the internet that file-as meta were used as yomigana (or furigana) when building an epub. So, I thought EXTH 508, 517 and 522 corresponded to the file-as meta of epub3.

Now I have a guess that EXTH 508, 517 and 522 are converted from
<meta name="???kana" content="XXXX"/> or <meta name="???gana" content="XXXX"/>.

Thanks,

tkeo · 07-12-2014, 11:53 PM

Hi Kevin,

Here is another patch for KindleUnpack v0.72z. It includes the patch I posted before. In addition, it has:

- More simplification of mobi_opf.py
- Addition of a print message before makeEPUB() in kindleunpack.py

Take care,
tkeo

tkeo · 07-13-2014, 09:33 AM

Hi,

This is the faster version of mobi_split.py.
I have removed the debug code that was in the version I posted before.

The performance comparison is as follows:

tested mobi file: 26MB 164 images

original: 6.3s
modified: 0.5s

To Kevin,
Please include this in the next official release if possible.

Take care,
tkeo

KevinH · 07-13-2014, 08:28 PM

Hi tkeo,
Have you tested the new mobi_split code to make sure that it is still building mobi7 and azw3 pieces completely correctly? I have not had time to look it over yet, but if you are sure, I will include it.

I also have found a few more bugs in KindleUnpack that I will post a patch for either later tonight my time or tomorrow. I have changes for fixing the <image> tag in the svg mobi_cover to be a single type tag (similar to how the img tag is a single tag) ... it seems kindlegen requires that change; and changes in mobi_k8proc.py to both ignore meta tags and stop searching for id= or the older name= attributes when searching for a link target.

In addition, I want to review the hasNCX variable, as some older mobi 4 versions (and older) do not have an ncx index. In the old days we simply did not create a toc.ncx for them, but somehow over the years that code got modified to always create a toc.ncx even though it will be empty. This will mean further code changes in the mobi_opf to deal with that remaining issue. I would like to fix that as well since your change seems to always believe this will be true but under odd circumstances, it won't be.

I also want to remove the mistaken "file-as" EXTH values in mobi_header.py and set a few new values I have found so as not to confuse others who might use this code as the basis for their own.

Hopefully, I will be able to release a stable version by Tuesday at the latest.

Take care,

KevinH

KevinH · 07-13-2014, 10:38 PM

Hi tkeo,

Here is a patch that takes v072z_test up to v073 (hopefully!). It includes your latest cumulative patch as well as your faster mobi_split patch as well as a few minor bug fixes from my end as described in my previous post. I have also made a few things a bit more consistent in the mobi_header.py code and hopefully have dealt with CoverOffset's that are 0xffffffff as well (given your earlier post on that subject).

I have decided not to play with the hasNCX stuff and not building a toc.ncx for older Mobi 4's until after the stable release as I didn't want to introduce changes that will break things.

Please give it a good testing with all of your Amazon ebooks and let me know if you feel it is now ready for a stable release.

If so, I will make the stable release Tuesday evening my time.

Thanks!

KevinH

tkeo · 07-14-2014, 08:24 AM

Hi Kevin,

Quote:

Originally Posted by KevinH

Have you tested the new mobi_split code to make sure that it is still building mobi7 and azw3 pieces completely correctly? I have not had time to look it over yet, but if you are sure, I will include it.

I have tested with 10 mobi files, 2 of which have HD images and 1 of which has no RESC. The split files are identical to the ones generated by the older mobi_split.py.
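
A check like this can be automated. The following is a minimal sketch, not the procedure actually used for the test above; the function name and the paths you pass in are hypothetical.

```python
import filecmp

def split_outputs_match(old_path, new_path):
    """Return True if the two files have byte-for-byte identical contents."""
    # shallow=False forces a content comparison instead of relying on
    # file size and modification time alone.
    return filecmp.cmp(old_path, new_path, shallow=False)
```

Calling it once for the mobi7 output and once for the azw3 output of each test book, with the old and new mobi_split.py results side by side, would confirm that the faster code changes nothing in the output.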

I have fixed a bug in taginfo_toxml() of mobi_k8resc.py and modified mobi_header.py.

Quote:

I also want to remove the mistaken "file-as" EXTH values in mobi_header.py and set a few new values I have found so as not to confuse others who might use this code as the basis for their own.

I have changed to

508 : 'Unknown_Title_Furigana?_(508)',
517 : 'Unknown_Creator_Furigana?_(517)',
522 : 'Unknown_Publisher_Furigana?_(522)',

in dump_contexth(cpage, extheader).
Those in class MobiHeader are not changed.

Quote:

hopefully have dealt with CoverOffset's that are 0xffffffff as well (given your earlier post on that subject).

I have modified this part too, since int('0xffffffff') cannot be converted to a long integer.

Code:

>>> int('0xffffffff')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '0xffffffff'
>>>
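
For reference, the error arises because int() assumes base 10 when given a string with no base argument. A short sketch of how a hex string can be parsed, should one ever need to (the variable names are only illustrative):

```python
text = "0xffffffff"

# Explicit base 16: the '0x' prefix is accepted.
value = int(text, 16)

# Base 0 asks int() to infer the base from the prefix.
inferred = int(text, 0)

assert value == inferred == 0xffffffff
```

This only matters when the value really is a string; an integer read with struct.unpack can be compared against 0xffffffff directly.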

I attach a patch. Hopefully, it is the final patch!

BTW,
prefs.py has CRLF line ending instead of LF.
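
A quick way to detect and normalise such line endings, as a minimal sketch (the helper names are hypothetical and not part of KindleUnpack):

```python
def has_crlf(data):
    """Return True if the bytes contain at least one CRLF line ending."""
    return b"\r\n" in data

def to_lf(data):
    """Convert every CRLF line ending in a bytes object to a plain LF."""
    return data.replace(b"\r\n", b"\n")
```

Reading the file in binary mode ('rb') before calling these avoids any newline translation by Python itself.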

Take care,
tkeo

KevinH · 07-14-2014, 10:00 AM

Hi tkeo,

Still don't like the comparison against sys.maxint as that changes with machine. I simply want to check for one specific missing value 0xffffffff as we do with the start offset later on in KindleUnpack and many places in the header. I will fix that. If it is some other invalid value, I want to know that and let the program barf appropriately so we figure out how they have changed setting of CoverOffset. I will add my fix to the dump EXTH code as well. Also, do you have a specific testcase you use with that?

Thanks for catching the extra quotes bug in mobi_k8resc.py. I will remove the extra CRs from prefs.py to keep it consistent with the other files.

Edit:

Here is how I am now handling the potentially missing CoverOffset issue (if that is what it even is). I am suspicious that someone has used an improperly written meta data editor and messed up the EXTH size fields somehow. If that is the case, I would rather we fail out as it will help us better detect where and when this is happening.

From mobi_header.py in parseMetaData(self)

Code:

        if self.hasExth:
            extheader=self.exth
            _length, num_items = struct.unpack('>LL', extheader[4:12])
            extheader = extheader[12:]
            pos = 0
            for _ in range(num_items):
                id, size = struct.unpack('>LL', extheader[pos:pos+8])
                content = extheader[pos + 8: pos + size]
                if id in MobiHeader.id_map_strings.keys():
                    name = MobiHeader.id_map_strings[id]
                    addValue(name, unicode(content, codec).encode('utf-8'))
                elif id in MobiHeader.id_map_values.keys():
                    name = MobiHeader.id_map_values[id]
                    if size == 9:
                        value, = struct.unpack('B',content)
                        addValue(name, str(value))
                    elif size == 10:
                        value, = struct.unpack('>H',content)
                        addValue(name, str(value))
                    elif size == 12:
                        value, = struct.unpack('>L',content)
                        # handle special case of missing CoverOffset                                                            
                        if id != 201 or value != 0xffffffff:
                            addValue(name, str(value))
                    else:
                        print "Warning: Bad key, size, value combination detected in EXTH ", id, size, content.encode('hex')
                        addValue(name, content.encode('hex'))
                # advance to the next EXTH record (size includes the 8-byte record header)
                pos += size
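
The size-based decoding and the single-sentinel check can also be sketched as a standalone Python 3 function. This is an illustration only, not the code in mobi_header.py: the function and constant names are hypothetical, and unlike the excerpt above it raises on an unexpected size instead of printing a warning and storing the hex string.

```python
import struct

COVER_OFFSET_ID = 201   # EXTH id special-cased in the excerpt above
MISSING = 0xffffffff    # the one specific value treated as "not set"

# Content length -> struct format. Record sizes 9, 10 and 12 in the excerpt
# include the 8-byte record header, leaving 1, 2 or 4 bytes of content.
_FORMATS = {1: "B", 2: ">H", 4: ">L"}

def decode_exth_number(rec_id, content):
    """Return the decoded integer, or None for a missing CoverOffset."""
    fmt = _FORMATS.get(len(content))
    if fmt is None:
        # Fail loudly so unexpected EXTH data gets noticed.
        raise ValueError("bad EXTH value for id %d: %s" % (rec_id, content.hex()))
    (value,) = struct.unpack(fmt, content)
    if rec_id == COVER_OFFSET_ID and value == MISSING:
        return None
    return value
```

Comparing against the constant 0xffffffff keeps the check independent of the machine, which is the objection raised above to using sys.maxint.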

Thanks,

KevinH

DiapDealer · 07-14-2014, 11:11 AM

Quote:

BTW,
prefs.py has CRLF line ending instead of LF.

Well that's odd ... but almost certainly entirely my fault.

tkeo · 07-15-2014, 08:16 AM

Hi Kevin,

Quote:

Originally Posted by KevinH

I simply want to check for one specific missing value 0xffffffff as we do with the start offset later on in KindleUnpack and many places in the header.

I mistakenly thought the value was the string '0xffffffff' rather than the integer 0xffffffff. So my modification is not necessary.

Thanks,

KevinH · 07-15-2014, 10:22 AM

Hi All,

Attached is KindleUnpack_v073.zip. KindleUnpack version 0.73 is a public release that should be stable (he said hopefully...).

There have been many recent additions and new features, all of which are incorporated into this release:

- RESC parsing, fixed-layout support, cover generation [Thanks to tkeo]

- Unpacking to epub version 3 support if desired [Thanks to tkeo]

- Much faster mobi splitting [Thanks to tkeo]

- Greatly Improved GUI with full preferences support [Thanks to DiapDealer]

- Support for converting PAGE sections into apnx files

- Support for generating real page numbers and page-map.xml from either PAGE sections or associated .apnx files (if and only if that .apnx file was generated from real page numbers and not arbitrary values)

- Support to unpack HDCONTAINER / CRES sections and have them overwrite images that had their resolutions lowered

- lots and lots of bug fixes

Both the command line and GUI interface have been modified to support these new features.

The command line options now available are:

Code:

python kindleunpack.py [-r -s -d -h -i] [-p APNX_FILE] INPUT_FILE OUTPUT_FOLDER


   INPUT_FILE      - path to the desired Kindle/MobiPocket ebook

   OUTPUT_FOLDER   - path to folder where the ebook will be unpacked

Options:

    -h               print this help message

    -i               use HDImages to overwrite lower resolution versions, if present

    -s               split combination mobis into older mobi and mobi KF8 ebooks

    -p APNX_FILE     path to a .apnx file that contains real page numbers associated with an azw3 ebook (optional)
                     Note: many apnx files have arbitrarily assigned page offsets that will confuse KindleUnpack if used

   --epub_version=   specify epub version to unpack to: 2, 3 or A (for automatic), default is 2

    -r               write raw data to the output folder

    -d               dump headers and other debug info to output and extra files
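
For anyone scripting the unpacker, the options above can be assembled into an argument list. This is a minimal sketch using only the flags listed in the help text; the function name and its keyword arguments are hypothetical, and it assumes the script is run as "python kindleunpack.py" from the folder that contains it.

```python
def build_command(input_file, output_folder, raw=False, split=False,
                  dump=False, hd_images=False, apnx_file=None,
                  epub_version=None):
    """Build an argument list for kindleunpack.py from the documented options."""
    if epub_version is not None and epub_version not in ("2", "3", "A"):
        raise ValueError("epub_version must be '2', '3' or 'A'")
    cmd = ["python", "kindleunpack.py"]
    # Simple on/off switches from the help text.
    for enabled, flag in ((raw, "-r"), (split, "-s"),
                          (dump, "-d"), (hd_images, "-i")):
        if enabled:
            cmd.append(flag)
    if apnx_file is not None:
        cmd += ["-p", apnx_file]
    if epub_version is not None:
        cmd.append("--epub_version=" + epub_version)
    # Positional arguments come last, as in the usage line.
    cmd += [input_file, output_folder]
    return cmd
```

The resulting list can be passed to subprocess.call() or a similar function; leaving epub_version as None keeps the default of 2.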

Please give it a good workout and report any bugs here. Hope you all find this useful.

Thanks,

KevinH (for the development team)
