KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 52

KevinH · 06-12-2014, 04:06 PM

Quote:

Originally Posted by AcidWeb

AFAIK only the Fire uses the new HD images.

Wow, that is strange. Hopefully an experimental version of KindleUnpack will help figure things out if it happens again.

pdurrant · 06-12-2014, 05:15 PM

Quote:

Originally Posted by AaronShep

Will there be an AppleScript version with these corrections? I can't seem to find one posted.

Thanks again for everyone's work on this.

The new AppleScript wrapper and Kevin's 0.67 archive can now be found in the first post.

msg7086 · 06-12-2014, 08:19 PM

Quote:

Originally Posted by KevinH

That is strange.

My bet is that the version from KPW somehow was referencing HD images?

KevinH

Sorry, I forgot to mention that the azw3 files downloaded by and stored on the KPW are only several hundred kilobytes each, not the full size (approx. 20+ MB).

I would guess that new KPW / Kindle devices have started to use some new tech and download content partially and progressively. (Just a guess; needs more research.)

pdurrant · 06-13-2014, 01:54 AM

Quote:

Originally Posted by msg7086

Sorry, I forgot to mention that the azw3 files downloaded by and stored on the KPW are only several hundred kilobytes each, not the full size (approx. 20+ MB).

I would guess that new KPW / Kindle devices have started to use some new tech and download content partially and progressively. (Just a guess; needs more research.)

I heard somewhere someone mention getting an azw3 and an azw6 file for some publications.

msg7086 · 06-13-2014, 02:08 AM

Quote:

Originally Posted by pdurrant

I heard somewhere someone mention getting an azw3 and an azw6 file for some publications.

Oh, you are right. I found a small azw3 file under documents/ and a larger azw6 file under *.sdr/.

KevinH · 06-13-2014, 09:15 AM

Hi,
I would be interested in understanding what an azw6 file really is. I wonder if it is either a Palm container holding just images or maybe a renamed zip or tar archive.
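One way to start answering that question is to look at the first bytes of the file. The sketch below is only an illustration: it uses Python's standard `zipfile` and `tarfile` checks and then looks for the type/creator fields that a Palm database header stores at offsets 60 to 67 (a Mobi file carries `BOOKMOBI` there). The function name `sniff_container` and the "palm-like" label are made up for this example, and nothing in the thread confirms what an azw6 file actually contains.

```python
import struct
import tarfile
import zipfile


def sniff_container(path):
    """Return a rough (kind, detail) guess for the file at `path`.

    Hypothetical helper: it only distinguishes zip, tar and
    "looks like a Palm database header".
    """
    if zipfile.is_zipfile(path):
        return ("zip", None)
    if tarfile.is_tarfile(path):
        return ("tar", None)

    with open(path, "rb") as f:
        header = f.read(78)

    if len(header) == 78:
        # Palm database header: 4-byte type at offset 60, 4-byte creator
        # at offset 64, big-endian 16-bit record count at offset 76.
        ident = header[60:68]
        if all(0x20 <= b < 0x7F for b in ident):
            (num_records,) = struct.unpack(">H", header[76:78])
            return ("palm-like", (ident.decode("ascii"), num_records))

    return ("unknown", None)
```

Calling `sniff_container()` on the azw6 file from the `*.sdr/` folder would show whether the Palm-container guess or the renamed-archive guess is closer, but a printable identifier alone is not proof of a Palm database.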

Have you tried playing around with it at all?

KevinH

NiLuJe · 06-13-2014, 11:41 PM

Huh. Color me intrigued. Does anyone have a reproducible method to get such a 'split' file delivered?

tkeo · 06-14-2014, 10:22 AM

Hi,

Quote:

Originally Posted by KevinH

So if you work converting epubs with page information, or use HD Images, I would love to get feedback as to whether it works for you or not.

Please let me know if you are willing to test it, and I will post it.

I have also made an experimental version that extracts HD images. So, I would like to see your code.

Currently, I am refining the preview version for epub3, for bugfixes, clarity and other modifications.
I have changed the parameter to specify the epub version from a number to bit-flags.
Internal functions in mobi_opf.py are split into three: for mobi7/azw4, for epub2 and for epub3. I have also added automatic determination of the epub version.
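As a rough illustration of what a bit-flag parameter with automatic detection could look like, here is a minimal sketch. All names in it (`EpubVersion`, `pick_opf_builder`, `has_epub3_features`) and the flag values are hypothetical; tkeo's preview code is not shown in this thread and may be organised quite differently.

```python
from enum import IntFlag


class EpubVersion(IntFlag):
    # Hypothetical flag values, chosen only for this example.
    EPUB2 = 1
    EPUB3 = 2
    AUTO = 4


def pick_opf_builder(flags, is_mobi8, has_epub3_features=False):
    """Choose which of the three OPF code paths to use (names are made up)."""
    if not is_mobi8:
        # Old mobi7/azw4 content gets its own builder regardless of flags.
        return "mobi7/azw4"
    if flags & EpubVersion.AUTO:
        # Automatic determination: decide from what the book contains.
        return "epub3" if has_epub3_features else "epub2"
    if flags & EpubVersion.EPUB3:
        return "epub3"
    return "epub2"
```

Flags combine with `|`, so a caller could pass `EpubVersion.EPUB2 | EpubVersion.AUTO` to mean "prefer epub2 but detect when possible"; how such a combination should really be resolved is a design decision this sketch makes arbitrarily (AUTO wins).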

KevinH · 06-14-2014, 01:45 PM

Hi tkeo,

This is alpha/beta quality code for testing purposes only.

It adds experimental support for the following:

- properly unpacking PAGE information from kindlegen-generated PAGE sections to create page-map.xml in mobi 8 (this only works for the mobi 8 part)

- creating a new HDImage folder and populating it with HD images, without replacing the corresponding image in either the old mobi or mobi 8

- very, very experimental support for passing in an associated .apnx file with an .azw3 to generate a page-map.xml file.

****Important****, this will only work for apnx files that are not just made-up offsets, but were instead generated from an epub that had actual page information provided (and proper id= tags for the page start positions).

To try unpacking an azw3/apnx combination, simply run:

python kindleunpack.py -d -r -p PATH_TO_APNX PATH_TO_AZW3 PATH_TO_OUTPUT_DIR

Please post any successes or failures here with any of the new features. Please note, this has only been tested and shown to work for one ebook so far.

NOTE: the code for these experimental features has been thrown together quickly to show proof of concept. It really, really needs to be cleaned up before being used for any production purposes. In fact, kindleunpack.py has reached the point where it should be refactored, and soon. Perhaps after tkeo's epub 3 features have been integrated.

Thanks,

KevinH

KevinH · 06-14-2014, 01:50 PM

Also here is a quick and dirty python script to decode apnx files to return page names and offsets into the assembled text file (not the raw markup language file). Its command line parsing is not fully unicode safe (yet). It is merely meant to demonstrate how to decode the apnx file when it has different page numbering schemes.

Not very useful without the assembled_text.dat file but KindleUnpack's mobi_k8proc.py can be easily modified to generate that file.

NOTE: this code was thrown together very quickly to show proof of concept. It really really needs to be cleaned up before being used for any production purposes.

Hope this helps,

KevinH

Doitsu · 06-14-2014, 07:36 PM

Quote:

Originally Posted by KevinH

Also here is a quick and dirty python script to decode apnx files to return page names and offsets into the assembled text file (not the raw markup language file).

Could you eventually use the same code to create a script that'll generate a "real" .apnx file from a book that contains a pagemap instead of the fake one that Calibre generates?

Speaking of pagemaps, the Kindle publishing guidelines don't specify whether KindleGen expects Adobe style pagemaps or ncx based pagemaps.
Can you tell from the reverse engineered books, what kind of pagemap KindleGen expects?

NiLuJe · 06-14-2014, 08:43 PM

@Doitsu: I *vaguely* remember something like that being merged in Calibre recently. Did I dream that?

.

Doitsu · 06-15-2014, 03:24 AM

Quote:

Originally Posted by NiLuJe

@Doitsu: I *vaguely* remember something like that being merged in Calibre recently. Did I dream that?

.

You remember correctly. I did some research and found this related thread. It looks like Kovid updated the .apnx code in release 1.39 to detect embedded <mbp:pagebreak/> tags.

Quote:

Kindle driver: When generating page numbers automatically, add an additional method to detect page boundaries, using the presence of <mbp:pagebreak> tags in the source of the book. You can use this setting by right clicking on the Kindle icon in calibre when the kindle is connected and choosing customize this device.

However, unless I misunderstood the above thread, this appears to be a workaround that only works with Kindle books generated by Calibre, as Amazon appears to use a different method to mark up page numbers.
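To make the idea of detecting page boundaries from <mbp:pagebreak> tags concrete, here is a minimal sketch that records the byte offset of each such tag in a chunk of markup. It is not Calibre's implementation, which is not shown in this thread; the function name `pagebreak_offsets` and the regular expression are assumptions made for this example.

```python
import re

# Matches <mbp:pagebreak>, <mbp:pagebreak/> and variants with attributes.
PAGEBREAK_RE = re.compile(rb"<mbp:pagebreak\b[^>]*>", re.IGNORECASE)


def pagebreak_offsets(markup):
    """Return the byte offset of every <mbp:pagebreak> tag in `markup` (bytes)."""
    return [m.start() for m in PAGEBREAK_RE.finditer(markup)]


sample = b"<p>one</p><mbp:pagebreak/><p>two</p><mbp:pagebreak />"
print(pagebreak_offsets(sample))
```

Each offset could then be treated as the start of a new page. Whether those offsets line up with the positions an .apnx file expects depends on which text stream they are measured against, which this sketch does not address.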

tkeo · 06-15-2014, 09:08 AM

Hi,

Because I don't have any books with PAGE sections, I have only tested the HD image extraction in the experimental version.
It seems to work fine for extracting HD images.

I have investigated the conditions under which kindlegen shrinks images.

The condition for reducing the file size appears to be that the image is larger than 819200 (= 800 x 1024) bytes. One image also had its resolution reduced from 1600 x 2400 to 1347 x 2021; the condition for reducing the resolution is unknown (it is not just the file size). When the resolution is reduced, the html is not modified, so the width and height in the tag no longer correspond to the image.

The results are as follows:

filename     source (bytes)   in Images folder (bytes)   notes
image0.jpg           909934                     773356   has HD image
image1.jpg          1298376                     365360   has HD image, shrunk to 1347x2021
image2.jpg          1572736                     770304   has HD image
image3.jpg           826567                     741256   has HD image
image4.jpg           816893                     816896
cover.jpg           1645305                     770856   has HD image
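The figures above can be checked against the guessed threshold with a few lines of Python. This is only a sanity check of tkeo's hypothesis on these six files, not a description of kindlegen's actual rule; the name `expected_to_shrink` is made up for this example.

```python
THRESHOLD = 800 * 1024  # 819200 bytes, the guessed limit

# name: (source size, size in Images folder, has HD image), from the table above
results = {
    "image0.jpg": (909934, 773356, True),
    "image1.jpg": (1298376, 365360, True),
    "image2.jpg": (1572736, 770304, True),
    "image3.jpg": (826567, 741256, True),
    "image4.jpg": (816893, 816896, False),
    "cover.jpg": (1645305, 770856, True),
}


def expected_to_shrink(source_size, threshold=THRESHOLD):
    """Hypothesis: kindlegen shrinks any image larger than the threshold."""
    return source_size > threshold


for name, (source, stored, has_hd) in results.items():
    agrees = expected_to_shrink(source) == has_hd
    print(f"{name}: source={source} stored={stored} agrees={agrees}")
```

Every row agrees with the hypothesis: the five images above 819200 bytes were shrunk and have an HD copy, while image4.jpg, just under the limit, was left at essentially its original size.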

I attach the test mobi.

KevinH · 06-15-2014, 03:19 PM

Hi,
Kindlegen will work just fine with either a page-map.xml or the page list pageTarget info in the ncx.

Yes, parsing the page-map.xml or the page list could easily be done in Calibre with a little extra work and actual offsets into the calibre equivalent of the assembled text could be generated.

Alternatively, a PAGE section could be added to the generated mobi8 azw3 since the PAGE section and APNX are very very similar.
Someone would have to create a patch for calibre azw3 code but it is not hard.
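For readers unfamiliar with the ncx route, the sketch below reads the pageTarget entries from a pageList using Python's standard ElementTree. The sample ncx fragment, its file names and the function name `read_page_targets` are invented for illustration; turning the resulting targets into real offsets into the assembled text is the "little extra work" mentioned above and is not attempted here.

```python
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# Made-up sample; a real toc.ncx also has head, docTitle and navMap elements.
SAMPLE_NCX = """
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <pageList>
    <pageTarget id="p_i" type="front" value="1" playOrder="1">
      <navLabel><text>i</text></navLabel>
      <content src="text/part0000.xhtml#page_i"/>
    </pageTarget>
    <pageTarget id="p_1" type="normal" value="1" playOrder="2">
      <navLabel><text>1</text></navLabel>
      <content src="text/part0001.xhtml#page_1"/>
    </pageTarget>
  </pageList>
</ncx>
"""


def read_page_targets(ncx_text):
    """Return (page label, page type, target href) for each pageTarget."""
    root = ET.fromstring(ncx_text)
    pages = []
    for target in root.findall("./ncx:pageList/ncx:pageTarget", NCX_NS):
        label = target.findtext("./ncx:navLabel/ncx:text", default="", namespaces=NCX_NS)
        content = target.find("./ncx:content", NCX_NS)
        src = content.get("src") if content is not None else None
        pages.append((label.strip(), target.get("type"), src))
    return pages


for page in read_page_targets(SAMPLE_NCX):
    print(page)
```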

Quote:

Originally Posted by Doitsu

Could you eventually use the same code to create a script that'll generate a "real" .apnx file from a book that contains a pagemap instead of the fake one that Calibre generates?

Speaking of pagemaps, the Kindle publishing guidelines don't specify whether KindleGen expects Adobe style pagemaps or ncx based pagemaps.
Can you tell from the reverse engineered books, what kind of pagemap KindleGen expects?
