KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 53

ATimson · 06-15-2014, 10:27 PM

Quote:

Originally Posted by NiLuJe

Huh. Color me intrigued. Does anyone have a reproducible method to get delivered such a 'split' file?

This is one example of such a book that gets delivered split to my Paperwhite 2; let me know if you want me to look for more in my collection.

KevinH · 06-17-2014, 12:52 PM

Hi All,

Attached is a new version of KindleUnpack, v071.

New features include:

- HDImages are now parsed and extracted. Ebook authors can choose
to use them to manually replace non-HD images if they so desire
(see the new HDImages folder)

- kindlegen generated PAGE sections are now used to create a
proper page-map.xml in the Mobi 8 section if present in the .mobi

- experimental support for page-maps contained in associated APNX files
Only for AZW3 (Mobi 8) ebooks

- NOTE: Many apnx files are just arbitrary page start offsets and will
therefore just confuse KindleUnpack. If the APNX was generated
based on actual page start positions (with the proper id_tags)
KindleUnpack stands a good chance of dealing with them
(compare them against the printed book to see if they are real)

- CONT Headers are now recognized and their associated EXTH metadata
can be dumped (using the dump option).

- KindleUnpack.pyw (Tk GUI for KindleUnpack) has been updated to allow
passing in of optional apnx files

- KindleUnpack_ReadMe.htm has also been updated with the new options

- Improved Palm Section Maps/Descriptions in DUMP mode to reduce the number
of unknown data dumps generated and hopefully allow new section types to be
more easily detected in the future.

Thanks to DiapDealer and Tkeo for testing and help debugging the new features.

Please report any bugs or issues here and I will try to deal with them. There may be inadvertent breakage of older features due to the refactoring but hopefully all will be well.

Co-Developers: I have completely refactored the code because the kindleunpack.py file was simply getting too cumbersome to deal with and could not be easily followed or read. There are now more associated mobi_*.py library files. Long routines have been split into more easily followed and understood pieces, etc.

Given the large number of resulting changes, I will not be posting a full diff.
That said, the code in kindleunpack.py is now hopefully much more readable and supportable.

Tkeo: I have changed very little in the mobi_opf.py file, so most of your epub3 changes will hopefully still apply with few or only minor fixes. If not, let me know and I will help apply them by hand.

Tkeo: Also, I would like your help moving much of the RESC support code from kindleunpack.py back into mobi_k8resc.py by simply passing in k8proc (to prevent the needless copying over of structures stored in k8proc). Thanks!

KevinH · 06-17-2014, 01:18 PM

tkeo,

Since the bulk of the refactoring is now complete, I am ready to merge in your final epub 3 changes. Please let me know if my refactoring caused any trouble with your changes.

If not, feel free to incorporate your latest epub3 support changes into v071 to create v071a and post it for testing.

Hopefully, that will bring KindleUnpack up to date with everything we know about Kindle file formats.

Thanks,

KevinH

KevinH · 06-17-2014, 01:23 PM

KindleUnpack Co-Developers and Interested Parties,

There is still a lot we do not know in case anyone wants to jump in ...

1. the kindlegen generated CONT section (HD_CONTAINER) is actually a full Header of some sort with lots of unknown fields and its own EXTH section. I have added the code to dump the new EXTH section, but the fields and what they mean are still unknown.

2. how to unpack an azk file generated for iPhone (it appears to be a zip archive with a set of gzipped json objects (skeleton and fragments) and other pieces, similar to an azw3 skeleton and fragments?)

3. what an azw6 file is and how to unpack it
I am hoping, since they are paired with azw3 pieces, that azw6 files represent a set of HDImages stored inside some kind of container (see the CONT section info above). But this is just a wild guess until we get our hands on one.

So if anyone likes to reverse-engineer things, please take a shot at any of the above.
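For anyone who wants to poke at point 2, here is a minimal Python sketch. It assumes the guess above is right, namely that an azk file is a zip archive whose members are gzipped json objects. Nothing about the real azk layout is confirmed here, and the function name list_azk_members is made up for illustration. Members that do not parse as json are simply returned as bytes.

```python
import gzip
import json
import zipfile
import zlib


def list_azk_members(path_or_file):
    """Return {member_name: parsed_json_or_bytes} for a zip archive.

    ASSUMPTION: azk is a zip of gzipped json pieces (an unconfirmed guess).
    """
    results = {}
    with zipfile.ZipFile(path_or_file) as zf:
        for name in zf.namelist():
            raw = zf.read(name)
            try:
                # gzip streams start with the magic bytes 0x1f 0x8b
                if raw[:2] == b"\x1f\x8b":
                    raw = gzip.decompress(raw)
                results[name] = json.loads(raw.decode("utf-8"))
            except (OSError, EOFError, zlib.error, ValueError):
                # Not json: keep the bytes (decompressed if gunzip worked)
                results[name] = raw
    return results
```

Dumping the keys of each parsed object would be a reasonable first step toward seeing whether the pieces really resemble the azw3 skeleton and fragment tables.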

Thanks,

KevinH

DaleDe · 06-17-2014, 04:07 PM

As to azk, I have placed some information in the wiki that should help in unpacking this file.

Dale

tkeo · 06-18-2014, 08:11 AM

Hi Kevin,

Firstly, thank you for updating KindleUnpack with new
features. I will modify my epub3-supporting version to fit the
newer version.

I have found and fixed some bugs. (I have not changed the version number since it is still at the experimental stage.)

Thanks,
tkeo

tkeo · 06-18-2014, 08:43 AM

Kevin,

I would like to ask you for a modification to the refactoring.

Could you allow me either

to remove the imgnames parameter from the functions listed below and change their return value from imgnames to imginfo, whose structure is [dir, imgname, type, secno, dataoffset, data(=None)], appending it in process_all_mobi_headers(),

or

to change the parameter from imgnames to imglist (= a list of imginfo)?

The list of functions to modify:

processSRCS(), processPAGE(), processCMET(), processFONT(), processCRES(), processCONT(), processkind(), processRESC(), processImage().

I am considering moving all calls to write(), except those for DUMP, into process_all_mobi_headers(). This would make the code easier to understand, allow writing files to the mobi8 folder directly instead of copying them from the mobi7 folder, and allow creating epub files from the imglist.
I think it will also make it easier to support HD images, since making the epub for the HD images requires either recreating the XHTML files or renaming the HD image files.
I attach the newest preview version I have, as a reference.
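To make the proposal concrete, a minimal sketch of the imginfo record follows. Only the field names come from the structure given above. The namedtuple, the helper collect_imginfo and the sample values are illustrative assumptions and are not taken from the attached preview. It uses Python 3.7+ for the defaults argument.

```python
from collections import namedtuple

# Fields as proposed: [dir, imgname, type, secno, dataoffset, data(=None)]
ImgInfo = namedtuple(
    "ImgInfo",
    ["dir", "imgname", "type", "secno", "dataoffset", "data"],
    defaults=(None,),  # only the last field, data, defaults to None
)


def collect_imginfo(records):
    """Hypothetical stand-in for the appending in process_all_mobi_headers().

    Each process*() function is assumed to return an ImgInfo, or None
    when the section produced no file.
    """
    imglist = []
    for rec in records:
        if rec is not None:
            imglist.append(rec)
    return imglist


# Sample values are invented for illustration only
imglist = collect_imginfo([
    ImgInfo("Images", "image00001.jpeg", "jpeg", 12, 0),
    None,
    ImgInfo("Fonts", "font00002.ttf", "ttf", 13, 0),
])
```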

Thanks,

KevinH · 06-18-2014, 09:48 AM

Hi tkeo,

Great work! It seems my refactoring had broken a number of things.
I will apply your patch and release v072 asap.

Thanks,
Kevin


KevinH · 06-18-2014, 10:06 AM

Hi tkeo,

If we did, we would probably have to rename it to resource_info, since it would need to store fonts, images, HDImages, possibly your RESC, and the pageMap info as well.

Also, I don't think we should be passing around lists with the actual data in them. Image and font data can be quite big, especially when all we need is the name, the type, and where it is stored.

If you don't like storing them in the mobi7 folder first, then I guess I don't understand why we can't simply write the files to a neutral location as we read them. Perhaps a base Images/ and HDImages/, and then in processMobiX put them in the proper location?
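A rough sketch of what I mean, in Python. The function name write_resource and the returned dict are hypothetical and not existing KindleUnpack code. The point is only that the data goes straight to disk in a neutral Images/ or HDImages/ folder, and what gets passed around afterwards is just the name, the type, and the location.

```python
import os


def write_resource(outdir, name, data, hd=False):
    """Write one resource to a neutral folder and return a small description.

    The returned record deliberately carries no raw data, only the name,
    the type (taken from the extension) and the folder it was written to.
    """
    folder = "HDImages" if hd else "Images"
    target_dir = os.path.join(outdir, folder)
    os.makedirs(target_dir, exist_ok=True)
    with open(os.path.join(target_dir, name), "wb") as f:
        f.write(data)
    ext = os.path.splitext(name)[1].lstrip(".").lower()
    return {"name": name, "type": ext, "folder": folder}
```

A later step such as processMobiX could then copy or move the files from the neutral folders into their final place using only these small records.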

Also for the mobi 8 we want to create both an epub file and leave it unpacked in place so that users can see what is there more easily.

I will take a look at your v067 code to get a better idea of how you are using it.

Thanks,

Kevin


KevinH · 06-18-2014, 11:20 AM

Hi All,

Yes my refactoring had broken a number of things which tkeo has caught and fixed!

So here is a bug-fix release, KindleUnpack_v072a.

Bugs fixed by tkeo and DiapDealer:
- Print Replicas should now work again
- RESC section processing should now work again
- Bug fix for page-map processing encodings
- obfuscating/mangling of previously obfuscated fonts should now work again

Attached is KindleUnpack_v072a.zip


KevinH · 06-18-2014, 12:22 PM

Hi tkeo,

I looked at your patch from v067 to add epub3 support and your comments.

I have a few things that I don't understand and therefore would like to discuss:

- why do you want to move writing of files back to process_all_mobi_headers?
In general, an object should know how and where it should write itself. Therefore, NAV, OPF, NCX etc should all know where and how to create themselves from the passed in data and the "files" object.

In fact, my inclination is to move the writing of the text/html files out of even processMobiX and out to header specific routines.

- your partslist[] simply duplicates much of what is in k8proc so I don't understand why it is needed. We can simply pass the k8proc object along if you need access to that information.

- you seem to duplicate information from k8proc to pull into k8resc. It would be easier to simply pass in the k8proc object if and where you need that information.

- your datalist[] simply duplicates much of what imgnames is used for and you never store any raw data in that list anyway. Why is the data offset information there if the data itself is never stored? Why do you need the section number? The directory and type information can easily be deduced from the file name extension, and metadata for (offsets).

- why do we need to know the width and height of the images? Is this needed to create svg based cover pages?

In general, I think it is easy in the opf to know where something is when we know what name and extension it has because we pass in our unpack structure files object and imgnames list.

Now if you really think you must have additional information, let's focus on reusing the data structures in k8proc for the text, css, and svg pieces and not add or use partslist[].

We could also simply rename imgnames to resource_names or something similar, since it can deal with fonts and the like, although a simple directory listing of the output file structure can tell you everything you need to know as well.

At minimum, resource_names would need the filename with extension (which is what it has now), and from the extension the opf would know where the file should be located and what it is.

But if you need more we should go for something as minimal as possible and not reinvent yet another data structure to pass around. From what I could see I simply do not think we need the section number, data offset, or data itself, ever. So let's figure out the very minimum needed and use that.
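As a sketch of how little is needed, here is the extension lookup in Python. The table EXT_INFO, the folder names and the function locate_resource are illustrative assumptions and not the actual mobi_opf.py code. The media types shown are the standard ones for these formats.

```python
import os

# Illustrative only: extension -> (assumed output folder, media type)
EXT_INFO = {
    ".jpg": ("Images", "image/jpeg"),
    ".jpeg": ("Images", "image/jpeg"),
    ".png": ("Images", "image/png"),
    ".gif": ("Images", "image/gif"),
    ".svg": ("Images", "image/svg+xml"),
    ".otf": ("Fonts", "application/vnd.ms-opentype"),
}


def locate_resource(filename):
    """Deduce (folder, media type) from the file name extension alone.

    Returns None for extensions that are not in the table.
    """
    ext = os.path.splitext(filename)[1].lower()
    return EXT_INFO.get(ext)
```

With something like this, a plain list of file names is enough for the opf to build its manifest entries.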

- Also do we really need to rename the cover image as "cover"?

- Also, can't we simply use the EmptyImagePlaceholders, the kindleembed string from the "kind" section, and info from the CONT metadata to overwrite the corresponding non-HD image file with the correct HDImage but keep its original name, so nothing in the OPF needs to care?

The CONT and following sections up to the container boundary are a simple one-to-one mapping from the first image to the last HDImage, with empty placeholders being used to indicate images that do not have HD replacements (so they only need to keep track of things up to the last HDImage).
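The one-to-one mapping could be sketched like this in Python. It assumes the HD sections have already been read into a list in which None stands for an empty placeholder. The real parsing of the CONT and "kind" sections is not shown, and the function name apply_hd_images is hypothetical.

```python
def apply_hd_images(imgnames, hd_entries):
    """Pair images with HD entries one-to-one, in order.

    hd_entries may be shorter than imgnames because the mapping only
    runs up to the last HDImage. None marks an empty placeholder, i.e.
    an image without an HD replacement.
    Returns {imgname: hd_data} for the files that should be overwritten.
    """
    replacements = {}
    # zip() stops at the shorter list, which matches "up to the last HDImage"
    for name, hd_data in zip(imgnames, hd_entries):
        if hd_data is not None:
            replacements[name] = hd_data
    return replacements
```

Because each replacement is keyed by the original image name, the HD data can be written over the existing file and the OPF never has to know.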

Please let me know what you think. We can discuss this further via personal messaging on this site so as to not spam the list if you so desire.

Take care and Thanks for all of your hard work!

KevinH


JSWolf · 06-18-2014, 12:42 PM

But do we really need or even want ePub 3 support, given that most devices we use don't support ePub 3? Will it still generate ePub 2 as well as ePub 3, or would we be stuck with ePub 3 only?

JSWolf · 06-18-2014, 12:44 PM

Quote:

Originally Posted by KevinH

Hi All;,

Yes my refactoring had broken a number of things which tkeo has caught and fixed!

So here is a bug fix release KindleUnpack_v072

Bugs Fixed by tkeo:
- obfuscating/mangling of previously obfuscated fonts should now work again

Could we have it so fonts are just left unobfuscated when all is done?

pdurrant · 06-18-2014, 12:53 PM

Quote:

Originally Posted by JSWolf

Could we have it so fonts are just left unobfuscated when all is done?

No. All epub readers understand obfuscated fonts.

KevinH · 06-18-2014, 02:11 PM

Hi,
Yes, tkeo's epub3 changes allow the user to select epub2 or epub3, or even let it auto-select based on features. So no worries there.

Quote:

Originally Posted by JSWolf

But do we really need or even want ePub 3 support, given that most devices we use don't support ePub 3? Will it still generate ePub 2 as well as ePub 3, or would we be stuck with ePub 3 only?
