KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 44

pdurrant · 12-19-2013, 04:42 PM

Quote:

Originally Posted by KevinH

Hi,
I have a bit. It is a zip archive. Inside you will find gzipped json pieces and scripts and something that looks like the skeleton and fragments found in KF8 mobi internals. Basically the skeletons are the file frameworks and the fragments are json objects. The proper fragment for where the user is reading seems to be loaded on demand using a javascript.

So it appears to be an html5 based web application container. Try renaming the .azk to .zip, unzipping it and then gunzipping the gzipped json object files. Looks like fun!

I gave up at that point as I simply did not care enough about things. I am sure it is related to the skeletons and fragments found in the KF8.

Wow. Not something we can really support in Mobiunpack then, or need to as it's a zip.
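For readers following along, the steps KevinH describes (treat the .azk as a zip, then gunzip the compressed JSON pieces) can be sketched in Python. This is only a minimal sketch under assumptions: the function name `unpack_azk` is hypothetical, and because the member names inside a real .azk are not given in this thread, the code detects gzipped members by their magic bytes rather than by name.

```python
import gzip
import io
import zipfile


def unpack_azk(azk_bytes):
    """Return {member name: bytes} for every file in a zip-style archive.

    Members that start with the gzip magic bytes (0x1f 0x8b) are gunzipped;
    all other members are returned unchanged.
    """
    contents = {}
    with zipfile.ZipFile(io.BytesIO(azk_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            data = archive.read(name)
            if data[:2] == b"\x1f\x8b":
                data = gzip.decompress(data)
            contents[name] = data
    return contents
```

Reading the file with `open(path, "rb").read()` and passing the bytes in avoids having to rename the .azk at all.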

KevinH · 12-19-2013, 07:59 PM

Quote:

Originally Posted by pdurrant

Wow. Not something we can really support in Mobiunpack then, or need to as it's a zip.

Hi Paul,

Yes but we could unzip and rebuild it similar to how we handle KF8 and make an epub-like file from it. If it really is similar to what the KF8 mobi pieces are (ie css, svg, skeletons and fragments) we should be able to do that with hopefully slight modifications to how we build an epub from an azw3 file.

"He said keeping his fingers crossed!"

Either way, definitely a project for the new year.

BTW ....

Happy Holidays / Merry Christmas!

AcidWeb · 01-10-2014, 06:00 AM

The performance issue with creating KF8 from a hybrid MOBI filled with images is still on the table.

Any help would be appreciated.

Hitch · 01-10-2014, 05:57 PM

Quote:

Originally Posted by AcidWeb

The performance issue with creating KF8 from a hybrid MOBI filled with images is still on the table.

Any help would be appreciated.

Is this the metadata tweaker thing?

H

AcidWeb · 01-11-2014, 02:50 AM

No, it's a completely different issue. The metadata tweaker was a dead end. KevinH wrote working code, but after additional research we found that the approach is totally impractical. Using a hybrid MOBI is a no-go in this case.

The current KindleUnpack method of stitching KF8 image records tremendously increases processing time when a book contains only images.

Hitch · 01-11-2014, 05:01 AM

Quote:

Originally Posted by AcidWeb

No, it's a completely different issue. The metadata tweaker was a dead end. KevinH wrote working code, but after additional research we found that the approach is totally impractical. Using a hybrid MOBI is a no-go in this case.

The current KindleUnpack method of stitching KF8 image records tremendously increases processing time when a book contains only images.

OK, so, you're having performance issues (they make pills for that, I hear), with the amount of time it takes to...what, BUILD a mobi?...from an unpacked source which has a lot of images? Can you a) confirm that's right, b) advise what you consider to be a "lot" of time, and c) give me an idea of the usual image size inside the mobi and the file size you're running?

Hitch

AcidWeb · 01-11-2014, 05:18 AM

I'm just reminding everyone about the issue that was discussed a few pages ago, with all the details.

Hitch · 01-11-2014, 05:03 PM

Quote:

Originally Posted by AcidWeb

I'm just reminding everyone about the issue that was discussed a few pages ago, with all the details.

I figured that, but I am being lazy. ;-)

h

tkeo · 02-07-2014, 08:33 AM

Hi,

I have modified the KindleUnpack package. The main aim of this modification is to process books with right-to-left page progression properly.

I have added:

the page-progression-direction attribute in the spine tag,
some id_map_strings,
K8 RESC section processing.

I attach the modified version to this post. I hope it works correctly in any environment.

I removed the attached file due to a bug and posted a fixed one.

Thanks,
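As an illustration of the first item in the list above, here is a minimal sketch of emitting an OPF spine tag that carries the page progression direction. The `page-progression-direction` attribute on `<spine>` comes from the EPUB 3 package spec; the function name `build_spine` and its parameters are hypothetical and are not taken from the attached modification.

```python
from xml.sax.saxutils import quoteattr


def build_spine(idrefs, toc_id="ncx", direction=None):
    """Build an OPF <spine> block, optionally with a page direction.

    direction should be "ltr" or "rtl"; any other value (including None)
    leaves the attribute out.
    """
    attrs = ["toc=" + quoteattr(toc_id)]
    if direction in ("ltr", "rtl"):
        attrs.append("page-progression-direction=" + quoteattr(direction))
    lines = ["<spine " + " ".join(attrs) + ">"]
    for idref in idrefs:
        lines.append("<itemref idref=" + quoteattr(idref) + "/>")
    lines.append("</spine>")
    return "\n".join(lines)
```

Calling `build_spine(["p1", "p2"], direction="rtl")` yields a spine whose opening tag includes `page-progression-direction="rtl"`, while omitting `direction` gives a spine without that attribute.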

KevinH · 02-07-2014, 10:08 PM

Hi,

Thanks for your modifications. It seems a significant percent of your changes have to do with parsing the RESC section. The text direction itself is stored in the exth metadata. The cover image info is also available from exth values.

So what types of useful information are you capturing from the RESC section? For most of the examples I have seen, it is basically a small shell and not that useful.

Will you please post a small mobi azw3 style test case for rtl ebooks and a second test case that shows significant information in the RESC section that can't be found in other places in the EXTH metadata?

Also, perhaps we should pull the RESC parsing code into its own file to make the changes more self-contained and easier to follow.

Thanks,

KevinH
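To make the point about EXTH metadata concrete, here is a rough sketch of walking EXTH records and pulling out a page direction value. The layout used (the `EXTH` magic, a 4-byte header length, a 4-byte record count, then records made of a 4-byte type, a 4-byte length that includes the 8 header bytes, and the payload) is the commonly documented one. The record type 527 for page-progression-direction is an assumption on my part and should be checked against real files; the function names are hypothetical.

```python
import struct

# Assumption: EXTH record type 527 carries page-progression-direction.
PAGE_PROGRESSION_DIRECTION = 527


def parse_exth(exth):
    """Parse an EXTH block into {record type: [payload bytes, ...]}."""
    if exth[:4] != b"EXTH":
        raise ValueError("not an EXTH block")
    _header_len, count = struct.unpack_from(">LL", exth, 4)
    records = {}
    pos = 12
    for _ in range(count):
        rec_type, rec_len = struct.unpack_from(">LL", exth, pos)
        if rec_len < 8:
            raise ValueError("corrupt EXTH record length")
        # rec_len counts the 8 bytes of type and length as well.
        records.setdefault(rec_type, []).append(exth[pos + 8:pos + rec_len])
        pos += rec_len
    return records


def page_direction(exth):
    """Return the page direction string, or None if the record is absent."""
    values = parse_exth(exth).get(PAGE_PROGRESSION_DIRECTION)
    return values[0].decode("utf-8") if values else None
```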

Quote:

Originally Posted by tkeo

Hi,

I have modified the KindleUnpack package. The main aim of this modification is to process books with right-to-left page progression properly.

I have added:

the page-progression-direction attribute in the spine tag,
some id_map_strings,
K8 RESC section processing.

I attach the modified version to this post. I hope it works correctly in any environment.

tkeo · 02-08-2014, 01:07 AM

Quote:

Originally Posted by KevinH

Hi,

Thanks for your modifications. It seems a significant percent of your changes have to do with parsing the RESC section. The text direction itself is stored in the exth metadata. The cover image info is also available from exth values.

So what types of useful information are you capturing from the RESC section? For most of the examples I have seen, it is basically a small shell and not that useful.

Will you please post a small mobi azw3 style test case for rtl ebooks and a second test case that shows significant information in the RESC section that can't be found in other places in the EXTH metadata?

Also, perhaps we should pull the RESC parsing code into its own file to make the changes more self-contained and easier to follow.

Thanks,

KevinH

Hi KevinH,

Thanks for your comments.

Yes, you are right. The text direction and the cover image info are stored in the EXTH. The most important information I need to retrieve from the RESC is the page-spread property in each spine itemref tag, which is necessary to show images that span two pages correctly in landscape view.
I will prepare and post an example later.

I think the cover image info and the spine itemref ids in the RESC help bring the output closer to the source ebook that kindlegen processed. But I'm not sure whether anyone wants that.
I've also found "creator role" and "creator display-seq", which might allow more detailed retrieval; however, I could not find out how to match the entries in the metadata with those in the RESC when there are multiple creators.

I will consider separating the code. Currently, the modified parts are integrated into mobi_k8proc.py and mobi_k8opf.py, in order to find the correspondences from the spine itemrefs in the RESC to the original K8Processor class, based on skeleton ids and xhtml filenames.

Thanks,
tkeo
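A small sketch of the page-spread retrieval described above, using regular expressions. It assumes the RESC section contains OPF-like `<itemref>` tags whose `properties` attribute holds EPUB 3 style values such as `page-spread-left` or `page-spread-right`; the sample markup and the function name are hypothetical, and real RESC data may carry extra attributes.

```python
import re

# Matches <itemref ...> and <itemref .../>, capturing the attribute text.
ITEMREF_RE = re.compile(r"<itemref\b([^>]*?)/?>", re.IGNORECASE)
# Matches name="value" pairs (double-quoted values only).
ATTR_RE = re.compile(r'([\w:-]+)\s*=\s*"([^"]*)"')


def spine_page_spreads(resc_xml):
    """Return [(idref, spread)] where spread is 'left', 'right', or None."""
    result = []
    for match in ITEMREF_RE.finditer(resc_xml):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        spread = None
        for prop in attrs.get("properties", "").split():
            if prop.startswith("page-spread-"):
                spread = prop[len("page-spread-"):]
        result.append((attrs.get("idref"), spread))
    return result


sample = """<spine page-progression-direction="rtl">
<itemref idref="p1" properties="page-spread-right"/>
<itemref idref="p2" properties="page-spread-left"/>
<itemref idref="p3"/>
</spine>"""
print(spine_page_spreads(sample))
```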

tkeo · 02-08-2014, 08:06 AM

Quote:

Originally Posted by tkeo

Hi,

I have modified the KindleUnpack package. The main aim of this modification is to process books with right-to-left page progression properly.

I have added:

the page-progression-direction attribute in the spine tag,
some id_map_strings,
K8 RESC section processing.

I attach the modified version to this post. I hope it works correctly in any environment.

Hi,

I found a huge bug in this modification. Some K8 ebooks can be processed but others cannot. I am fixing it now.

I am sorry,
tkeo

tkeo · 02-09-2014, 01:11 AM

Quote:

Originally Posted by tkeo

Hi,

I found a huge bug in this modification. Some K8 ebooks can be processed but others cannot. I am fixing it now.

I am sorry,
tkeo

Hi,

I've fixed the bugs in the KindleUnpack v63 I posted previously.

I have also made examples of rtl books.
The source (rtl_example1_src.zip) of the first one (rtl_example1.mobi) is hand-written HTML. The second one (rtl_example2.mobi) was generated by Kindle Comic Creator. Both have page-spread properties in their spine itemrefs.

BTW, I encountered a curious phenomenon. Ebooks generated by Kindle Comic Creator (e.g. rtl_example2.mobi) can be unpacked; however, the created epub files are not accepted by kindlegen, whereas the unzipped kindlegensrc.zip files are accepted. This occurs with v62 too.

Thanks,

KevinH · 02-09-2014, 12:28 PM

Hi tkeo,

Thank you for the examples, I will play around with them. I can't believe that a Kindle device supports the spine/page spread properties by keeping and parsing the RESC section on the fly during reading. My guess is they must include or encode that information in some other way, but that is just a guess and I could be wrong.

I had never heard of the page spread properties and so searched up on them. They seem to be specific to fixed layout and comics.

Many of the spine properties you are parsing for in the RESC are not part of the official epub 2 spec at all and are epub 3 or non-universal epub 2 extensions.

KindleUnpack tries to generate a working epub that meets epub 2 specs since as far as I know there are no true shipping epub 3 devices.

Your features technically would require an epub 3 spec book, or us just adding them to epub 2 and hoping that the mix works just fine. I am not sure that is the right approach.

Perhaps it would be better to create a separate version of KindleUnpack that tries its best to create an epub 3 like output since current Kindle AZW3 is someplace between epub 2 and epub 3.

Quote:

Originally Posted by tkeo

Hi,
BTW, I encountered a curious phenomenon. Ebooks generated by Kindle Comic Creator (e.g. rtl_example2.mobi) can be unpacked; however, the created epub files are not accepted by kindlegen, whereas the unzipped kindlegensrc.zip files are accepted. This occurs with v62 too.
Thanks,

The kindlegensrc is always a zip of what you input to kindlegen and so it can always be used and will work with that version of kindlegen.

The epub 2 like structure we generate is not from kindlegensrc but instead from reverse compiling the AZW3. If you have access to kindlegensrc then you should not need KindleUnpack unless you want to explore just how the raw AZW3 text is generated or interpreted by kindlegen.

Many times the user only has access to a shipping AZW3 or a stripped AZW3 (these will not have the SRCS section) and KindleUnpack will do its best to decompile the AZW3 back to something usable.

KindleUnpack will not generate an exact replica of the input sources nor is it even guaranteed to generate a working epub! But in most cases, if a valid verified epub2 is input into kindlegen, then KindleUnpack will generate a valid working epub2.

If the user inputs old/broken html or even old mobi 6s into kindlegen, it will create the mobi/azw3, but when that is unpacked using KindleUnpack, this software will do its best but will most likely not generate a valid epub2.

So ignoring fixed layout books for the moment and comics, do you have any test cases that show a valid epub 2 being given to kindlegen that KindleUnpack unpacks to a non-valid epub 2?

If so, I would consider them bugs and so would love to have a bug report with a testcase that shows this behaviour.

I will look closer at what you have done but as Kindlegen supports more and more epub 3 as valid input, we will need to create a new version of KindleUnpack that unpacks to an epub 3-like container and not an epub 2.

Frankly after studying the epub 3 spec, it seems the people who created the spec don't really understand what ebooks are all about and are completely missing the fundamental concept that simpler is better for all things. What a mess!

Unfortunately, right now with the various private extensions supported for fixed layout, comics, and multi-media ebooks all differing by epub vendor (Apple vs ADE/Kobo) and Amazon and the huge overhead and unnecessary complexity of epub 3, we are in some no-man's-land between official epub 2 and some fantasy epub 3.

KevinH
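For anyone wanting to check whether a given file still carries the embedded kindlegen source mentioned above, here is a rough sketch that walks the PalmDB record table and looks for a section starting with the `SRCS` magic. It assumes the usual PalmDB layout (a 2-byte big-endian record count at offset 76, followed by 8-byte record entries whose first 4 bytes are the record offset). The function names are hypothetical, and the sketch does not attempt to extract the zip itself.

```python
import struct


def palmdb_sections(data):
    """Yield the raw bytes of each record in a PalmDB container."""
    (count,) = struct.unpack_from(">H", data, 76)
    offsets = [struct.unpack_from(">L", data, 78 + 8 * i)[0] for i in range(count)]
    offsets.append(len(data))  # the last record runs to the end of the file
    for start, end in zip(offsets, offsets[1:]):
        yield data[start:end]


def has_srcs(data):
    """Return True if any record starts with the b'SRCS' magic."""
    return any(section[:4] == b"SRCS" for section in palmdb_sections(data))
```

A stripped AZW3 would be expected to return False here, which matches the point that such files give KindleUnpack only the compiled records to work from.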

KevinH · 02-09-2014, 01:18 PM

Hi tkeo,

One other thing, I am not a big fan of minidom at all. It seems generally bloated and barfs if any true unicode is used (at least on 2.X). I see you wrote both an xml.dom.minidom version and a regular expression version of things. Every time I have used an XML ElementTree or some other XML parser (either standard package or add-ons) in python 2.X I have run into problem cases that simply do not parse well or get confused with encodings, resulting in non-robust operation on some platforms (Mac, Win, or Linux).

So unless you feel strongly about it (and given the re vs dom code sizes are about the same), I would rather stick with the regular expression version, as regular expressions are easier for people to modify and fix and are robust to most encoding issues.

I see you have also written a metadata parsing routine that supports epub 3 like "refines" on named items. This is quite nice but using it in epub 2 spec devices might cause problems.

I really think we should incorporate your code and try and create an epub 3 generator version of KindleUnpack to stay in epub 3 space and not try to mix private extensions into what is primarily epub 2 code.

What do you think?

KevinH
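To illustrate the epub 3 "refines" idea in the regular expression style preferred above, here is a minimal sketch that attaches `<meta refines="#id" property="...">` values to the `<dc:creator>` with the matching id. The sample metadata and the function name are hypothetical; this shows how epub 3 OPF metadata links the two through ids and makes no claim about how the RESC section stores the same information.

```python
import re

# Text content is limited to [^<]* so a match cannot run across other tags.
CREATOR_RE = re.compile(r"<dc:creator\b([^>]*)>([^<]*)</dc:creator>")
REFINES_RE = re.compile(r"<meta\b([^>]*)>([^<]*)</meta>")
ATTR_RE = re.compile(r'([\w:-]+)\s*=\s*"([^"]*)"')


def creators_with_refinements(opf_metadata):
    """Return one dict per creator: its name plus any refining properties."""
    refinements = {}
    for raw_attrs, text in REFINES_RE.findall(opf_metadata):
        attrs = dict(ATTR_RE.findall(raw_attrs))
        target = attrs.get("refines", "")
        if target.startswith("#") and "property" in attrs:
            refinements.setdefault(target[1:], {})[attrs["property"]] = text.strip()
    creators = []
    for raw_attrs, text in CREATOR_RE.findall(opf_metadata):
        attrs = dict(ATTR_RE.findall(raw_attrs))
        entry = {"name": text.strip()}
        entry.update(refinements.get(attrs.get("id"), {}))
        creators.append(entry)
    return creators


sample = """<dc:creator id="c1">First Author</dc:creator>
<dc:creator id="c2">Second Author</dc:creator>
<meta name="cover" content="img1"/>
<meta refines="#c1" property="role" scheme="marc:relators">aut</meta>
<meta refines="#c1" property="display-seq">1</meta>
<meta refines="#c2" property="role" scheme="marc:relators">ill</meta>"""
print(creators_with_refinements(sample))
```

An epub 2 reader would simply not understand the `refines` metas, which is the compatibility concern raised in the post.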



