KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 17

pdurrant · 01-12-2012, 01:45 PM

Quote:

Originally Posted by lizcastro

Thanks, Kevin! This is so helpful.

Can you confirm that the only thing mobi_unpack does is show what was in the mobi file? It doesn't generate anything, right?

When I convert an EPUB file to mobi with KindleGen2, and then unpack it with your latest version of mobi_unpack, I get a folder that contains a smaller version of the EPUB file than the original, an HTML file with what looks like the contents of the entire book, along with an ncx and opf file, and a folder with reduced size images.

Then, there's a K8 folder that contains a completely re-engineered set of files, all renamed, resized images, etc. of what was originally in my EPUB file.

And then there's a kindlegensrc.zip file, that when unzipped, contains my original unaltered files.

It all seems so excessive.

thanks,
Liz

The only new thing that Mobiunpack creates is the epub, which is generated from the K8 folder. The HTML, ncx, opf and folder of images are the Mobipocket version, the K8 folder is the new Kindle Format 8 version, and kindlegensrc.zip does indeed contain your original files, which are also stored in the Mobipocket file.

Yes, the output from the new KindleGen does contain the Mobipocket, KF8 and your source files, all wrapped up in one.

KevinH · 01-12-2012, 01:48 PM

Hi Liz,

It does unpack and generate things so that the end user could edit the files and drop them back on kindlegen to recreate a modified mobi.

The new kindlegen creates mobis (palm database files) that actually have two completely different versions of the ebook inside them (and I am not referring to the kindlegensrc.zip, which may also be stored there).

The first is the original mobi format ebook and immediately after it is the new K8 mobi ebook, all stored in the same .mobi palm database file.

So older technology can read the .mobi file from the top and see it as a normal mobi. Newer technology can then detect that this is a compound mobi file and actually open the second half which is the K8 formatted (html5 - basically a variation of an epub) to get all of the new features.
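
For anyone curious how the two halves can sit in one file, here is a minimal Python sketch of walking the record table of a Palm database. The header layout used (record count at byte 76, 8-byte record entries starting at byte 78) is the usual Palm database layout. Treating a record whose payload is the literal bytes BOUNDARY as the divider between the old mobi and the K8 part is an assumption for illustration, and build_toy_pdb, split_records and find_boundary are hypothetical helper names, not functions from mobi_unpack.py.

```python
import struct


def pdb_record_offsets(data):
    """Return the start offset of every record in a Palm database image."""
    # The record count is a big-endian 16-bit value at byte 76 of the header.
    (count,) = struct.unpack_from(">H", data, 76)
    offsets = []
    for i in range(count):
        # Each 8-byte table entry begins with a big-endian 32-bit offset.
        (offset,) = struct.unpack_from(">L", data, 78 + 8 * i)
        offsets.append(offset)
    return offsets


def split_records(data):
    """Slice the database image into its individual record payloads."""
    offsets = pdb_record_offsets(data) + [len(data)]
    return [data[a:b] for a, b in zip(offsets, offsets[1:])]


def find_boundary(records):
    """Index of the first record that is exactly b'BOUNDARY' (assumed marker)."""
    for index, record in enumerate(records):
        if record == b"BOUNDARY":
            return index
    return None


def build_toy_pdb(payloads):
    """Build a tiny in-memory Palm database, for demonstration only."""
    header = bytearray(78)
    header[0:4] = b"demo"                      # database name, zero padded
    struct.pack_into(">H", header, 76, len(payloads))
    position = 78 + 8 * len(payloads) + 2      # 2 padding bytes after the table
    table = b""
    for unique_id, payload in enumerate(payloads):
        # 32-bit offset, then 1 attribute byte + 3-byte unique id packed as one int.
        table += struct.pack(">LL", position, unique_id)
        position += len(payload)
    return bytes(header) + table + b"\x00\x00" + b"".join(payloads)


toy = build_toy_pdb([b"old mobi header", b"old text", b"BOUNDARY",
                     b"k8 header", b"k8 text"])
records = split_records(toy)
boundary = find_boundary(records)
old_part, k8_part = records[:boundary], records[boundary + 1:]
```

In this toy file an old reader would only ever look at old_part, while a newer one could skip past the marker and start from k8_part.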

Right now, mobi_unpack.py will create in the output folder the following:

1. from the old part of the .mobi it will create the source mobi markup (old html) and images that will allow the user to edit it any way they want and drop it back on kindlegen.

2. if the kindlegensrc.zip record is present it will unpack it so that the user can see the actual source ebook file (typically an epub) given to kindlegen. This record is typically removed by Amazon but is actually created by Kindlegen.

3. from the K8 version of the .mobi, it will create the K8 folder and inside it all of the images and fonts, and xhtml source files that were used to create it. A user who did not have access to the kindlegensrc.zip could edit this and then drop it on kindlegen to create a new/altered version of the ebook (fix typos, etc).

4. From the K8 pieces, it will actually build a complete epub, which it stores as well.

You can then compare the epub created from the K8 against the kindlegensrc.zip (typically an epub) to see what, if anything, the kindlegen processing changed.

All of this requires rebuilding and generation. The actual binary format inside of the mobi file needs to be decoded to make something that is usable in some way. If you want to see what the actual raw files look like, you can use Notepad++ or any good text editor to change one line near the top of mobi_unpack.py so that it writes out all of the raw text pieces as well.

So it is not simply something that dumps sections from the palm database file. It actually does that (the raw file) and then rebuilds it to try to get back to the original source so that authors and people can more easily edit their books and recreate mobi output using Kindlegen.

It is also useful for understanding the internal format of the new .k8 mobis and what if any tags are created and used.

If you have any other questions just ask.

Take care,

Kevin


lizcastro · 01-12-2012, 02:10 PM

Fascinating! Thanks so much for the info. And for mobi_unpack itself.

I find the fact that the mobi file contains a non-KF8 version, a KF8 version AND the original EPUB particularly interesting.

And I hate the way all the files get renamed! I assume that's KindleGen and not mobi_unpack.

Are either of you on Twitter? I'd love to follow you.

best,
Liz

KevinH · 01-12-2012, 03:08 PM

Hi Liz,

Inside the .mobi there are no file names at all. Each font, image, etc. is just stored in a section of the database (with no name info) and referred to from the processed html (i.e. all links are converted to section numbers in the .mobi palm database).

So all "names" are created by us (either based on the title) or simply numbered with img0001.jpg, font0002.ttf, part0004.xhtml, etc. We have no way of knowing what the original name was, whether it was a chapter, or section or ....
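
As a rough illustration of that numbering scheme, the sketch below invents a name for a nameless section from its section number and a guess at its type. The magic-byte checks are the well-known file signatures for JPEG, PNG, GIF and TrueType/OpenType data; the function names and the sect fallback prefix are hypothetical, and this is not how mobi_unpack.py itself decides, only the general idea.

```python
def sniff_kind(payload):
    """Guess (prefix, extension) for a section from its leading bytes."""
    if payload.startswith(b"\xff\xd8\xff"):
        return "img", "jpg"
    if payload.startswith(b"\x89PNG\r\n\x1a\n"):
        return "img", "png"
    if payload.startswith((b"GIF87a", b"GIF89a")):
        return "img", "gif"
    if payload.startswith(b"\x00\x01\x00\x00"):
        return "font", "ttf"
    if payload.startswith(b"OTTO"):
        return "font", "otf"
    # Unknown content: hypothetical fallback name so nothing is lost.
    return "sect", "dat"


def name_for_section(index, payload):
    """Build a numbered file name such as img0001.jpg for one section."""
    prefix, extension = sniff_kind(payload)
    return "%s%04d.%s" % (prefix, index, extension)


sections = [b"\xff\xd8\xff\xe0...jpeg data", b"\x00\x01\x00\x00...font data"]
names = [name_for_section(i, data) for i, data in enumerate(sections, start=1)]
```

Since the number is all that survives inside the file, the section index is the only stable thing a name can be based on.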

That is the main reason we need to re-generate things. Even in the older mobis, the mobi markup html that was input to kindlegen was processed to remove links, store images in sections, etc, and so we must reverse that to get back to something that can be edited by users.

As for twitter - I am too old to deal with anything new ;-)

But I am sure Paul, or DiapDealer or any of the other contributors from this forum topic (mobi_unpacker is really the joint effort of a lot of people) would be happy to answer any questions.

Take care,

KevinH


lizcastro · 01-12-2012, 03:18 PM

Whoa. I didn't realize. I sort of knew that mobi was this big mass of data, but didn't realize to what extent. So, if I understand correctly, mobi_unpack reverse engineers the mobi and then generates what the individual files would look like if they were individual files?

So it's not KindleGen that renames them, it's mobi_unpack, but it does so because it has no other choice, since the names are lost in the conversion to mobi?

But the kindlegensrc.zip file actually comes from a real, existing EPUB that's sitting there in the mobi file created by KindleGen?

Going to set WRITE_RAW_DATA to True now to see what happens.

thanks!

Liz

lizcastro · 01-12-2012, 03:31 PM

Why does mobi_unpack generate an EPUB file?

DiapDealer · 01-12-2012, 03:45 PM

Quote:

Why does mobi_unpack generate an EPUB file?

I will defer to Kevin for the final say on this question, but for myself... mobi_unpack generates an epub because the KF8 format itself is basically nothing more than a binary representation of an epub.

So since the original source won't be part of a commercially available, DRM-Free KF8 ebook, mobi_unpack decompiles the KF8 data into a familiar, standard, editable format that can be easily modified (or examined) with existing tools/programs and then fed right back to kindlegen.

KevinH · 01-12-2012, 04:07 PM

Hi,

Yes, exactly as DiapDealer said!

It is nice to have the kindlegensrc.zip but ebooks downloaded from Amazon won't have that. Amazon strips it off (and if they keep it they could start selling epubs if they ever wanted to as well).

So mobi_unpacker tries to recreate the original epub as close as it can be based on the K8 information (which is xhtml based with normal css that is essentially an epub with the main bits merged into one file with links replaced and a few other modifications).

Take a look at the _k8.raw file in a text editor to see what the kindlegen actually stores inside. You can find the css info stored at the end (inline) with any svg moved to there as well. You can see how they have replaced links with base 32 numbered references, added their own aid="", etc.
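
To make the base 32 point concrete, here is a small Python sketch that decodes such numbers. Only the fact that references are numbered in base 32 comes from the description above; the exact link shape matched by LINK_RE and the names decode_base32_number and decode_link are assumptions for illustration, so check them against what you actually see in the .rawml file.

```python
import re

# Assumed link shape: two base-32 fields using the digits 0-9 and A-V.
LINK_RE = re.compile(r"kindle:pos:fid:([0-9A-V]+):off:([0-9A-V]+)")


def decode_base32_number(text):
    """Convert a base-32 digit string (0-9, A-V) to an integer."""
    return int(text, 32)


def decode_link(link):
    """Split an assumed kindle:pos link into two plain integers."""
    match = LINK_RE.fullmatch(link)
    if match is None:
        raise ValueError("not a recognised link: %r" % link)
    return (decode_base32_number(match.group(1)),
            decode_base32_number(match.group(2)))


first_field, second_field = decode_link("kindle:pos:fid:000A:off:0000000010")
```

Here "000A" decodes to 10 and "0000000010" decodes to 32, which is the kind of conversion needed before a reference can be turned back into an ordinary href.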

The mobi_unpacker figures out how to reverse all of that to get back to as close to an epub as possible since that is the input format for kindlegen.

Take care,

Kevin


lizcastro · 01-12-2012, 04:11 PM

Hmm. I don't see the _k8.raw file. When I used WRITE_RAW_DATA=True, the only thing I got different was a .rawml file, but it looks a lot like the .html file on the non-kf8 side. Should I have modified some other setting?

KevinH · 01-12-2012, 04:17 PM

Hi,

Look for a file inside the K8 directory that is named after the title of the book and ends with .rawml (I used to call it _k8.raw but then moved it to inside the K8 so that it would not impact the raw version from the older mobi part of the ebook).

You should find the css at the end, links changed, aid="" placed in tags to augment the original id="", etc.

For fun you can look at the .rawml version outside of the K8 directory. It is how the original mobi markup language got processed by kindlegen. Check out the links, how styles are inlined, etc.

lizcastro · 01-12-2012, 04:42 PM

I see. Interesting.

Here's another question. If I'm selling mobi files directly, how do I get rid of the original EPUB? It seems like it would make the file unnecessarily large.

KevinH · 01-12-2012, 04:53 PM

Hi,

I believe Paul has a kindlegensrc stripper someplace? Try searching this Mobi forum and you should find a thread about it.

Ahh ... there is a KindleStrip program (next thread down I believe) that does what you want but it has not yet been updated to deal with the new kindlegen. I am sure someone here will soon patch it to make it work. And perhaps expand it to remove the K8 or older mobi parts as well. It makes no sense to ship so many copies of the ebook, it is just generating bloat.

lizcastro · 01-12-2012, 05:05 PM

Thanks!

pdurrant · 01-12-2012, 06:26 PM

Quote:

Originally Posted by KevinH

Ahh ... there is a KindleStrip program (next thread down I believe) that does what you want but it has not yet been updated to deal with the new kindlegen. I am sure someone here will soon patch it to make it work. And perhaps expand it to remove the K8 or older mobi parts as well. It makes no sense to ship so many copies of the ebook, it is just generating bloat.

Of course, Amazon want you to use KindleGen to create files to send to them to sell. They're not really concerned about people using it for private purposes, so the bloat doesn't matter. I'm sure that when they come to send the book out they'll strip it down to Mobi or KF8 (depending on the device it's being sent to), not both.

KevinH · 01-14-2012, 05:05 PM

Hi,

I have had access to more samples (including the fixed layout Children's sample) and therefore have:

- added support for image files used in CSS sheets (needed for fixed layout)

- modified the unpacker to deal with the extra metadata fields used by fixed-layout ebooks
"RegionMagnification", "fixed-layout",
"book-type", "orientation-lock", "original-resolution"

- identified the BOUNDARY section number

- builds the epub from the K8 pieces with compression now

- fixed the mobi_k8proc.py class code to be better encapsulated (added accessor methods)

- fixed support for older mobis with no ncx
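
On the compression item, the post does not show the packing code, so what follows is only a generic sketch using Python's standard zipfile module, not the code in mobi_unpack.py. It follows the usual epub container convention of writing the mimetype entry first and uncompressed, then deflating everything else. The function name build_epub and the sample file names are hypothetical.

```python
import io
import zipfile


def build_epub(files, out):
    """Write an epub container to `out` (a path or a file-like object).

    `files` maps archive paths to bytes; it must not include "mimetype".
    """
    with zipfile.ZipFile(out, "w") as archive:
        # The mimetype entry goes first and is stored without compression.
        archive.writestr("mimetype", b"application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        # Everything else is deflated to keep the epub small.
        for name in sorted(files):
            archive.writestr(name, files[name],
                             compress_type=zipfile.ZIP_DEFLATED)


buffer = io.BytesIO()
build_epub({
    "META-INF/container.xml": b"<container/>",
    "OEBPS/part0000.xhtml": b"<html><body>text</body></html>" * 50,
}, buffer)
entries = zipfile.ZipFile(buffer).infolist()
```

Passing a path instead of the BytesIO buffer writes the epub straight to disk.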

So attached is the very latest version of the experimental mobi_unpack.py program.

python ./mobi_unpack.py Jerome.mobi test/

PS: I have just updated the .zip attachment with all bug fixes I know about so far including some additional support for guide elements.

PPS: I have again now updated the .zip attachment to support multiple @import url statements in css.

PPPS: removed since DiapDealer has posted the latest version later on in this thread.



