KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 78

DiapDealer · 06-25-2015, 06:39 AM

Quote:

Originally Posted by kyzcreig

Well I'll be. This did the trick. Now the main obstacle is how I parse this jumble of HTML. It seems like some tags are cut off so I'll need a solution for that as well. So far this is turning out very nicely though!

If you're after the rendered text, you'll probably need some sort of parser that can handle malformed HTML. You'll also need to determine the character encoding (usually UTF-8, but quite often cp1252 as well). Your best bet is to convert any HTML entities to their character equivalents and then parse to get the text.
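As a rough illustration of the steps described above (guess the encoding, turn entities into characters, then pull text out of markup that may have unclosed tags), here is a minimal sketch using only the Python standard library. The names `decode_bytes`, `TextCollector` and `rendered_text` are made up for this example, and the UTF-8-then-cp1252 fallback is an assumption based on the two encodings mentioned, not something KindleUnpack itself prescribes.

```python
import html  # not strictly needed: convert_charrefs below handles entities
from html.parser import HTMLParser


def decode_bytes(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to cp1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, hence errors="replace".
        return raw.decode("cp1252", errors="replace")


class TextCollector(HTMLParser):
    """Collects text nodes; html.parser does not raise on unclosed tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def rendered_text(raw: bytes) -> str:
    """Decode a chunk of markup and return its text with whitespace collapsed."""
    parser = TextCollector()
    parser.feed(decode_bytes(raw))
    parser.close()  # flush any text still buffered at the end of the chunk
    return " ".join("".join(parser.parts).split())
```

Note that this sketch does not insert whitespace between block-level elements, so adjacent paragraphs run together unless the source already has whitespace between them. A more forgiving third-party parser may cope better with chunks that cut a tag in half.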

kyzcreig · 06-26-2015, 12:33 PM

Yup, it turns out BeautifulSoup is very robust for this purpose. I had to increase the imprecision of the chunks and make sure I parsed out semi-complete elements, but otherwise I got exactly what I needed.

Super happy about this!

JSWolf · 06-29-2015, 07:03 AM

Where is the standalone version of MobiUnpack? All I see are two ZIP files with the app version. No standalone exists.

DiapDealer · 06-29-2015, 07:31 AM

Quote:

Originally Posted by JSWolf

Where is the standalone version of MobiUnpack? All I see are two ZIP files with the app version. No standalone exists.

What do you mean "standalone"? If you mean one single script, there hasn't been a single-script version of KindleUnpack since version 0.32 (which is available in the first post). If you're talking about a command-line vs gui thing, it's all one package--has been for a long, long time. Everything you need to run from the command line is in the lib folder (kindleunpack.py). Just don't move files around, the directory structure needs to remain intact. The latest version is available via the first post and github.

The ONE zip with "app" in the title is the Applescript version.

R. Scot Johns · 07-02-2015, 09:10 PM

Is there a way to remove the duplicate mobi7 images folder from the split KF8?

I understand that KU is not intended as a converter per se, but for FXL the mobi7 files are redundant in the split out KF8, and only add to the file size. It would be highly useful to be able to output a KF8 FXL with the smallest file size possible for use in distribution via non-Amazon outlets. There is no way to do this currently so far as I know.

Or are the files in this folder somehow necessary for older Kindle models? Aside from the addition of a thumbnail cover in the Mobi7 folder (and their location obviously) they are identical, and therefore redundant. And there is nothing else in there.

Fantastic work on this over the years btw. It has been of inestimable value.

DiapDealer · 07-02-2015, 10:22 PM

I don't think I'm exactly sure what you mean. The mobi7 files/folder aren't a part of the split-out KF8 file--any more than the KF8 folder is. In addition to unpacking the content into the various folders (for inspection/editing/conversion, etc...), splitting a kindlegen-created book gives you an .azw3 (representing the KF8 portion) and a .mobi (representing the mobi7 portion).

Keep the AZW3 and delete the rest if that's all you're interested in after splitting. None of the files/folders are necessary for Kindles (new or old) except for two: the .mobi and the .azw3.

Quote:

It would be highly useful to be able to output a KF8 FXL with the smallest file size possible for use in distribution via non-Amazon outlets.

Keep in mind that charging money for Kindlegen-created books on non-Amazon outlets runs afoul of the terms under which you're granted permission to use Kindlegen free of charge. Distributing for free is fine, though. I don't pretend to know how aggressively they would pursue such a license violation, I just thought you should be aware.

Hitch · 07-02-2015, 10:56 PM

Quote:

Originally Posted by DiapDealer

I don't think I'm exactly sure what you mean. The mobi7 files/folder aren't a part of the split-out KF8 file--any more than the KF8 folder is. In addition to unpacking the content into the various folders (for inspection/editing/conversion, etc...), splitting a kindlegen-created book gives you an .azw3 (representing the KF8 portion) and a .mobi (representing the mobi7 portion).

Keep the AZW3 and delete the rest if that's all you're interested in after splitting. None of the files/folders are necessary for Kindles (new or old) except for two: the .mobi and the .azw3.

Keep in mind that charging money for Kindlegen-created books on non-Amazon outlets runs afoul of the terms under which you're granted permission to use Kindlegen free of charge. Distributing for free is fine, though. I don't pretend to know how aggressively they would pursue such a license violation, I just thought you should be aware.

AFAIK, "non-Amazon outlets," such as they are, don't charge delivery fees. Therefore, this seems like a lot of brain-damage, for no return whatsoever. Secondly, depending upon the type of FXL you're making, you won't have a KF7 portion (e.g., Comix, and kids' books) assuming, of course, that you're making the book the usual way, as opposed to using something like Calibre. Third, Amazon only charges delivery fees on the portion of the file that's delivered, e.g., the KF8 if it's a FXL, so...again, not sure as to the "why" here?

The only "non-Amazon outlets" that I know of don't ACCEPT FXL files for sale. (I guess that would be "fourth."). Where are you planning to sell this? Maybe someone here can help with that, if they have experience there.

FWIW.

Hitch

R. Scot Johns · 07-02-2015, 11:14 PM

Quote:

Originally Posted by DiapDealer

The mobi7 files/folder aren't a part of the split-out KF8 file--any more than the KF8 folder is. In addition to unpacking the content into the various folders (for inspection/editing/conversion, etc...), splitting a kindlegen-created book gives you an .azw3 (representing the KF8 portion) and a .mobi (representing the mobi7 portion).

This is not true from what I'm seeing. If you unpack the resulting .azw3 file that is split out it results in both a mobi8 folder (with complete file structure and the recreated epub file), plus a mobi7 folder which contains a duplicate set of images. You will also now get the HDImages in their separate folder if you don't select the "Use HD Images if Present".

Is KindleUnpack creating this mobi7 folder on extracting the .azw3, or is this part of its internal files?

The reason I was checking is due to the file size of the resulting .azw3, which is the same size as the unsplit mobi file. Is this due to the presence of the source epub in the KF8 version, or both HD and compressed images? My assumption, based on the extracted .azw3 is that the mobi7 folder is still in there. But perhaps I'm misinterpreting the data here.

Quote:

Originally Posted by DiapDealer

Keep in mind that charging money for Kindlegen-created books on non-Amazon outlets runs afoul of the terms under which you're granted permission to use Kindlegen free of charge. Distributing for free is fine, though. I don't pretend to know how aggressively they would pursue such a license violation, I just thought you should be aware.

By "non-Amazon" outlets I am referring in this case to distribution of my own books on my own website, which I have done now for seven or eight years, both for sale and as gratis promos. Many authors do, as Amazon are well aware. But you are correct that legally speaking files produced by KindleGen can only be distributed outside Amazon for "non-commercial" purposes.

DiapDealer · 07-03-2015, 09:15 AM

Quote:

Originally Posted by R. Scot Johns

This is not true from what I'm seeing. If you unpack the resulting .azw3 file that is split out it results in both a mobi8 folder (with complete file structure and the recreated epub file), plus a mobi7 folder which contains a duplicate set of images. You will also now get the HDImages in their separate folder if you don't select the "Use HD Images if Present".

Yes. This is because the dual-format kindlegen .mobi shares the images between the two formats.

I think you may be confusing what is getting unpacked with what is being included in the .mobi and .azw3 files after being split. Even a standalone azw3 file (no mobi7 component present) will produce a mobi7 folder (with images) when unpacked with KindleUnpack. That's just the way it works.

Quote:

Is KindleUnpack creating this mobi7 folder on extracting the .azw3, or is this part of its internal files?

As mentioned above the mobi7 folder is entirely a product of KindleUnpack's process for unpacking the Kindlebook's content. There is no "mobi7" folder (or any files/folders for that matter) inside a binary Kindlebook.

Quote:

The reason I was checking is due to the file size of the resulting .azw3, which is the same size as the unsplit mobi file. Is this due to the presence of the source epub in the KF8 version, or both HD and compressed images? My assumption, based on the extracted .azw3 is that the mobi7 folder is still in there. But perhaps I'm misinterpreting the data here.

When you say "it's the same size," do you mean exactly, or are you ball-parking? With an image-intensive fixed-layout kindlebook, it wouldn't surprise me if the resulting azw3 wasn't that much smaller than the original dual-format mobi (because of the sharing of images between formats mentioned above), but it should still be smaller because the original source is no longer included in the resulting azw3. Are you absolutely sure you're starting with a dual-format, Kindlegen-created file? Does Kindlegen even create a mobi-only portion when compiling a fixed-layout Kindlebook?

A dual-format kindlebook (created with Kindlegen) that is unpacked with KindleUnpack (with the 'split' box checked) should result in a mobi7 folder (with contents other than just images--opf, html, ncx, etc...), a mobi8 folder (containing an OEBPS file structure and an epub file), and an HDImages folder (that may or may not contain images). In addition to those three folders and their contents, there should be four other files (if the 'Split' option was selected) at the same directory level as the three above-mentioned folders:

a mobi8-<something,something>.azw3 file
a mobi7-<something,something>.mobi file
a kindlegensrc.zip file
a kindlegenbuild.log file

Without the 'Split' option selected, the results would be the same--with the exception of the last four files not being present.

If those three folders, and those four files aren't present after unpacking (with the split option checked), then you didn't have a dual-format, kindlegen-created file to begin with (either that or Amazon have changed the kindlegen output in a way that KindleUnpack doesn't yet account for).
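The checklist above can be turned into a quick sanity check on an output directory. This is only a sketch: the folder names (`mobi7`, `mobi8`, `HDImages`) and file patterns are taken from the wording of this post, and `missing_split_outputs` is a hypothetical helper, not part of KindleUnpack.

```python
from pathlib import Path

# Names as described in the post above; treat them as assumptions.
EXPECTED_DIRS = ("mobi7", "mobi8", "HDImages")
EXPECTED_FILE_PATTERNS = (
    "mobi8-*.azw3",
    "mobi7-*.mobi",
    "kindlegensrc.zip",
    "kindlegenbuild.log",
)


def missing_split_outputs(outdir):
    """Return the expected folders and file patterns that are absent from outdir."""
    root = Path(outdir)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [p for p in EXPECTED_FILE_PATTERNS if not any(root.glob(p))]
    return missing
```

An empty list means all three folders and all four files were found. Anything returned points at the piece to look into first.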

Quote:

By "non-Amazon" outlets I am referring in this case to distribution of my own books on my own website, which I have done now for seven or eight years, both for sale and as gratis promos. Many authors do, as Amazon are well aware. But you are correct that legally speaking files produced by KindleGen can only be distributed outside Amazon for "non-commercial" purposes.

That's fine. I just wanted to make sure you were aware. That decision is entirely yours (and the other authors').

R. Scot Johns · 07-03-2015, 10:38 AM

Quote:

Originally Posted by DiapDealer

Even a standalone azw3 file (no mobi7 component present) will produce a mobi7 folder (with images) when unpacked with KindleUnpack. That's just the way it works

But why? What is the point in producing a mobi7 component on extract from the split azw3 if it's not in there in the first place? That creates an inaccurate representation of the source content.

Quote:

Originally Posted by DiapDealer

When you say "it's the same size," do you mean exactly, or are you ball-parking?

source file mobi: 107,299 KB
split out azw3: 107,281 KB
split out mobi7: 39,155 KB

So not exact, but close enough. Both effectively 104 MB. Minus only 18 kb of data. Moreover, the two split files combine to greater than the source content, so something is being duplicated that should not, if there is no mobi7 component in the azw3 as you say.
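For reference, the arithmetic behind those three figures can be spelled out as below. The variable names are made up for the example; the numbers are the ones quoted in this post.

```python
# File sizes in KB, as reported above.
source_mobi_kb = 107_299
split_azw3_kb = 107_281
split_mobi7_kb = 39_155

saved_kb = source_mobi_kb - split_azw3_kb       # 18
combined_kb = split_azw3_kb + split_mobi7_kb    # 146_436
excess_kb = combined_kb - source_mobi_kb        # 39_137
source_mb = source_mobi_kb / 1024               # roughly 104.8
```

So the two split files together exceed the source by about 39,137 KB. That would be consistent with data that both formats share in the source being written into each split file, though the numbers alone do not prove it.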

Quote:

Originally Posted by DiapDealer

Are you absolutely sure you're starting with a dual-format, Kindlegen-created file?

Yes, I produced the file myself with Kindlegen, as I have with thousands of others. I should mention that I'm the author of a book on the subject of fixed layout for Kindle, just to make it clear I am not unversed on this topic.

Also, just for reference, when creating the source file with Kindlegen I used -dont_append_source, so there should be no source files present internally to begin with.

Quote:

Originally Posted by DiapDealer

Does Kindlegen even create a mobi-only portion when compiling a fixed-layout Kindlebook?

It must, since it lists the mobi7 deliverable file size in the output log during conversion, even when you're making a KF8 FXL.

Quote:

Originally Posted by DiapDealer

A dual-format kindlebook (created with Kindlegen) that is unpacked with KindleUnpack (with the 'split' box checked) should result in a mobi7 folder (with contents other than just images--opf, html, ncx, etc...), a mobi8 folder (containing an OEBPS file structure and an epub file), and an HDImages folder (that may or may not contain images). In addition to those three folders and their contents, there should be four other files (if the 'Split' option was selected) at the same directory level as the three above-mentioned folders:

a mobi8-<something,something>.azw3 file
a mobi7-<something,something>.mobi file
a kindlegensrc.zip file
a kindlegenbuild.log file

Without the 'Split' option selected, the results would be the same--with the exception of the last four files not being present.

If those three folders, and those four files aren't present after unpacking (with the split option checked), then you didn't have a dual-format, kindlegen-created file to begin with (either that or Amazon have changed the kindlegen output in a way that KindleUnpack doesn't yet account for).

Yes, that is exactly what I have after unpack, as always. I've been using KU for a number of years now, so I know how it works. I was just asking if the reason the split azw3 is so big is that it still contains the mobi7 component. But you have answered that.

My understanding, then, just to clarify, is that the split azw3 contains both sets of images for dual-format (low-res/hi-res for sending to HD/non-HD devices), but no actual mobi7 component, and that this is just created upon unpack/split by KU, drawing from (i.e. duplicating on extract) the low-res/compressed image files, thereby producing more actual content than was really in there in the first place (obviously, since the combined files are greater than the source). I just was unclear as to what exactly was happening, but you have answered that. So thanks.

DiapDealer · 07-03-2015, 11:49 AM

Quote:

Originally Posted by R. Scot Johns

But why? What is the point in producing a mobi7 component on extract from the split azw3 if it's not in there in the first place? That creates an inaccurate representation of the source content.

It doesn't matter for your purposes. The content in the mobi7 folder does not affect the size or the content of your resulting azw3. And it's not "producing a mobi7 component on extract from the split azw3." It's producing all of the content from both the mobi7 portions of the original file and the kf8 portions of the original file. Selecting the 'Split' box doesn't change that. The split feature is a completely separate function from the unpacking process. Meaning that the split-out files are not created from the unpacked content (nor is the unpacked data being extracted from the split-out azw3/mobi). Unpacking and Splitting are completely independent of each other. The standalone KindleUnpack tool just doesn't give one the option to Split without also unpacking everything.

Quote:

My understanding, then, just to clarify, is that the split azw3 contains both sets of images for dual-format (low-res/hi-res for sending to HD/non-HD devices), but no actual mobi7 component, and that this is just created upon unpack/split by KU, drawing from (i.e. duplicating on extract) the low-res/compressed image files, thereby producing more actual content than was really in there in the first place (obviously, since the combined files are greater than the source).

Sort of (except for the "more content than was really there in the first place" part) ... but since the unpacked content has absolutely no bearing on the two files (.mobi and .azw3) produced when splitting, it doesn't really matter.

Quite simply put: if all you're interested in is splitting to get a stand-alone AZW3, the rest of the extracted data is irrelevant.

Quote:

I just was unclear as to what exactly was happening, but you have answered that. So thanks.

Glad to help.

R. Scot Johns · 07-03-2015, 01:52 PM

Quote:

Originally Posted by DiapDealer

And it's not "producing a mobi7 component on extract from the split azw3." It's producing all of the content from both the mobi7 portions of the original file and the kf8 portions of the original file.

Let me clarify this a bit. What I did was unpack the .azw3 that was unpacked from the original mobi file. So it's a second unpacking, as it were. If you unpack and split the source mobi, then unpack the resulting split azw3, you get a mobi7 folder full of images (but nothing else) as well as all the normal content in the mobi8 folder, the combined size of which is larger than the azw3 source, even discounting the recreated epub3.

Therefore, the unpack process in this case is actually creating a mobi7 component from the split KF8 file, if, as you say, it is not present in the split azw3 file. However, I believe it actually is present in the azw3, at least as a housing for the compressed images that are sent to non-HD devices.

The reasoning for this is due to some confusion on my part over the function of the "Use HD Images" option, and its resulting output. Here is what I'm seeing when unpacking the split azw3 with/without this option ticked:

Use HD Images:
* BOTH mobi7 and mobi8 folders contain HD images

Do NOT use HD Images:
* NEITHER the mobi7 nor the mobi8 folder contains HD images (i.e. both have compressed jpegs)
* HDImages folder created containing HD images

In neither of these cases do the resulting files equal the input file size (even discounting the produced epub3 source file, the size of which is relative to the images being used).

However, upon further inspection/calculation, the combination of one file from each iteration (i.e. one HD image folder plus one non-HD image folder) does equal almost exactly the input file size. This leads me to believe that both are present in the source azw3, but are not being extracted accurately (that is, with one folder containing each version of the images). Only when unticking "Use HD Images" do you get both, but in this case you actually get two sets of compressed jpegs as well as the original size images in their separate folder. This is where my confusion lay.

As my understanding has been that the purpose of KU is to ascertain what exactly is occurring in the source conversion, my natural presumption was that the unpacked content reflected what was actually in that file (with the caveats for the reproduced epub structure files). Perhaps I was expecting more fidelity than is intended.

It is ultimately not important at this point, other than as an academic exercise, which is always useful to further understanding in my experience. Otherwise, I'm walking around looking like this: [image], which is not uncommon.

Thanks for bearing with me as I muddle through this. Mostly I could have worked it out on my own, but sometimes it's easier just to ask.

KevinH · 07-05-2015, 06:49 PM

Hi,
You really don't seem to understand what a mobi ebook actually is, so please let me try to explain a bit.

It is a compiled ebook format that uses a Palm database structure - a set of starting offsets to binary data, referred to as either sections or records depending on who you ask. What KindleUnpack does is examine these binary sections, identify any sections that are headers, and then use them to identify the starting section numbers where images, text, index information, etc. are stored, and then extract them to files. The data from these files is used to create HTML 3.2 code that can be fed back into kindlegen for the older mobi 7 pieces, and to create an epub-like structure for the KF8 pieces. If you actually want to see the rawml you can dump that as well.
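To make the "set of starting offsets" idea concrete, here is a small sketch that reads the section table of a Palm database. It relies on the commonly documented layout (a 78-byte header with the section count at byte 76, followed by 8-byte table entries) and is not KindleUnpack's own code. `section_offsets` and `build_palmdb` are hypothetical names, and `build_palmdb` exists only to produce a tiny synthetic file to try the reader on.

```python
import struct


def section_offsets(data: bytes):
    """Return (start, end) byte ranges for every section of a Palm database."""
    (count,) = struct.unpack_from(">H", data, 76)  # number of sections
    starts = []
    for i in range(count):
        # Each 8-byte entry: 4-byte offset, 1-byte attributes, 3-byte unique id.
        (offset,) = struct.unpack_from(">L", data, 78 + 8 * i)
        starts.append(offset)
    # A section runs up to the start of the next one (or the end of the file).
    ends = starts[1:] + [len(data)]
    return list(zip(starts, ends))


def build_palmdb(sections, ident=b"BOOKMOBI"):
    """Build a minimal synthetic Palm database (demo helper, not a real ebook)."""
    header = bytearray(78)
    header[0:4] = b"demo"          # database name field (32 bytes, NUL padded)
    header[60:68] = ident          # type + creator
    struct.pack_into(">H", header, 76, len(sections))
    table, body = b"", b""
    offset = 78 + 8 * len(sections)
    for uid, payload in enumerate(sections):
        table += struct.pack(">LL", offset, uid)  # attributes byte 0 + 3-byte id
        body += payload
        offset += len(payload)
    return bytes(header) + table + body
```

With the ranges in hand, `data[start:end]` gives the raw bytes of each section. Working out which sections are headers, text or images is the part that the headers themselves describe.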

The header sections also have EXTH records that contain the MetaData information. If you want to understand the exact layout of the mobi file, simply run DumpMobiHeader_v018.py or later and look at the description of what is stored in each section of the palm database file.
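For anyone curious what reading those EXTH records involves, here is a rough sketch. It assumes the commonly documented layout of section 0: a 16-byte PalmDOC header, then a `MOBI` header whose length sits at byte 20, a flags word at byte 128 whose 0x40 bit signals an EXTH block, and the block itself starting right after the MOBI header. `exth_records` is a hypothetical name, and DumpMobiHeader remains the better tool for real inspection.

```python
import struct


def exth_records(record0: bytes):
    """Return a list of (type, data) pairs from the EXTH block of section 0."""
    if record0[16:20] != b"MOBI":
        raise ValueError("section 0 does not hold a MOBI header")
    (mobi_len,) = struct.unpack_from(">L", record0, 20)
    (flags,) = struct.unpack_from(">L", record0, 128)
    if not flags & 0x40:  # bit 0x40 set means an EXTH block is present
        return []
    pos = 16 + mobi_len
    if record0[pos:pos + 4] != b"EXTH":
        raise ValueError("EXTH marker not found where expected")
    (count,) = struct.unpack_from(">L", record0, pos + 8)
    pos += 12  # skip marker, block length and record count
    records = []
    for _ in range(count):
        rec_type, rec_len = struct.unpack_from(">LL", record0, pos)
        if rec_len < 8:
            raise ValueError("corrupt EXTH record length")
        # rec_len includes the 8 bytes of type and length fields.
        records.append((rec_type, record0[pos + 8:pos + rec_len]))
        pos += rec_len
    return records
```

The data is returned as raw bytes because its meaning (text, number, offset) depends on the record type.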

For joint mobis, the images are not duplicated; they are stored after the mobi7 header and before the kf8 header. Later mobis can also have a completely separate container of HDImages and placeholders.

When KindleUnpack unpacks image sections (and fonts and RESC sections) it stores them all in a mobi 7 folder and copies the correct pieces to the mobi 8 folder as needed. When KindleUnpack unpacks from the HD Container, it will store these images in their own HDImage folder as they cannot be shared with a mobi 7. There is a switch to have the HDImages overwrite their low resolution cousins.
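A toy model may help show why the unpacked folders can add up to more than the file they came from. This is not KindleUnpack's code. It simply encodes the behaviour described above, with the simplifying assumptions that every image is copied to the mobi8 folder, that HD images share names with their low-resolution versions, and that the HD folder ends up empty when the overwrite switch is on. The sizes are invented.

```python
def unpacked_layout(low_res, hd, use_hd_images):
    """Toy model: map folder name -> {image name: size in KB}."""
    shared = dict(low_res)
    if use_hd_images:
        shared.update(hd)      # HD versions overwrite their low-res cousins
        hd_folder = {}
    else:
        hd_folder = dict(hd)   # HD images kept apart in their own folder
    return {
        "mobi7": shared,
        "mobi8": dict(shared),  # a copy of the same images
        "HDImages": hd_folder,
    }


def total_kb(layout):
    """Sum of all image sizes across the unpacked folders."""
    return sum(sum(folder.values()) for folder in layout.values())


low = {"a.jpg": 100, "b.jpg": 120}   # stored once in the file: 220 KB
hd = {"a.jpg": 400, "b.jpg": 500}    # stored once in the HD container: 900 KB
stored_kb = sum(low.values()) + sum(hd.values())        # 1120
without_switch = total_kb(unpacked_layout(low, hd, False))  # 1340
with_switch = total_kb(unpacked_layout(low, hd, True))      # 1800
```

In this model neither unpacked total matches what is stored, while one low-resolution folder plus the HD folder does, which is the pattern reported earlier in the thread.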

So please run DumpMobiHeader and examine the section map to see what is actually being stored inside the palm database structure.

If you have further questions, post the output of DumpMobiHeader from running on your mobi so that I understand exactly what it is you are asking. It will even work on DRMd ebooks since the headers themselves and most images are not typically encrypted.

Hope this helps,

KevinH

R. Scot Johns · 07-05-2015, 07:37 PM

Yeah, thanks Kevin, that clarifies a lot. This is not exactly stuff your average ebook creator knows, or even needs to know, or is very easy to find out for that matter. This is why we ask questions. If I already knew the answers I wouldn't bother. It's not as if it's innate information we're all born with. Most content creators don't even know how to use Kindlegen, let alone KindleUnpack. I was just trying to understand exactly how it works, based on the evidence I had before me, so thank you for your explanation.

pdurrant · 07-06-2015, 03:43 AM

Quote:

Originally Posted by R. Scot Johns

Yeah, thanks Kevin, that clarifies a lot. This is not exactly stuff your average ebook creator knows, or even needs to know, or is very easy to find out for that matter. This is why we ask questions. If I already knew the answers I wouldn't bother. It's not as if it's innate information we're all born with. Most content creators don't even know how to use Kindlegen, let alone KindleUnpack. I was just trying to understand exactly how it works, based on the evidence I had before me, so thank you for your explanation.

See also the wiki page on the file format for the gory bit-level details.



