KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 10

avid-e-reader · 09-05-2011, 04:11 AM

(we crossed paths in the bitstream)
So if the kindlegensrc.zip is supposedly the source files, it is not the exact source files: the directory structure is modified, and the files are tweaked to reflect the changed directory structure.

Maybe exporting any unknown binary data into files would make disassembling it a bit easier, at least. I have no clue where the .ncx file goes, or what format it gets placed in (maybe Kovid does), but without any obvious way of looking at it, it is pretty hard to figure that out. It seems reasonably likely, if there are more binary pieces that mobiunpack presently ignores, that one (or more) of them is the .ncx data.

And maybe others would be the .mp3 files I was asking about earlier, although sadly adding .mp3 files is something that keeps slipping further out in my project list.

siebert · 09-05-2011, 04:12 AM

Quote:

Originally Posted by avid-e-reader

But here's something I don't understand: there is also a content.opf in the kindlegensrc.zip file, but it doesn't seem to match the one generated by mobiunpack.

The kindlegensrc.zip contains the (slightly modified) sources used by kindlegen to create the mobi file. The content of kindlegensrc.zip should be sufficient to recreate the mobi file (with the exception of some fields which are not created by kindlegen based on the source).

If you have a kindlegensrc.zip, you can just ignore the remaining output of mobiunpack.

Unfortunately, most mobi files don't contain the record that holds the kindlegensrc.zip, so using its content to improve the mobiunpack output won't help in most cases.

But at least the new mobiwriter in calibre should handle ncx files, so the calibre source should give the information how the ncx content is encoded in the mobi file.

Ciao,
Steffen

pdurrant · 09-05-2011, 04:17 AM

Quote:

Originally Posted by avid-e-reader

A little more info: the <manifest> tag should probably look like:

<item href="misc/toc.ncx" id="toc" media-type="application/x-dtbncx+xml" />

but with the file name possibly different in different cases, based on the actual .ncx file name found? But here's something I don't understand: there is also a content.opf in the kindlegensrc.zip file, but it doesn't seem to match the one generated by mobiunpack.

The kindlegensrc.zip file is just extracted from the penultimate record in the Kindle ebook. It's put in there by Kindlegen, but is not actually used by any rendering software.
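To make that concrete, here is a minimal Python sketch that splits the file into its PalmDB records and looks inside the second-to-last one. It is not the MobiUnpack code. The record table layout (a 2-byte record count at byte 76, then 8-byte entries starting at byte 78) is the usual PalmDB container layout and is an assumption here; the function names are made up for this example, and searching for the ZIP signature instead of using a fixed offset is just a cautious guess.

```python
import struct
from typing import List, Optional


def palmdb_records(data: bytes) -> List[bytes]:
    """Split a PalmDB container into its raw records (assumed layout)."""
    (count,) = struct.unpack_from(">H", data, 76)  # number of records
    # Each 8-byte table entry starts with a 4-byte big-endian file offset.
    offsets = [struct.unpack_from(">L", data, 78 + 8 * i)[0] for i in range(count)]
    offsets.append(len(data))  # the last record runs to the end of the file
    return [data[offsets[i]:offsets[i + 1]] for i in range(count)]


def embedded_source_zip(data: bytes) -> Optional[bytes]:
    """Return the ZIP bytes found in the penultimate record, or None."""
    records = palmdb_records(data)
    if len(records) < 2:
        return None
    candidate = records[-2]
    start = candidate.find(b"PK\x03\x04")  # ZIP local file header signature
    return candidate[start:] if start >= 0 else None
```

If this returns None for a book, the file most likely has no embedded source archive at all.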

All the other files generated by MobiUnpack are generated by decoding the info in the Kindle ebook. In particular, the opf file is put together from bits of info in the header, EXTH records, and even from the HTML. It should have most of the info in the original opf file, but not all that info will actually be contained in the Kindle ebook.
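For anyone wondering what "EXTH records" look like, the sketch below walks such a block in Python. The layout it assumes (the tag "EXTH", a 4-byte header length, a 4-byte record count, then records made of a 4-byte type, a 4-byte length that includes those 8 bytes, and the payload) comes from community notes on the format, not from any official documentation, so treat it as an assumption. The function name is made up for illustration.

```python
import struct
from typing import List, Tuple


def parse_exth(block: bytes) -> List[Tuple[int, bytes]]:
    """Return (record_type, payload) pairs from an EXTH block (assumed layout)."""
    if block[:4] != b"EXTH":
        raise ValueError("not an EXTH block")
    _header_len, count = struct.unpack_from(">LL", block, 4)
    pos = 12  # first record follows the 12-byte EXTH header
    records = []
    for _ in range(count):
        rec_type, rec_len = struct.unpack_from(">LL", block, pos)
        if rec_len < 8:
            raise ValueError("record length smaller than its own header")
        records.append((rec_type, block[pos + 8:pos + rec_len]))
        pos += rec_len  # the length covers the 8 header bytes as well
    return records
```

Which record type ends up in which opf element is a separate lookup table, and any type missing from that table is what would show up as unknown metadata.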

avid-e-reader · 09-05-2011, 04:21 AM

And regarding .mp3 files, here's a sample .mobi with .mp3 that I got from somewhere, maybe with the Kindlegen documentation?

pdurrant · 09-05-2011, 04:22 AM

Quote:

Originally Posted by siebert

But at least the new mobiwriter in calibre should handle ncx files, so the calibre source should give the information how the ncx content is encoded in the mobi file.

Oooo... I wonder if the developer of that has documented it in the wiki? That would make life easier. Hmm.. apparently not. When I have some spare time I'll check the calibre sources.

siebert · 09-05-2011, 04:27 AM

Quote:

Originally Posted by pdurrant

Oooo... I wonder if the developer of that has documented it in the wiki? That would make life easier. Hmm.. apparently not. When I have some spare time I'll check the calibre sources.

As calibre can also decode a mobi, there might even exist some python code in calibre which creates the ncx file from an existing mobi.

Ciao,
Steffen

siebert · 09-05-2011, 04:30 AM

Quote:

Originally Posted by avid-e-reader

And regarding .mp3 files, here's a sample .mobi with .mp3 that I got from somewhere, maybe with the Kindlegen documentation?

While reverse-engineering should be possible with only the mobi, it would be much easier if you could provide the sources for an example book which contains an mp3 file. Someone could then build two mobi files from the source (one with the mp3 and one without) and analyse the differences to learn how mp3 support is encoded.
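A record-by-record comparison of the two builds could be sketched like this in Python. It assumes the usual PalmDB record table (record count at byte 76, 8-byte entries from byte 78); the function names are invented for the example.

```python
import struct
from typing import List, Optional, Tuple


def palmdb_records(data: bytes) -> List[bytes]:
    """Split a PalmDB container into its raw records (assumed layout)."""
    (count,) = struct.unpack_from(">H", data, 76)
    offsets = [struct.unpack_from(">L", data, 78 + 8 * i)[0] for i in range(count)]
    offsets.append(len(data))
    return [data[offsets[i]:offsets[i + 1]] for i in range(count)]


def changed_records(old: bytes, new: bytes) -> List[Tuple[int, Optional[int], Optional[int]]]:
    """List (index, old_length, new_length) for every record position that differs."""
    a, b = palmdb_records(old), palmdb_records(new)
    report = []
    for i in range(max(len(a), len(b))):
        ra = a[i] if i < len(a) else None  # None means "no record at this index"
        rb = b[i] if i < len(b) else None
        if ra != rb:
            report.append((i,
                           None if ra is None else len(ra),
                           None if rb is None else len(rb)))
    return report
```

The comparison is purely positional, so an added record shifts everything after it. The first index in the report is the interesting one; the header record will usually differ too, since it stores record counts and indexes.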

Ciao,
Steffen

avid-e-reader · 09-05-2011, 04:44 AM

The source was just an .html file and a .mp3 file (in a subdirectory named multimedia). Attached as a .zip.

DaleDe · 09-05-2011, 10:57 AM

Quote:

Originally Posted by avid-e-reader

A little more info: the <manifest> tag should probably look like:

<item href="misc/toc.ncx" id="toc" media-type="application/x-dtbncx+xml" />

but with the file name possibly different in different cases, based on the actual .ncx file name found? But here's something I don't understand: there is also a content.opf in the kindlegensrc.zip file, but it doesn't seem to match the one generated by mobiunpack.

Of course it does not match. The kindlegensrc is likely an epub source file, while mobiunpack generates a mobi source file. These are not the same thing and are not even created with the same version of the IDPF standard. Perhaps you do not realize that there was an earlier version of the eBook standards that was originally used by eBook readers as a source file. The Mobi, Lit and eBookwise IMP formats were all derived from that earlier standard. See our wiki under Open eBook for more details.

Hitch · 09-05-2011, 04:53 PM

Quote:

Originally Posted by avid-e-reader

And regarding .mp3 files, here's a sample .mobi with .mp3 that I got from somewhere, maybe with the Kindlegen documentation?

The Jabberwocky mobi rather notoriously does not work. I'd say, therefore, that it's a skosh useless as an exemplar.

Hitch

siebert · 09-06-2011, 06:28 AM

Quote:

Originally Posted by avid-e-reader

The source was just an .html file and a .mp3 file (in a subdirectory named multimedia). Attached as a .zip.

I've modified this sample to also add a video file and ran mobiunpack on it.

The handling of audio and video files is almost identical to image files (surprise!).

The only difference is that there is a 12 byte header prepended to the original audio/video file which starts with "AUDI" or "VIDE" followed by 2 integers of unknown value.
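Stripping that header could look like the following Python sketch. The 12 bytes are read as the 4-byte tag plus two 32-bit integers; reading them as big-endian is my assumption (their meaning is unknown), and the function name is made up.

```python
import struct
from typing import Optional, Tuple

MEDIA_TAGS = {b"AUDI": "audio", b"VIDE": "video"}


def split_media_record(record: bytes) -> Optional[Tuple[str, Tuple[int, int], bytes]]:
    """Return (kind, (int1, int2), original_file_bytes), or None for other records."""
    kind = MEDIA_TAGS.get(record[:4])
    if kind is None or len(record) < 12:
        return None
    # Two integers of unknown meaning; big-endian 32-bit is an assumption.
    unknown = struct.unpack_from(">LL", record, 4)
    return kind, unknown, record[12:]  # the rest is the original audio/video file
```

Records that don't start with one of the two tags are left alone, so the normal image handling can take them.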

Also, quite similar to the image handling, the source attributes of the html tags are replaced with the record numbers:

src="file.mp3" -> mediarecindex="00002"
poster="file.jpg" -> recindex="00003"

So it should be easy to add support for audio/video to mobiunpack.
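Going the other way, i.e. putting file names back into the unpacked HTML, could look roughly like this Python sketch. It only touches audio and video tags, because recindex is also used for ordinary images. The mapping from record number to file name is hypothetical and would have to come from whatever names the unpacker gives the extracted files.

```python
import re
from typing import Dict

TAG_RE = re.compile(r"<(?:audio|video)\b[^>]*>", re.IGNORECASE)
ATTR_RE = re.compile(r'\b(mediarecindex|recindex)="(\d+)"')


def restore_media_sources(html: str, names: Dict[int, str]) -> str:
    """Rewrite mediarecindex/recindex in audio and video tags back to src/poster."""

    def fix_attr(match: "re.Match[str]") -> str:
        name = names.get(int(match.group(2)))
        if name is None:
            return match.group(0)  # unknown record number: leave untouched
        attr = "src" if match.group(1) == "mediarecindex" else "poster"
        return '{}="{}"'.format(attr, name)

    def fix_tag(match: "re.Match[str]") -> str:
        return ATTR_RE.sub(fix_attr, match.group(0))

    return TAG_RE.sub(fix_tag, html)
```

For example, with names = {2: "multimedia/file.mp3"}, a tag containing mediarecindex="00002" gets src="multimedia/file.mp3" back.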

But is audio/video support really used in the wild?

My understanding is that only very few Kindle platforms support them (is there a list of the supported platforms?)

Ciao,
Steffen

pdurrant · 09-06-2011, 09:46 AM

Well, I took a quick look at where the ncx file might be being stored, and it turns out that when an ncx file is added to the sources, you get three extra records added to the Mobipocket file.

Here's the source NCX file, along with the three added sections of the Mobipocket file (separated out into individual files).

I don't have time to properly decode the binary formats, but if anyone fancies a puzzle, here they are. The task is to work out how to reconstruct (as best as possible) the source ncx file from the compiled binary files.

DiapDealer · 09-06-2011, 10:22 AM

Quote:

Originally Posted by pdurrant

I don't have time to properly decode the binary formats, but if anyone fancies a puzzle, here they are. The task is to work out how to reconstruct (as best as possible) the source ncx file from the compiled binary files.

For those who may be looking for insight into the ncx reconstruction from calibre source-code, I'd start with calibre/ebooks/mobi/input.py. Which will lead you to calibre/ebooks/mobi/reader.py... specifically the MobiReader class and its extract_contents function.

I can't get my head around it all quite yet, but maybe someday!

kaizoku · 09-10-2011, 11:07 AM

Getting some unknown Metadata error with this sample file. Rename the file to .azw4.

pdurrant · 09-10-2011, 01:01 PM

Quote:

Originally Posted by kaizoku

Getting some unknown Metadata error with this sample file. Rename the file to .azw4.

I think that unknown Metadata should only be showing as a warning. There is almost always some unknown metadata, as the Mobipocket/Print Replica file format is undocumented.

MobiUnpack used to ignore it, now it mentions it. You can ignore it.


