View Full Version : KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files



adamselene
11-12-2009, 01:50 PM
Most of this post is now by pdurrant.

KindleUnpack is a set of Python scripts that take a Kindle/Mobipocket ebook and extract the HTML, images and metadata contained in the ebook, putting them in a form suitable for passing to KindleGen.

For KF8 files and combined Mobipocket/KF8 files, it can also produce separate Mobipocket and KF8 files, as well as the original source files if those are included in the ebook. In addition, for KF8 files it can produce an 'ePub', although if the HTML isn't compliant with ePub standards, the 'ePub' won't be either.

For Amazon's .azw4 files, it will extract the PDF that's been wrapped up in Amazon's .azw4 file format.

Downloads available:
Version 0.73 of the Python scripts (http://www.mobileread.com/forums/attachment.php?attachmentid=125429&stc=1&d=1405529949) (including a .pyw graphical front end)
Version 0.67 of a drag&drop AppleScript version (http://www.mobileread.com/forums/attachment.php?attachmentid=124072&stc=1&d=1402607644).
A calibre plugin version of the scripts is available in this thread (http://www.mobileread.com/forums/showthread.php?t=171529).

For anyone not interested in KindleGen and KF8, there's a copy of the last version of the single-file script, mobiunpack 0.32 (http://www.mobileread.com/forums/attachment.php?attachmentid=89514&d=1342902594).

The name of the script was changed to KindleUnpack with version 0.6.1.

The Python scripts are released under GPLv3. The AppleScript wrapper is released under the Unlicense (http://unlicense.org/).

Many thanks to adamselene for the base code which has been built on by many of the participants of this thread.

pdurrant



[Original Post:]
I reimplemented huff/cdic decompression in Python, and did a few other things while I was at it. The new script:

* decompresses about 25x faster than mobihuff.py
* uses much less memory (about 16x on my largest test file)
* implements conversion of uncompressed and Palmdoc-compressed files
* handles trailing data correctly in all cases

Check it out: http://www.mit.edu/afs/athena/user/m/y/mycroft/mobiunpack.py
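For context on what the script is doing: PalmDoc compression is a simple byte-oriented LZ77 variant, so a pure-Python decompressor is short. The sketch below follows the format description on the MobileRead wiki, not the linked script itself:

```python
def palmdoc_decompress(data):
    """Decompress one PalmDoc-compressed record (per the published spec)."""
    out = bytearray()
    i = 0
    while i < len(data):
        c = data[i]
        i += 1
        if c >= 0xC0:
            # 0xC0-0xFF: a space plus the ASCII character in the low 7 bits.
            out += b' '
            out.append(c ^ 0x80)
        elif c >= 0x80:
            # 0x80-0xBF: two-byte LZ77 back-reference: 11-bit distance and
            # 3-bit length (stored length + 3 bytes are copied).
            c = (c << 8) | data[i]
            i += 1
            dist = (c >> 3) & 0x7FF
            for _ in range((c & 7) + 3):
                out.append(out[-dist])
        elif c >= 0x09:
            # 0x09-0x7F: a literal byte.
            out.append(c)
        elif c >= 0x01:
            # 0x01-0x08: copy the next 1-8 bytes through unchanged.
            out += data[i:i + c]
            i += c
        else:
            # 0x00: a literal NUL byte.
            out.append(c)
    return bytes(out)
```

Each text record in the book is decompressed independently, which is why the trailing-data handling mentioned above matters: any per-record trailing bytes have to be stripped before decompression.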

PLEASE NOTE that this tool is only for decompressing unencrypted Mobipocket files. It does not decrypt DRMed files. Do not ask me for help breaking DRM.

adamselene
11-13-2009, 11:22 PM
The latest version (0.07, same location) is even faster—now about 50x as fast as mobihuff.py.

quocsan
11-14-2009, 07:20 AM
Great job!
Thank you, Adamselene.

HansTWN
11-15-2009, 09:04 PM
time to get working on those Topaz files! Wink, wink!

pdurrant
02-05-2010, 12:37 PM
PLEASE NOTE that this tool is only for decompressing unencrypted Mobipocket files. It does not decrypt DRMed files. Do not ask me for help breaking DRM.

Many thanks for this. I have moved the latest versions into the first post in this thread (http://www.mobileread.com/forums/showthread.php?t=61986) now. (Being a moderator has some advantages.)

soalla
02-05-2010, 01:33 PM
thanks to both of you!!

pdurrant
02-05-2010, 06:05 PM
I've now tweaked the script to also output the images.

Note that the HTML file is the raw contents of the Mobipocket file, and so the img attributes in it aren't proper HTML, and don't point to the extracted images. To get working images in the HTML, a bit of search/replace will be needed, although it should be possible to do it with a single grep, as I've tried to make the file names easy to use with what's in the HTML file.
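For anyone wanting to script that search/replace: the raw Mobipocket HTML references images by record index (a recindex attribute) rather than by file name. A hedged sketch, assuming the extracted images are named with the same zero-padded number (the attribute name and file-naming scheme are assumptions about the script's output, so check yours first):

```python
import re

def fix_img_links(html):
    # Turn Mobipocket's recindex attributes into ordinary src attributes
    # pointing at the extracted images (assumed to be images/NNNNN.jpg).
    return re.sub(
        r'recindex="(\d+)"',
        lambda m: 'src="images/%05d.jpg"' % int(m.group(1)),
        html)

print(fix_img_links('<img recindex="00003" />'))
# prints: <img src="images/00003.jpg" />
```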

Jellby
02-06-2010, 04:04 AM
Pssst, remove the __MACOSX directory ;)

pdurrant
02-06-2010, 10:54 AM
Pssst, remove the __MACOSX directory ;)

OK, done. Saved 48 bytes!

pdurrant
02-09-2010, 03:03 PM
Tweaked again, mostly by some_updates from the Dark Reverser's blog comments, to output some of the metadata from the file.

I've added to his work by getting the metadata output as an opf file resembling the original file used to generate the Mobipocket file.

However, the raw output of the 'html' in the Mobipocket file needs a fair bit of work yet before it'll be possible to regenerate the file using Mobipocket Creator or KindleGen.

That's my eventual aim with this, however.

pdurrant
02-18-2010, 10:46 AM
However, the raw output of the 'html' in the Mobipocket file needs a fair bit of work yet before it'll be possible to regenerate the file using Mobipocket Creator or KindleGen.

That's my eventual aim with this, however.

By taking code from other sources and tweaking it, Version 0.17 (above) now creates an opf file, a folder of images, and an html file that are ready for use with Mobipocket Creator.

Simply opening the opf file with Mobipocket Creator, and choosing to build re-creates the original Mobipocket book. Of course, it also means that it's easy to correct any typos in the HTML file first.

quocsan
02-19-2010, 05:33 AM
By taking code from other sources and tweaking it, Version 0.17 (above) now creates an opf file, a folder of images, and an html file that are ready for use with Mobipocket Creator.

Simply opening the opf file with Mobipocket Creator, and choosing to build re-creates the original Mobipocket book. Of course, it also means that it's easy to correct any typos in the HTML file first.

Thank you pdurrant!
You have done a great job!
BTW, could you please make a small Python script that can change eBook metadata (e.g. an eBook's title)?
Sometimes we need to change titles for grouping eBooks.
Thank you in advance for your attention.

mbovenka
02-19-2010, 12:30 PM
Thank you pdurrant!
You have done a great job!
BTW, could you please make a small Python script that can change eBook metadata (e.g. an eBook's title)?
Sometimes we need to change titles for grouping eBooks.
Thank you in advance for your attention.

If you have Calibre installed, 'ebook-meta' will do what you want. If you haven't, you should :-).

pdurrant
02-19-2010, 05:57 PM
Thank you pdurrant!
You have done a great job!
BTW, could you please make a small Python script that can change eBook metadata (e.g. an eBook's title)?
Sometimes we need to change titles for grouping eBooks.
Thank you in advance for your attention.

I use Mobiperl for that (with a little Applescript wrapper).

quocsan
02-19-2010, 10:27 PM
I use Mobiperl for that (with a little Applescript wrapper).

I see. But I meant a title in Unicode (eBook => 'Sách Điện Tử' in Vietnamese).
MobiPerl cannot deal with Unicode titles.
I have changed eBooks' titles with WinHex, but I dislike doing that by hand.
OK, I'll try with ... Google.
Thank you for your attention.

pdurrant
02-20-2010, 04:25 AM
I see. But I meant a title in Unicode (eBook => 'Sách Điện Tử' in Vietnamese).
MobiPerl cannot deal with Unicode titles.
I have changed eBooks' titles with WinHex, but I dislike doing that by hand.
OK, I'll try with ... Google.
Thank you for your attention.

Oh - having double-checked, I find I was wrong in my initial assumption that the title couldn't be in Unicode. If the book is set up as a UTF-8 book, then the title field in the first record is stored and interpreted as UTF-8.

Of course, that means that in Windows Latin-1 Mobipocket books it's interpreted as Windows Latin-1 encoded. So it's not possible to give a unicode name to just any Mobipocket book, but it should be possible on Unicode Mobipocket books.

I have much too much on at the moment, but it's an interesting idea.
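For anyone who wants to experiment, the relevant fields all live in the first record of the Palm database. A read-only sketch that takes the raw file bytes, with offsets taken from the MobileRead wiki's MOBI page (treat them as assumptions and verify against the wiki before writing anything back):

```python
import struct

def read_mobi_title(data):
    # Offset of record 0 is the first entry in the PDB record-info list,
    # which begins at byte 78 of the file.
    rec0, = struct.unpack_from('>L', data, 78)
    # Text encoding code at offset 28 of record 0: 1252 or 65001 (UTF-8).
    enc, = struct.unpack_from('>L', data, rec0 + 28)
    codec = 'utf-8' if enc == 65001 else 'windows-1252'
    # Full name offset and length at offsets 84/88, relative to record 0.
    off, length = struct.unpack_from('>LL', data, rec0 + 84)
    return data[rec0 + off : rec0 + off + length].decode(codec)
```

Writing a new title back is harder if the length changes, since every record offset in the PDB header then has to be adjusted.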

quocsan
02-20-2010, 06:08 AM
Oh - having double-checked, I find I was wrong in my initial assumption that the title couldn't be in Unicode. If the book is set up as a UTF-8 book, then the title field in the first record is stored and interpreted as UTF-8.

Of course, that means that in Windows Latin-1 Mobipocket books it's interpreted as Windows Latin-1 encoded. So it's not possible to give a unicode name to just any Mobipocket book, but it should be possible on Unicode Mobipocket books.

I have much too much on at the moment, but it's an interesting idea.

That's correct, pdurrant!
In fact, whenever I want to change an ebook's title, I have to:
1) Edit the title in Notepad and save it as a text file in UTF-8.
2) View the text file in hexadecimal (with Total Commander) and copy the title as a hex string.
3) Open the eBook in WinHex, locate the original title and replace it with the title from step 2. Then find and update the title length in the eBook.

The above steps let me change a Unicode title if the eBook is encoded in UTF-8, but they don't work with the 1252 code page.

So I hope someone can write a Python script to do those steps.
And why Python? Because then I can run the script on my Nokia phone with Python for S60. I always enjoy eBooks on my Nokia E-series (E71).

The easiest way to get the eBook title we want is to use the desktop version of the Mobipocket Reader. With it, we can easily change the title and have it saved in a .MBP file. But then the eBook has to travel along with its .MBP file. I use this approach for eBooks encoded in the 1252 format.

angelad
02-23-2010, 11:10 AM
How widely used is Python these days?

pdurrant
02-23-2010, 11:29 AM
How widely used is Python these days?

Judging from Google Trends, more popular now than Perl, but still well below C++ and PHP:

http://www.google.co.uk/trends?q=Python%2C+Perl%2C+php%2C+c%2B%2B

Comparing just to Perl, we see that Perl dropped below Python in popularity at the end of 2007.

http://www.google.co.uk/trends?q=Python%2C+Perl%2C

kovidgoyal
02-23-2010, 12:15 PM
I doubt that's very accurate. 'python' is hardly exclusive to the computer language (while 'c++' and 'perl' are).

pdurrant
02-23-2010, 02:56 PM
I doubt that's very accurate. 'python' is hardly exclusive to the computer language (while 'c++' and 'perl' are).

Granted. How about this one?

http://www.google.co.uk/trends?q=perl%2C+python+-snake+-monty+-animal+-reptile+-skin+-pet

which I think removes most other meanings of python.

kovidgoyal
02-23-2010, 03:11 PM
Much better :) I'm happy to see Python usage rising. Another useful comparison is with Ruby.

pdurrant
03-19-2010, 11:31 AM
By taking code from other sources and tweaking it, Version 0.17 (above) now creates an opf file, a folder of images, and an html file that are ready for use with Mobipocket Creator.

Version 0.20, now uploaded above, improves the opf file and the HTML file.

quocsan
03-20-2010, 05:23 AM
Version 0.20, now uploaded above, improves the opf file and the HTML file.

Oh, it's really v0.21!
Thank you.

pdurrant
03-20-2010, 09:55 AM
Oh, it's really v0.21!
Thank you.

So it is - I got it right in one message, and forgot the version number by the time I wrote the second! :-)

pdurrant
04-12-2010, 04:21 AM
[Bug fix so that it works right with Mobipocket files containing more than 9 images]
Now here's version 22.


Another update, to fix a silly bug that got the image links wrong if there were more than nine images.

adamselene
07-25-2010, 11:36 AM
I have to admit I kind of dropped this on the floor after my initial flurry. Thanks for the additional work, pdurrant.

One of the things that has always been a little irritating about Mobipocket (compared with ePub) is that it hard-codes file offsets in links, making it problematic to fix errors in an eBook if you don't have good tools. It's clearly seen as a display format (like PostScript) rather than a source format.

Thankfully, ePub is a lot more sane.

pdurrant
07-26-2010, 09:12 AM
I have to admit I kind of dropped this on the floor after my initial flurry. Thanks for the additional work, pdurrant.

One of the things that has always been a little irritating about Mobipocket (compared with ePub) is that it hard-codes file offsets in links, making it problematic to fix errors in an eBook if you don't have good tools. It's clearly seen as a display format (like PostScript) rather than a source format.

Thankfully, ePub is a lot more sane.

Without your initial work, I couldn't have done anything. Tweaking python — fine, I can do that. Writing this from scratch? No way...

I think that mobiunpack now allows unpacking, editing and re-packing (with KindleGen) without any problems. If anyone does come across any problems doing this, I'd love to hear about them so that they can be fixed.

What's nice with ePub is that (with careful choice of attributes in the opf file), it's possible to create a valid ePub and use the ePub source folder to create a well-formed Mobipocket ebook using KindleGen.

pdurrant
08-29-2010, 05:48 PM
[Enhancement: Now includes Start guide item in the opf]
Now here's version 23.

Just uploaded version 23. Mobipocket books can include a pointer to where the book should open when first opened (often the first page of the first chapter, skipping all the prelims).

Version 23 of MobiUnpack now writes this info out in the .opf file, so that it's preserved if the Mobipocket file is re-built from mobiunpack's output using Kindlegen.
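For anyone inspecting the .opf by hand: the start-reading pointer comes out as a guide reference. A sketch of what the element looks like (the file name and anchor here are placeholders, not mobiunpack's actual output):

```xml
<guide>
    <reference type="text" title="Start" href="book.html#start"/>
</guide>
```

KindleGen treats a type="text" guide reference as the start-reading location when it rebuilds the book.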

quocsan
09-02-2010, 02:12 AM
Good job, thank you pdurrant!

phalla12
09-15-2010, 02:53 PM
Anyone know how to find the app for the Nokia 6688, which uses S60?
Thanks for helping me out.

pdurrant
09-15-2010, 05:42 PM
Anyone know how to find the app for the Nokia 6688, which uses S60?
Thanks for helping me out.

http://www.mobipocket.com/en/DownloadSoft/application.asp?device=SymbianOs

There's one Mobipocket Reader application for all the Symbian versions. I don't think there's a Kindle application for Symbian.

quocsan
09-20-2010, 10:21 PM
Dear pdurrant,
Since our last discussion about changing eBooks' titles, I have finally coded in VC++ a program (less than 50 KB) that can help us change the title in Unicode.
It is based on skindle-06 (part of tools_v1.9.zip, concerning your work).
Now I want to change/add other data such as the author, but I don't know the Perl language.
How can I do that with eBooks with/without EXTH data?
Could you please give some help?

Thank you in advance.

JSWolf
09-20-2010, 10:23 PM
Dear pdurrant,
Since our last discussion about changing eBooks' titles, I have finally coded in VC++ a program (less than 50 KB) that can help us change the title in Unicode.
It is based on skindle-06 (part of tools_v1.9.zip, concerning your work).
Now I want to change/add other data such as the author, but I don't know the Perl language.
How can I do that with eBooks with/without EXTH data?
Could you please give some help?

Thank you in advance.

Basing it on Skindle is a bad idea, as Skindle has bugs that can cause corruption. Base it on something that actually is bug-free. If you keep it based on Skindle, nobody will want to risk using it.

quocsan
09-20-2010, 10:34 PM
Basing it on Skindle is a bad idea, as Skindle has bugs that can cause corruption. Base it on something that actually is bug-free. If you keep it based on Skindle, nobody will want to risk using it.

Oh, I only based it on pieces of code that help with parsing the PalmDoc/MOBI header. Nothing else.
You can try it.

pdurrant
09-21-2010, 04:23 AM
Dear pdurrant,
Since our last discussion about changing eBooks' titles, I have finally coded in VC++ a program (less than 50 KB) that can help us change the title in Unicode.
It is based on skindle-06 (part of tools_v1.9.zip, concerning your work).
Now I want to change/add other data such as the author, but I don't know the Perl language.
How can I do that with eBooks with/without EXTH data?
Could you please give some help?

Thank you in advance.

tools_v1.9.zip isn't mine, nor is skindle.

If you want to write something that changes more of the metadata, you need to know what metadata is stored where and in what formats. Some of that info is in the wiki, http://wiki.mobileread.com/wiki/MOBI and some you'll have to work out yourself, since if it isn't in the wiki, I don't know about it.

If you create a C/C++ based program, I'd recommend releasing it with full source. People are understandably reluctant to run strange code on their machines.

Personally, I'd rather see Python scripts, since they can usually run on Windows/Mac/Linux, and are inherently source code.
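On the EXTH question above: most metadata beyond the title (author, publisher, an updated title) lives in the optional EXTH block inside the first record. A parsing sketch following the layout on the wiki page just mentioned (offsets and record types are taken from that description, so double-check them there):

```python
import struct

def parse_exth(exth):
    # `exth` must start at the b'EXTH' identifier inside record 0.
    # Block layout: 4-byte id, 4-byte header length, 4-byte record count,
    # then records of (type, length, data), where each record's length
    # includes its own 8-byte type/length prefix.
    assert exth[:4] == b'EXTH'
    _, count = struct.unpack_from('>LL', exth, 4)
    records, pos = {}, 12
    for _ in range(count):
        rtype, rlen = struct.unpack_from('>LL', exth, pos)
        records.setdefault(rtype, []).append(exth[pos + 8 : pos + rlen])
        pos += rlen
    return records
```

Per the wiki, record type 100 is the author and 503 an updated title; a book without an EXTH block has the corresponding flag cleared in its MOBI header.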

quocsan
09-21-2010, 05:04 AM
Thank you for your reply, pdurrant.
I mentioned you because I found this paragraph in the file skindle-06\ReadMe:

The DarkReverser - thanks for mobidedrm! The last part of this
is just a C port of mobidedrm.

For changing metadata, I'll try.
When everything is stable, I'll post full source code.

pdurrant
09-21-2010, 05:41 AM
Thank you for your reply, pdurrant.
I mentioned you because I found in file skindle-06\ReadMe the paragraph:

For changing metadata, I'll try.
When everything is stable, I'll post full source code.

I'm not the Dark Reverser, either!

GeoffC
09-21-2010, 05:56 AM
I'm not the Dark Reverser, either!


:snicker:

pdurrant
09-21-2010, 08:15 AM
:snicker:

Of course, the danger in giving explicit denials is that people might start asking lots of questions about who else I might be on-line.

I think in the future I'd best stick to "No comment".

For example, if I was asked whether GeoffC was my sock-puppet, I'll now have to reply "No Comment".
:thumbsup:

GeoffC
09-21-2010, 08:32 AM
Shock horror ! (I'm not?)

:snicker:

quocsan
09-21-2010, 09:02 AM
Of course, the danger in giving explicit denials is that people might start asking lots of questions about who else I might be on-line.

I think in the future I'd best stick to "No comment".

For example, if I was asked whether GeoffC was my sock-puppet, I'll now have to reply "No Comment".
:thumbsup:

Sorry for my post, which caused some confusion.
What I really meant is that in the C source code they mentioned mobi*.py, and I thought they were referring to your work.
I did not mean to suggest anything about who you are.
I think they also use knowledge of MOBI structure from this forum.

I am Vietnamese and I have studied English on my own. And of course, my English is not yet good enough to express what I think.

I joined this forum in the hope of learning from the kind people here.
That's why I sometimes post my stupid questions.

Anyhow, thank you for your attention and your attitude.

pdurrant
09-21-2010, 09:09 AM
Anyhow, thank you for your attention and your attitude.

Do not worry! And please continue to ask questions!

I have done some work on the MobiDeDRM scripts, and I do know a fair amount about the Mobipocket file format through that.

But do check the wiki. And if you find anything there unclear or confusing, do ask. Then we can make it better.

quocsan
09-21-2010, 09:12 AM
Thank you pdurrant,
I will take your advice.

becky330
09-21-2010, 01:51 PM
I am new to this and am trying to understand how to use the mobiunpack python script. I used calibre to convert a book I created from .epub to .mobi. I have installed Python on my Mac and when I run the mobiunpack.py then add my .mobi file in the terminal window, it says "permission denied". Why is it doing this since it is my file? How do I actually use the script???

pdurrant
09-21-2010, 02:12 PM
I am new to this and am trying to understand how to use the mobiunpack python script. I used calibre to convert a book I created from .epub to .mobi. I have installed Python on my Mac and when I run the mobiunpack.py then add my .mobi file in the terminal window, it says "permission denied". Why is it doing this since it is my file? How do I actually use the script???

Exactly why you're getting the error message depends on the exact command, and what permissions you have for the locations you specify.

I'd suggest forgetting the terminal, and use the Applescript application I've just uploaded here: http://www.mobileread.com/forums/showthread.php?p=774836#post774836

Just drag & drop the Mobipocket file onto the script, and it'll decode into a folder in the same location as the file. The first time you run it, it will ask you to find the copy of the MobiUnpack.py script you have on your hard disk.

If you have any further problems, just ask.

GeoffC
09-22-2010, 10:39 AM
:hatsoff: Becky

Welcome to mobileread ....

st_albert
10-12-2010, 07:43 PM
I've finally gotten around to "discovering" mobiunpack, and now I have a few questions.

1) On both Linux and Windows, the output .html file seems to have Apple/Mac style end-of-line characters. Can this be fixed easily? I'm not a Python programmer by any means, but I did try changing things like "f = open(outsrc, 'wb')" to "f = open(outsrc, 'w')" without effect.

2) I'm guessing the .html file produced is not supposed to be valid HTML; e.g. it lacks a <!DOCTYPE..> header, and the <guide> section in the <head> shouldn't be there. The presence of the <mbp:pagebreak /> tags is trivial.

Anyhow, it's a great tool for seeing what is going on inside the mobipocket file! Thanks for your efforts, all of you, whoever you are! :D

KevinH
10-12-2010, 08:08 PM
Hi,

I think the line endings depend on which type of machine was used to generate the original Mobi file. The ones you tested must have been made on a Mac platform. Luckily, HTML itself is immune to line-ending differences. But encodings (specifically UTF-8) may use bytes with the high bit set, so I would keep the 'wb'.

If you are on Linux or Mac OSX, simply use tr to remove or change them:

To replace carriage returns '\r' with new lines '\n':

cat FILE.html | tr '\r' '\n' > temp.html
mv temp.html FILE.html


To simply remove the carriage returns without replacing them

cat FILE.html | tr -d '\r' > temp.html
mv temp.html FILE.html


BTW: There is another tool, mobiml2html.py, that will take the Mobi-specific HTML file created by mobiunpack.py and make it XHTML if you want to archive things or convert them to ePub.

It is available as python source code with a GUI front-end from the same site as a zip archive

http://code.google.com/p/ebook-conversion-tools/downloads/detail?name=ebook-conversion-tools.zip&can=2&q=

or you can checkout the source tree itself
http://code.google.com/p/ebook-conversion-tools/source/checkout

It is also available in the "tools" package mentioned on the ApprenticeAlf site.

Hope this helps,

KevinH

st_albert
10-13-2010, 11:22 AM
KevinH, Thanks for all the info. No, the files were not created on a mac. They were built on Linux and tested on Linux and Windows.

Actually it turns out that they seem to have no EOL characters at all. the "tr" command didn't change anything in the file. I had guessed Mac format because that's what notepad++ guessed.

In the end I used perl to add linebreaks between all tags (e.g. "s/></>\n</g"). That turns out to be overkill, but at least the file is readable and editable.

The clean-up tools you linked to work very well indeed.
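The same substitution is a one-liner in Python too, for anyone without perl to hand (same caveat that it's overkill, but it makes the file editable):

```python
import re

def break_between_tags(html):
    # Put a newline between every pair of adjacent tags, mirroring the
    # perl s/></>\n</g substitution above.
    return re.sub(r'><', '>\n<', html)

print(break_between_tags('<p>a</p><p>b</p>'))
# prints:
# <p>a</p>
# <p>b</p>
```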

adamselene
10-15-2010, 02:07 AM
A few things:

* MobiPocket is an old format, derived from HTML2 with some extensions. In HTML2 times, there was no !DOCTYPE, and in any case there is no need in MobiPocket to differentiate between document languages (because there is only one), so you shouldn't expect it to be there. In fact, quite a bit of what mobigen/kindlegen does is to convert HTML4 and XHTML to HTML2 by rewriting tags and flattening CSS into old-style tags.

* <guide> is one of the extensions. Basically they took an entire chunk of the .opf file and stuck it in the <head> tag so that devices could generate menus to navigate to parts of the document. There are historical reasons for doing it this way, originating with MobiPocket's predecessor formats, which were basically just one big HTML document wrapped in a Palm database file. There are many other ways this could have been done, but creating multiple files/streams within the Palm database would get awkward for several reasons, not least of all because links are all flattened to absolute file positions.

* mobigen/kindlegen specifically removes line breaks to make the file smaller, so you shouldn't expect to see any.

Honestly, MobiPocket is such a crappy format that I would strongly advise avoiding it at all costs, with the sole exception of using it as an output format to display on a Kindle. For all other purposes, you should use ePub. I only wrote the original mobiunpack.py because I tried to decompress the dictionary with other tools, it took more than 30 minutes, and I wanted to demonstrate that it could be done much better (even in Python).

st_albert
10-15-2010, 11:17 AM
A few things:

...

Honestly, MobiPocket is such a crappy format that I would strongly advise avoiding it at all costs, with the sole exception of using it as an output format to display on a Kindle. For all other purposes, you should use ePub. I only wrote the original mobiunpack.py because I tried to decompress the dictionary with other tools, it took more than 30 minutes, and I wanted to demonstrate that it could be done much better (even in Python).

Yes, I have to agree with you there, regarding mobi vs. epub format. Unfortunately, I'm pretty sure (don't have access to actual sales figures) that Kindle is our largest e-book sales outlet. So I'm always interested in learning how to deal with it better.

Thanks for the background information. I find it fascinating. Mobiunpack is a great tool for looking at what's inside the mobi package, and thanks to it I can actually SEE what you're talking about. I've been dabbling in ebook format conversions since Aportis Doc and Peanut Reader on Palm Pilots, but it has only been recently that I've taken a more "professional" interest. So much to learn!

sklamb
11-14-2010, 04:43 PM
Having finally bought my Kindle just as the price of modern digital books went up, I naturally turned to the wonderful world of out-of-copyright material for the bulk of my reading pleasure. Of course the quality of digitizing does vary a lot, and I'm just grateful for all the work that people have done already to make it possible to read books I'd otherwise not be able to get. However, I have a surprising number of (non-DRM) ebooks which need only a small number of errors corrected, and I'm OCD enough to want to do that if I can. I know calibre would solve some of these problems, but for editing an ebook originally generated in PRC this script seems much more suitable. Unfortunately I don't have Python installed on my Windows XP computer and I don't really want to get involved with all the complications that would involve just to do some PRC proofreading....

Is there any possibility that some kind person might convert this script into a Windows executable, as has been done for the mobiperl scripts?

I know it's an imposition and I feel guilty about not doing it for myself, but I'm getting older and doing something like installing Python doesn't seem as much fun as it used to.

sklamb
11-14-2010, 04:51 PM
Sorry...adding this post because I can't figure out how else to subscribe to this thread...had the wrong option set when I posted the first time... darn :newbie !

ATDrake
11-14-2010, 04:56 PM
1) Installing Python on Windows is as easy as double-clicking the installer from ActiveState Python Community Edition (http://www.activestate.com/activepython/downloads). Actually using it is admittedly a bit trickier, but perhaps someone will make a widgetized version.

2) You can subscribe to any thread without posting in it by clicking the Thread Tools button in the bar above the top post and choosing Subscribe.

3) Welcome to MobileRead!

sklamb
11-14-2010, 05:10 PM
Duh...thank you for that, ATDrake. (Especially as I apparently didn't succeed the other way....)

I may just have to grit my teeth and take on Python as well as the prc format (and XML and all the other things I only vaguely sorta know about). Somehow I hadn't expected getting a Kindle to turn me back into any sort of computer geek after decades of just being a user! :)

DiapDealer
11-14-2010, 07:13 PM
Is there any possibility that some kind person might convert this script into a Windows executable, as has been done for the mobiperl scripts?
The Windows program for mobiperl still requires Perl to be installed (it used to, anyway). So even though someone might write a different front-end for MobiUnpack (there's already a Tk front-end)... chances are, it will still require Python.
(Even though a Python to C port of MobiUnpack probably wouldn't be that difficult... there'd then be two separate versions to maintain)

sklamb
11-14-2010, 08:21 PM
Very humbly...what's Tk? I thought what was available was the original script and an applet for the Mac....

DiapDealer
11-14-2010, 08:47 PM
Very humbly...what's Tk?
It's just some standard GUI stuff that comes with almost all versions of Python. The Tools archive (from Alf's blog) has a GUI front-end for MobiUnpack that will work for pretty much any OS (that has Python installed, of course). It allows you to choose the files and output directories with standard file dialogs and familiar buttons and such. You have to install Python, but none of the scripts really require you to get down and dirty with command-line stuff if you don't want to... while still allowing those who actually prefer to get down and dirty to do so. ;)

sklamb
11-14-2010, 09:28 PM
It's just some standard GUI stuff that comes with almost all versions of Python. The Tools archive (from Alf's blog) has a GUI front-end for MobiUnpack that will work for pretty much any OS (that has Python installed, of course).

Oh, dear...Alf's blog? I'm going to need an address, I'm afraid.... :o

...No, never mind, I worked that bit out for myself. Thank you so much for all your help!

discusaigon
12-04-2010, 02:45 PM
Hi,

Please, some help for a newbie :smack:

I have installed Python 2.7.1,
put my PRC file (same as a mobi file, right?) into the same folder as mobiunpack.py, and ran mobiunpack.py,

but nothing happens.

Do I have to use cmd.exe to run this script?

How?

thanks.

KevinH
12-04-2010, 03:29 PM
Hi,

If you are on windows, I assume you have downloaded and installed ActiveState's ActivePython 2.X for 32 bit as was recommended in this very thread.

Next, you should know there is a GUI version of the program (i.e. no command line needed) called MobiUnpack.pyw, which actually runs the mobiunpack.py script for you (it's hidden away in the lib directory beside MobiUnpack.pyw).
To run the gui version, simply double-click on MobiUnpack.pyw


To run the command line version:
Yes, you need to run cmd.exe.

But first thing is to create a new folder to hold the contents of the unpack operation. So create a new folder right beside mobiunpack.py and your .mobi file.

Then run cmd.exe and cd to where mobiunpack.py exists

Then run the following command at the prompt:

python mobiunpack.py YOURMOBIFILEHERE.mobi YOURNEWOUTPUTFOLDER

That should unpack any non-DRM Mobi file into its pieces. If you look in YOURNEWOUTPUTFOLDER afterwards, you will see the .html file, the images/ folder and the .opf file.

If you run into trouble, simply PM KevinH here and I can try to help.

discusaigon
12-04-2010, 05:34 PM
Hi, I tried the .pyw but it did nothing.

I tried the .py in cmd and got an error that the file could not be found:
http://data.imagup.com/5/1106174364.JPG

However, I'm sure there is no mistake in the name of the file.

KevinH
12-04-2010, 06:55 PM
Hi,

What happened exactly (what error message) when you double-clicked on MobiUnpack.pyw? If you have properly installed Python on your machine, double-clicking on the MobiUnpack.pyw file should start up a gui window. To test if that would be the case, right click on the MobiUnpack.pyw file and make sure it is properly associated with the pythonw.exe executable that is part of Python 2.X. You probably have not added the Python executable path to your System PATH environment variable or you are using Python 2.7 from python.org and not the ActiveState version that was recommended.

As for your command line errors, if your file exists, then your issue is one of paths. The mobiunpack.py program looked right beside itself to find the fra-eng.mobi book and could not find it (that is why there is an error message).

To make things a little easier triple check that you have copied mobiunpack.py and your fra-eng.mobi to the exact same folder and that inside that folder (right beside your fra-eng.mobi book and mobiunpack.py), you create a new empty folder called "book".

Then run cmd.exe and cd to the folder where mobiunpack.py, your "book" folder and fra-eng.mobi are.

Then type the following command:

python.exe mobiunpack.py fra-eng.mobi book

Once complete, you will find the pieces inside the "book" folder.
If not, then there is probably something wrong with how you have installed Python.

discusaigon
12-06-2010, 04:22 AM
The problem is that I don't get any error message when I try to use the .pyw.

Nothing happens at all.

I may have a problem with my Python installation.

discusaigon
12-06-2010, 04:53 AM
It is bizarre ...

I tried an old .pyw script I had (ineptpdf.pyw) and it worked well, so Python is properly installed on my computer.

I simplified the name of my file to avoid mistakes and still get the same errors with the script:

http://data.imagup.com/4/1106301467.JPG

As you can see, there is no mistake in the folder or file name.

quocsan
12-06-2010, 05:51 AM
Please change directory to I:\py and then run the script, or put "\py" before the ebook name (such as \py\dico.mobi).
It shows errors because there is no "dico.mobi" in the current directory (I:\).

discusaigon
12-06-2010, 10:52 AM
thanks !

changing directory was the trick!

I learned how to use the cd command,

and it works well now.

I tried with the first script and it was quick.

Now with the second, it's slower, but I understand that might be normal as it has to handle images.

pdurrant
05-18-2011, 10:30 AM
[Enhancement: Now works with some TEXtREAd early Mobipocket files]
[Enhancement: Added character set metadata to HTML file]

See http://www.mobileread.com/forums/showpost.php?p=774836&postcount=5

naisren
05-31-2011, 10:08 PM
[Enhancement: Now works with some TEXtREAd early Mobipocket files]
[Enhancement: Added character set metadata to HTML file]

See http://www.mobileread.com/forums/showpost.php?p=774836&postcount=5

Thanks for your improvement.:thumbsup:
Now the big challenge should be mobi dictionary.

pdurrant
06-01-2011, 02:11 AM
Thanks for your improvement.:thumbsup:
Now the big challenge should be mobi dictionary.

My next big challenge is to handle multiple authors correctly, which requires big internal changes.

But feel free to have a go at handling dictionaries yourself...

siebert
06-27-2011, 06:06 PM
Now the big challenge should be mobi dictionary.

It took me several weeks of reverse engineering and a few days of coding, but here is finally the first mobiunpack version supporting dictionaries!

I've made some shortcuts and omitted features not necessary for the dictionaries I'm interested in (e.g. unicode support, the old deprecated inflection format), so the script might not work for the dictionary you want to unpack, but feel free to improve the code :)

A couple of other fixes and enhancements are also included, most notably some speed optimizations. A huge dictionary is now unpacked in minutes instead of hours by using temporary files.

Have fun!

73437

naisren
06-28-2011, 01:58 AM
It took me several weeks of reverse engineering and a few days of coding, but here is finally the first mobiunpack version supporting dictionaries!

I've made some shortcuts and omitted features not necessary for the dictionaries I'm interested in (e.g. unicode support, the old deprecated inflection format), so the script might not work for the dictionary you want to unpack, but feel free to improve the code :)

A couple of other fixes and enhancements are also included, most notably some speed optimizations. A huge dictionary is now unpacked in minutes instead of hours by using temporary files.

Have fun!

73437

My god, I could not believe it. You have made my dream come true.
Unpacking speed, inflections, unicode support: all of these were a pain before, but no more!

Thanks a lot, :iloveyou:

my test:
mobiunpack.py PocketOxford.mobi
MobiUnpack 0.26
Copyright (c) 2009 Charles M. Hannum <root@ihack.net>
With Images Support and Other Additions by P. Durrant and K. Hendricks
With Dictionary Support and Other Additions by S. Siebert
Unpacking Book...
Mobipocket version 4
Huffdic compression
Unpack raw html
Document contains orthographic index, handle as dictionary
Info: Index doesn't contain entry length tags
Read dictionary index data
Warning: There are unprocessed index bytes left: 08
Warning: There are unprocessed index bytes left: af
Warning: There are unprocessed index bytes left: 01 a1
Warning: There are unprocessed index bytes left: 00 18 ff
Warning: There are unprocessed index bytes left: 75 70
Warning: There are unprocessed index bytes left: 76 04 8e
Warning: There are unprocessed index bytes left: aa 01 c0
Warning: There are unprocessed index bytes left: 28
Warning: There are unprocessed index bytes left: 67 02 77
Warning: There are unprocessed index bytes left: c4 00 d0
Warning: There are unprocessed index bytes left: 6f 75
Warning: There are unprocessed index bytes left: 0a
Decode images
Find link anchors
Insert data into html
Insert hrefs into html
Remove empty anchors from html
Insert image references into html
Write html
Write opf
Completed
The Mobi HTML Markup Language File can be found at: PocketOxford\PocketOxford.html

<mbp:pagebreak></mbp:pagebreak> <a></a><idx:entry>
<idx:orth value="ley [1]">
</idx:entry>
<div bgcolor="#FFFFDD" border="1" bordercolor="#000066"><span color="#000066"> <b>ley</b> [1] </span></div> <div align="left"> <span color="red"><i>noun</i></span> <br/><img src="images/00005.jpg" /> a piece of land temporarily put down to grass, clover, etc. </div> <br/><span color="#000066">ORIGIN</span>: Old English, «fallow»; related to <a href="" filepos="0011367107" ><b><small>LAY</small></b></a> and <a href="" filepos="0011568603" ><b><small>LIE</small></b></a>. <hr color="#000066" width="70%"/> <div align="center"><a onclick="history.back()"><img src="images/00006.jpg" border="0" align="middle"/> Back</a>***<a onclick="index_search()"><img src="images/00004.jpg" align="middle" border="0"/> New Search</a></div> <mbp:pagebreak></mbp:pagebreak> <a></a><idx:entry>
<idx:orth value="ley [2]">
</idx:entry>
<div bgcolor="#FFFFDD" border="1" bordercolor="#000066"><span color="#000066"> <b>ley</b> [2] </span> (also <b>ley line</b>) </div> <div align="left"> <span color="red"><i>noun</i></span> <br/><img src="images/00005.jpg" /> a supposed straight line connecting three or more ancient sites, associated by some with lines of energy and other paranormal phenomena. </div> <br/><span color="#000066">ORIGIN</span>: variant of <a href="" filepos="0011383653" ><b><small>LEA</small></b></a>. <hr color="#000066" width="70%"/> <div align="center"><a onclick="history.back()"><img src="images/00006.jpg" border="0" align="middle"/> Back</a>***<a onclick="index_search()"><img src="images/00004.jpg" align="middle" border="0"/> New Search</a></div>

:2thumbsup

naisren
06-28-2011, 01:59 AM
And its unpacking speed is far faster than before.

siebert
07-18-2011, 05:12 PM
Hi,

I've created a new version 0.27 with the following changes:


NEW: Extract and save source zip files included by kindlegen as kindlegensrc.zip.
FIXED: idx:entry attribute "scriptable" must be present to create entry length tags.
FIXED: Don't save non-image sections as images.


74551

To ease development I've also pushed my git repository for the mobipocket script to github. Feel free to fork it if you want to improve the script:

https://github.com/siebert/mobiunpack

Ciao,
Steffen

KevinH
07-19-2011, 07:26 PM
Hi Steffen,

Okay, here is a slightly revised version of what you did. I must admit my image name replacement is slower than yours, but it is still much faster than the old version. If need be, we can make this code conditional on whether we are processing a dictionary or not, and add back your fixed image file extension version purely for speed.

I called it version v0.28 to differentiate it. If it works okay for you, we can then integrate it into your git repository.

pdurrant
07-20-2011, 01:44 AM
If anyone has a good suggestion for how to fix the problem of loss of multiple metadata entries, I'd love to hear it. (i.e. if there's more than one author listed, we only save and write out one of them.)

siebert
07-20-2011, 05:19 AM
If anyone has a good suggestion for how to fix the problem of loss of multiple metadata entries, I'd love to hear it. (i.e. if there's more than one author listed, we only save and write out one of them.)

I'm not sure how multiple metadata entries are stored in the mobi file, but I would assume that it has just multiple entries with the same id?

Then it should be as easy as storing a list of strings instead of a single string in metadata[name] and for the output just iterate over the list.
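A minimal sketch of this list-based idea (the helper name and sample data are illustrative only, not the script's actual code):

```python
# metadata[name] always holds a list, so duplicate EXTH entries
# (e.g. several authors) are all preserved instead of overwritten.

def add_metadata(metadata, name, value):
    # setdefault creates an empty list the first time a name appears
    metadata.setdefault(name, []).append(value)

metadata = {}
add_metadata(metadata, 'Creator', 'First Author')
add_metadata(metadata, 'Creator', 'Second Author')
add_metadata(metadata, 'Title', 'Some Book')

# For the output, just iterate over each list.
opf_lines = []
for name, values in metadata.items():
    for value in values:
        opf_lines.append('<dc:%s>%s</dc:%s>' % (name, value, name))
```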

Ciao,
Steffen

pdurrant
07-20-2011, 05:30 AM
I'm not sure how multiple metadata entries are stored in the mobi file, but I would assume that it has just multiple entries with the same id?

Then it should be as easy as storing a list of strings instead of a single string in metadata[name] and for the output just iterate over the list.

Ciao,
Steffen

Oh, of course! Perhaps we could just store all metadata as a list, rather than a single object in the map.

Anyone better than me at Python like to give it a go?

siebert
07-20-2011, 05:33 AM
Hi Steffen,
I called it version v0.28 to differentiate it. If it works okay for you, we can then integrate it into your git repository

I'm not so happy that you disabled skipping over sections which have been recognized as non-image sections.

As far as I know these non-image sections appear only at the end after all images, never between images, so it should not confuse the image name index if we just skip over them.

In that case we should also handle image sections where we can't determine the type as an error and print some message to stdout.

I think it might be even faster to just search all <img> tags in the source and then merge the source with the replaced <img> tags like we do the merge with in the "apply dictionary metadata and anchors" section.

Of course we have to skip over the original <img> tags in this merge, but as the match objects contain the position information where the found string was located, it should be very easy.

Since we have only the list of <img> tags in memory instead of all the split source file data, I expect that solution to be faster than your implementation.

Ciao,
Steffen

pdurrant
07-20-2011, 05:47 AM
As far as I know these non-image sections appear only at the end after all images, never between images, so it should not confuse the image name index if we just skip over them.


Some Mobipocket files that have been edited with the Perl tools may have images after the non-image bits at the end.

KevinH
07-20-2011, 10:11 AM
Hi Steffen,

> I'm not so happy that you disabled skipping over sections which have been recognized as non-image sections.

I am not sure what your concern here is. A file is only created if it is a known image type. The remaining code in the loop simply invokes imghdr, which just looks at selected byte strings near the front of the data string (very much like what you are doing, so it should be very fast) and then appends a placeholder to a list. Nothing here will impact processing time much, if at all, versus your version.
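A small sketch of the kind of signature check imghdr performs (the function, placeholder scheme and sample sections here are illustrative, not the script's actual code):

```python
# Inspect a few leading bytes of each section to decide whether it is
# an image and of what type, mirroring what imghdr does internally.
# Sections matching no signature get no file written, only a None
# placeholder so the image-name index stays aligned.

def sniff_image_type(data):
    if data[6:10] in (b'JFIF', b'Exif'):
        return 'jpeg'
    if data[:6] in (b'GIF87a', b'GIF89a'):
        return 'gif'
    if data[:8] == b'\x89PNG\r\n\x1a\n':
        return 'png'
    return None   # not a recognised image section

# Made-up sample sections: one tiny GIF header, one non-image blob.
sections = [b'GIF89a' + b'\x00' * 10, b'not an image at all']
names = []
for i, data in enumerate(sections):
    imgtype = sniff_image_type(data)
    if imgtype is not None:
        names.append('image%05d.%s' % (i + 1, imgtype))
    else:
        names.append(None)   # placeholder keeps the index aligned
```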

> As far as I know these non-image sections appear only at the end after all images, never between images, so it should not confuse the image name index if we just skip over them.

As Paul indicated, this may not be the case so this version is safer.

> In that case we should also handle image sections where we can't determine the type as an error and print some message to stdout.

Feel free to add that if you like. My main concern was properly adding the image filename extensions so that later post-processing to xhtml works properly (i.e. for those not using kindlegen or Mobipocket Creator).

> I think it might be even faster to just search all <img> tags in the source and then merge the source with the replaced <img> tags like we do the merge with in the "apply dictionary metadata and anchors" section.

That is similar to what is happening here. Regular expressions are used to split which breaks up the string into segments where all of the odd pieces 1,3,5,7 are the img tags and the even pieces are everything else before or after.

Then when we do replacements, all we are doing is dropping an element from the list and replacing it, and we only process the img tags themselves. So there is no need to create and delete 26MB to 100MB copies all the time. At the end you simply put it back together using join.
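A minimal sketch of this split-and-rejoin approach (the tag format and sample data are illustrative, not the script's actual markup):

```python
import re

# Split the source on <img ...> tags with a capturing group: the tags
# land at the odd indices of the result list, everything else at the
# even ones. Only the small tag strings are rewritten; the large text
# pieces are never copied, and ''.join() reassembles the result once.

rawtext = 'before <img recindex="00001"> middle <img recindex="00002"> after'
pieces = re.split(r'(<img[^>]*>)', rawtext)

for i in range(1, len(pieces), 2):          # odd pieces are the img tags
    m = re.search(r'recindex="(\d+)"', pieces[i])
    if m:
        idx = int(m.group(1))
        pieces[i] = '<img src="images/image%05d.jpg">' % idx

result = ''.join(pieces)
```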

> Of course we have to skip over the original <img> tags in this merge, but as the match objects contain the position information where the found string was located, it should be very easy.

> Since we have only the list of <img> tags in memory instead of all the splitted source file data, I expect that solution to be faster than your implementation.

Makes sense. Please feel free to make any changes you like. I only have one old dictionary to test with and so can't really fine tune it much. If your way is faster and keeps the proper image file name extensions, I am all for it.

Once we have that stable, I am going to run timing comparisons of FastConcat with hugeFile set against FastConcat without it, to see how much of a penalty it is to do everything in memory with lists of string segments rather than one huge string constantly being appended to.

Take care,

Kevin

KevinH
07-20-2011, 12:50 PM
Hi,

For fun ... I ran mobiunpack_v0.28.py on my one dictionary (file size is 27,585,020 bytes) and timed it (clock time from date in a shell script both before and after mobiunpack), then hard-coded hugeFile to False and re-ran.

With hugeFile set as True: (uses file IO to temporary files)

Run  Start     Stop      Elapsed Time
1    12:25:21  12:26:39  1 minute 18 seconds
2    12:26:45  12:28:02  1 minute 17 seconds

With hugeFile set as False (uses lists of strings and "".join(strlist)):

Run  Start     Stop      Elapsed Time
1    12:29:18  12:30:32  1 minute 14 seconds
2    12:30:38  12:31:53  1 minute 15 seconds

It was as I expected. There is no "memory issue" when using lists of strings.
In most OSes, file IO has overhead and typically writes data to large memory buffers (buffered IO), not actually flushing them to disk unless pushed or until closed. So any slight savings in memory use is offset by the disk overhead.

So it appears there is no real advantage for using temporary file IO over using lists of strings and a final join.
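The list-plus-join pattern being compared can be sketched as follows (sample data is illustrative):

```python
# Accumulate output segments in a list and join once at the end.
# Appending to a list is amortised O(1); by contrast, repeated
# "big += piece" string concatenation copies the whole buffer on
# every append, which is what the hugeFile/FastConcat code avoided.

pieces = []
for i in range(1000):
    pieces.append('segment %d\n' % i)

text = ''.join(pieces)   # a single final copy instead of 1000
```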

Please try the same thing with your dictionaries and see if you get the same results. If so, we can probably remove the file IO approach, remove FastConcat, and just go with the string list approach.

Thanks,

Kevin

siebert
07-20-2011, 03:45 PM
Please try the same thing with your dictionaries and see if you get the same results. If so, we can probably remove the file IO approach, remove FastConcat, and just go with the string list approach.


I will. I've also implemented my proposed handling of the image tags, but I'm no longer sure that it will be faster than your implementation; I'll do some measurements on this variant, too.

Ciao,
Steffen

siebert
07-20-2011, 03:50 PM
I am not sure what your concern here is. A file is only created if it is a known image type. The remaining code in the loop simply invokes imghdr, which just looks at selected byte strings near the front of the data string (very much like what you are doing, so it should be very fast) and then appends a placeholder to a list. Nothing here will impact processing time much, if at all, versus your version.


My concern wasn't speed, but that it's not very elegant to search for image headers in sections already known not to contain images.


As Paul indicated, this may not be the case so this version is safer.


I would consider such files to be broken, especially if the additional image sections occur after the EOF section. But if such files exist, we should be able to decode them, of course.

So I've changed the code to skip non-image sections again but still work for such broken files.

Ciao,
Steffen

pdurrant
07-20-2011, 03:54 PM
So I've changed the code to skip non-image sections again but still work for such broken files.

That does seem more elegant. I'm really pleased to see work being done on this useful script again.

siebert
07-20-2011, 04:05 PM
Oh, of course! Perhaps we could just store all metadata as a list, rather than a single object in the map.

Anyone better than me at Python like to give it a go?

I've started to implement storing metadata as lists, and while it's not pretty it seems to work, though I haven't yet tested a file which actually contains duplicate metadata tags.

But I've noticed that several tags are currently not handled by mobiunpack (e.g. 202-209, 300).

I would like to get some input on how mobiunpack should handle them. I doubt that mobigen/kindlegen supports all these tags (if any), but there are already tags that are exported to the opf file even though they are ignored by mobigen/kindlegen (e.g. the ASIN).

Are there other tools which actually support these tags and use the values or are they just for information?

In the latter case I would like to mark them as such (for example by putting them into a comment section) to make clear that their value won't affect the generated mobi.

Another solution would be to define a new list of ignored tags, so it's clear that we are aware of those tags but deliberately don't include them in the opf file.

Ciao,
Steffen

pdurrant
07-20-2011, 04:16 PM
But I've noticed that several tags are currently not handled by mobiunpack (e.g. 202-209, 300).

I would like to get some input on how mobiunpack should handle them. I doubt that mobigen/kindlegen supports all these tags (if any), but there are already tags that are exported to the opf file even though they are ignored by mobigen/kindlegen (e.g. the ASIN).

Are there other tools which actually support these tags and use the values or are they just for information?

In the latter case I would like to mark them as such (for example by putting them into a comment section) to make clear that their value won't affect the generated mobi.

Another solution would be to define a new list of ignored tags, so it's clear that we are aware of those tags but deliberately don't include them in the opf file.


I think that idea of exporting all the information in the EXTH tags, even if only as comments, is a very good one.

We could have a list of tags for export as comments, where we have some idea of what the tags mean, and then also do a simple dump into comments of any completely unknown tags.

The plan (if it can be called that) behind the opf generation was to add as much info from the EXTH as possible that was valid in an OPF file, whether or not KindleGen would use it.

I'm looking forward to seeing what you come up with. I do have some test files with multiple authors.

KevinH
07-22-2011, 12:12 PM
Hi,

Instead of making all metadata elements lists, which is a bit messy code-wise (especially for something that is not a common event), it may be easier and cleaner to check whether a value with that key already exists, and if so append a string delimiter (which can be any unique identifier string we want) before adding the new data to the end. That way, whether there is one author or many, all the data is stored in a simple string in the metadata dictionary.

This is clean and easy to do using .get(key, "") to return either the current value for that key or the empty string; if the result is not empty, you append the string delimiter, then you append the new value for the key. It also works with encoding to utf-8 quite easily.

When we go to write it out, simply split on the string delimiter and write out each piece. If there is no delimiter present in the string, only one value will be written out.
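A minimal sketch of this delimiter scheme (the delimiter, helper name and sample values are arbitrary choices for illustration):

```python
# Duplicate metadata values share one string, separated by a marker
# that is unlikely to occur in real data. At write-out time the
# string is split back into its individual values.

DELIM = ' |&| '

def add_value(metadata, key, value):
    existing = metadata.get(key, '')
    if existing:
        existing += DELIM
    metadata[key] = existing + value

metadata = {}
add_value(metadata, 'Creator', 'First Author')
add_value(metadata, 'Creator', 'Second Author')

# Splitting a value with no delimiter simply yields a one-element list.
authors = metadata['Creator'].split(DELIM)
```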

As for keeping all values for metadata, I am for that, but we need to be careful in that some mobis will have binary data in some metadata values (left over from keys previously used for DRM, etc.) and we can run into byte values that do not exist in utf-8. So we may want to hex or base64 encode these values if we want to preserve them in some way.

My two cents,

Kevin

siebert
07-22-2011, 12:56 PM
Hi,

Instead of making all metadata elements lists which is a bit messy code wise (especially for something that is not a common event) it may be easier and cleaner to check if a value with that key already exists and if so appending a string delimiter (can be any unique identifier string we want - '"&#$%" or whatever) and then add the new data to the end.


Sorry, but using strings with delimiters would be a very unpythonic solution.

One might implement a solution that uses strings for single values and a list of strings only if multiple values exist, using type() to distinguish the two cases, but I've already refactored my all-list solution to be usable.

I'm almost done (the temporary file code was also removed); do you want me to just publish it when it's finished, or do you want to take a look first (let me know your email address then)?


As for keeping all values for metadata, I am for that but we need to be careful in that some mobs will have binary data in some metadata values (left over from keys previously used for DRM, etc) and we can run into byte values that do not exist in utf-8. So we may want to hex or base64 encode these values if you want to maintain them in some way.


I decided to have a list of types to ignore (so far 209, 300 and 403), as the content is unprintable and of very little interest. The values of all other supported tags are supposed to be printable.

By having a list for them, the code can now warn about any unknown tag it encounters.

Ciao,
Steffen

pdurrant
07-22-2011, 01:01 PM
Sorry, but using strings with delimiters would be a very unpythonic solution.


I agree with this. I hate the idea of concatenating arbitrary strings with some fixed delimiter.

I would have thought that making them all lists was actually quite a simple and clean solution, even if most of them end up as a list of just one item. Surely python handles single item lists fairly efficiently?

pdurrant
07-22-2011, 01:02 PM
I'm almost done (the temporary file code was also removed); do you want me to just publish it when it's finished, or do you want to take a look first (let me know your email address then)?


I think you might as well just post it here once you're done.

siebert
07-22-2011, 01:12 PM
I'm thinking about redirecting the "Info:", "Warning:" and "Error:" output lines to stderr instead of stdout. Any thoughts on that?

Ciao,
Steffen

pdurrant
07-22-2011, 03:13 PM
I'm thinking about redirecting the "Info:", "Warning:" and "Error:" output lines to stderr instead of stdout. Any thoughts on that?

Ciao,
Steffen

I don't suppose it would do any harm, so if it would make your use of it easier, I don't see why not.

[EDIT: I see KevinH does have a good reason for not redirecting these.]

KevinH
07-22-2011, 03:21 PM
Hi,

I'm thinking about redirecting the "Info:", "Warning:" and "Error:" output lines to stderr instead of stdout. Any thoughts on that?

Ciao,
Steffen

Please leave these as stdout. If you look, stdout has been modified to run completely unbuffered (to flush after every write).

This is done to support a front-end gui program that invokes this python script via the Python subprocess library, so that any progress messages or errors/warnings are shown in the gui log window in real time.
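A sketch of how writes can be forced to flush immediately, so a gui front end reading the output through a pipe sees each progress line as it happens (the wrapper class is illustrative; the actual script may implement this differently):

```python
import io

# Wrap a stream so that every write is followed by a flush, defeating
# pipe buffering between the script and the gui's log window.

class Unbuffered:
    def __init__(self, stream):
        self.stream = stream
    def write(self, data):
        self.stream.write(data)
        self.stream.flush()      # flush after every write
    def __getattr__(self, attr):
        # delegate everything else to the wrapped stream
        return getattr(self.stream, attr)

# Demonstrated on an in-memory buffer; the real script would instead
# do something like: sys.stdout = Unbuffered(sys.stdout)
buf = io.StringIO()
out = Unbuffered(buf)
out.write('Unpacking Book...\n')
```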

For people with Python properly installed, they can double-click on the MobiUnpack.pyw program and use simple gui elements to run the program. This is much easier for some people.

Please see MobiUnpack.pyw from the google-code site I pointed you at earlier or I can post it for you if you like.

Thanks,

Kevin

siebert
07-22-2011, 04:22 PM
Hi,
Please leave these as stdout. If you look, stdout has been modified to run completely unbuffered (to flush after every write).


Shouldn't it be possible to make stderr unbuffered, too?

By separating stdout and stderr the gui could ignore the stdout messages completely (or use them to show a progress bar instead) and only messages on stderr would be displayed in the log window, as they indicate something unusual the user should be aware of.

Ciao,
Steffen

KevinH
07-22-2011, 04:40 PM
Hi,

On many systems stderr is always automatically unbuffered io. This is true for Mac OS X and Linux but I am not sure about Windows.

But you typically read them from separate pipes and can get "out of order" conditions that can be confusing to the user (i.e. should you read and print stderr or stdout first when both pipes have data waiting?). There might be a way to pipe them both to the same output when invoking the script via subprocess, but that is typically OS dependent.

It is simply easier to keep everything going to stdout so that progress and any error messages can be seen in the log window, which makes bug reports nice and easy for users of the gui program.

Thanks,

Kevin

ps. Here is the full tool set with the gui in case you want to modify or change it. It is very simplistic and meant to be basically independent of the underlying python script it invokes to the extent possible.

If python is in your path, you should be able to unzip and then double-click on MobiUnpack.pyw or MobiM22HTML.pyw to run the scripts (stored in the lib folder).

siebert
07-23-2011, 10:25 AM
Hi,

here is mobiunpack.py version 0.29 with the following changes:


NEW: Handle multiple metadata entries of the same type.
NEW: Additional metadata tags added.
CHANGED: Special handling of huge files is no longer required.
CHANGED: Reenable skipping of known non-image sections.
FIXED: Convert non-ascii filenames to utf-8 in opf.
FIXED: StartOffset shouldn't create a visible index entry.


74742

Ciao,
Steffen

pdurrant
07-23-2011, 10:58 AM
Hi,

here is mobiunpack.py version 0.29 with the following changes:


NEW: Handle multiple metadata entries of the same type.
NEW: Additional metadata tags added.
CHANGED: Special handling of huge files is no longer required.
CHANGED: Reenable skipping of known non-image sections.
FIXED: Convert non-ascii filenames to utf-8 in opf.
FIXED: StartOffset shouldn't create a visible index entry.


74742

Ciao,
Steffen

Thanks. It looks good. I didn't notice any special handling of the author tag, but I'll do some tests with the latest KindleGen. Perhaps it works properly with multiple author tags now.

KevinH
07-23-2011, 11:04 AM
Hi Steffen,

I didn't notice metadata 404 anywhere in the code; I think it is the Text To Speech (TTS) flag. We might want to add it so that people can see whether TTS is enabled or not.

Thanks for your great work on this.

Kevin

siebert
07-23-2011, 11:07 AM
Thanks. It looks good. I didn't notice any special handling of the author tag, but I'll do some tests with the latest KindleGen. Perhaps it works properly with multiple author tags now.

mobigen/kindlegen accepts multiple entries for some tags and ignores all but the first entry for others. The output of mobigen/kindlegen shows what metadata will be used for the generated book.

And multiple authors worked for me.

Ciao,
Steffen

pdurrant
07-25-2011, 02:38 PM
here is mobiunpack.py version 0.29 with the following changes

Now copied back to my post on the first page of the thread: http://www.mobileread.com/forums/showpost.php?p=774836&postcount=5

naisren
07-26-2011, 04:19 AM
I tried every version one by one, and each made me happier. 0.29 is better in many ways.
To help improve it, I'll report only the bad results here.
Word inflection works, but the javascript functions do not; errors pop up after converting back to mobi using mobigen.
Most images work, but some warnings about invalid images come out when running mobigen on the untouched opf produced by 0.29.

siebert
07-26-2011, 04:38 AM
Word inflection works, but the javascript functions do not; errors pop up after converting back to mobi using mobigen.


AFAIK only the Mobipocket Reader supports javascript, while the Kindle Devices and Apps don't, so I have no use for the javascript features in Mobipocket.

If someone finds out what needs to be changed I might fix it.


Most images work, but some warnings about invalid images come out when running mobigen on the untouched opf produced by 0.29.

Warnings about invalid content or insufficient sizes? And is the exported image broken if you open it with an image viewer? Can you provide me with the original mobi file for testing?

Ciao,
Steffen

siebert
07-26-2011, 04:39 AM
Now copied back to my post on the first page of the thread: http://www.mobileread.com/forums/showpost.php?p=774836&postcount=5

Great. Did you have success with the multiple authors test?

Ciao,
Steffen

naisren
07-26-2011, 11:06 AM
I sent you a message; thanks for your attention.
My steps:
1. Unpack the dictionary mobi.
2. Use the new opf to repack it without editing anything, just to prove everything round-trips. During this, warnings about invalid images came out; I checked the so-called invalid images and tried to open them with a picture tool, and failed. Many images are OK, with no warnings.
3. After the new mobi was generated, I opened it using the PC mobi reader, and many functions refused to work.

siebert
07-26-2011, 04:41 PM
1. Unpack the dictionary mobi.
2. Use the new opf to repack it without editing anything, just to prove everything round-trips. During this, warnings about invalid images came out; I checked the so-called invalid images and tried to open them with a picture tool, and failed. Many images are OK, with no warnings.
3. After the new mobi was generated, I opened it using the PC mobi reader, and many functions refused to work.

I've tried the same with the file you've sent, but my experience is quite different. Unpacking with mobiunpack and repacking with mobigen works fine; no complaints about images from either program, besides the common warning from mobigen that the cover image is too small.

The repacked mobi works fine in the kindle app (including images), but without the javascript functions as javascript is unsupported by the kindle app.

The mobipocket reader crashes when I try to open the file, probably due to the broken javascript (the javascript tries to open an index via name, but index names aren't yet supported by mobiunpack).

I don't know what went wrong for you, but at least the image handling works fine for this file.

Here is the log (some strings were replaced by XXX):


python mobiunpack.py XXX.mobi test
MobiUnpack 0.29
Copyright (c) 2009 Charles M. Hannum <root@ihack.net>
With Images Support and Other Additions by P. Durrant and K. Hendricks
With Dictionary Support and Other Additions by S. Siebert
Unpacking Book...
Mobipocket version 6
Warning: Unknown metadata with id 404 found
Huffdic compression
Unpack raw html
Info: Document contains orthographic index, handle as dictionary
Info: Index doesn't contain entry length tags
Read dictionary index data
Warning: There are unprocessed index bytes left: XX XX
[...]
Warning: There are unprocessed index bytes left: XX XX
Decode images
Find link anchors
Insert data into html
Insert hrefs into html
Remove empty anchors from html
Insert image references into html
Write html
Write opf
Completed
The Mobi HTML Markup Language File can be found at: test\XXX.html

C:\mobidict>cd test

C:\mobidict\test>mobigen -c2 XXX.opf

*****************************************
* Mobipocket mobigen.exe V6.2 build 43 *
* A command line e-book compiler *
* Copyright Mobipocket.com 2003-2008 *
*****************************************

opt compression: Mobipocket huffdic compression
opt version: try to minimize (default)
Info(prcgen): Added metadata dc:Title "XXX"
Info(prcgen): Added metadata dc:Date "XXX"
Info(prcgen): Added metadata dc:Creator "XXX"
Info(prcgen): Added metadata dc:Publisher "XXX"
Info(prcgen): Added metadata dc:Subject "Dictionary"
Info(prcgen): Added metadata Short dic label "XXX"
Warning(prcgen): Guide title is empty. Item is ignored
Info(prcgen): Parsing files 0000001
Info(prcgen): Resolving hyperlinks
Info(prcgen): Resolving start reading location
Warning(prcgen): The start reading location could not be resolved.
Warning(prcgen): Cover is too small : C:\mobidict\test\images\image00XXX.jpeg
Info(prcgen/inflections): Number of new <idx:infl> inflection rules: 0000XXX
Info(prcgen/inflections): Of which rules used only once or twice: 0000XXX
Info(prcgen/inflections): Number of inflection rule groups: 0000XXX
Info(prcgen): Computing UNICODE ranges used in the book
Info(prcgen): Found UNICODE range: Basic Latin [20..7E]
Info(prcgen): Found UNICODE range: Latin-1 Supplement [A0..FF]
Info(prcgen/compress): Compiling source text for compression (4096 passes max).
Pass 0000001
Info(prcgen/compress): Compiling source text for compression (4096 passes max).
Pass 0000002
Info(prcgen/compress): Compiling source text for compression (4096 passes max).
Pass 0000004
Info(prcgen/compress): Compiling source text for compression (4096 passes max).
Pass 0000008
Info(prcgen/compress): Compiling source text for compression (4096 passes max).
Pass 0000016
Info(prcgen/compress): Compiling source text for compression (4096 passes max).
Pass 0000032
Info(prcgen/compress): Compiling source text for compression (4096 passes max).
Pass 0000064
Info(prcgen/compress): Compiling source text for compression (4096 passes max).
Pass 0000128
Info(prcgen/compress): Compiling source text for compression (4096 passes max).
Pass 0000256
Info(prcgen/compress): Compiling source text for compression (4096 passes max).
Pass 0000512
Info(prcgen/compress): Compiling source text for compression (4096 passes max).
Pass 0001024
Info(prcgen/compress): Compiling source text for compression (4096 passes max).
Pass 0002048
Info(prcgen/compress): Compiling source text for compression (4096 passes max).
Pass 0002174
Info(prcgen/compress): Compression pass 0000001
Info(prcgen/compress): Compression pass 0000002
Info(prcgen/compress): Compression pass 0000003
Info(prcgen/compress): Text compressed to (in % of original size): 010.18%
Info(prcgen/compress): Compression dictionary statistics: 001974180 bytes 000183450 entries
Info(prcgen/compress): Compression pass 0000004
Info(prcgen/compress): Text compressed to (in % of original size): 009.54%
Info(prcgen/compress): Compression dictionary statistics: 001402428 bytes 000128447 entries
Info(prcgen/compress): Compression pass 0000005
Info(prcgen/compress): Text compressed to (in % of original size): 009.32%
Info(prcgen/compress): Compression dictionary statistics: 001118938 bytes 000101641 entries
Info(prcgen/compress): Compression pass 0000006
Info(prcgen/compress): Text compressed to (in % of original size): 009.12%
Info(prcgen/compress): Compression dictionary statistics: 000894372 bytes 000081123 entries
Info(prcgen/compress): Compression pass 0000007
Info(prcgen/compress): Text compressed to (in % of original size): 009.00%
Info(prcgen/compress): Compression dictionary statistics: 000721806 bytes 000064837 entries
Info(prcgen/compress): Compression pass 0000008
Info(prcgen/compress): Text compressed to (in % of original size): 008.93%
Info(prcgen/compress): Compression dictionary statistics: 000586074 bytes 000051866 entries
Info(prcgen/compress): Compression pass 0000009
Info(prcgen/compress): Text compressed to (in % of original size): 008.90%
Info(prcgen/compress): Compression dictionary statistics: 000475936 bytes 000041489 entries
Info(prcgen/compress): Compression pass 0000010
Info(prcgen/compress): Text compressed to (in % of original size): 008.90%
Info(prcgen/compress): Compression dictionary statistics: 000388794 bytes 000033203 entries
Info(prcgen/compress): Compression pass 0000011
Info(prcgen/compress): Text compressed to (in % of original size): 008.91%
Info(prcgen/compress): Compression dictionary statistics: 000314792 bytes 000026574 entries
Info(prcgen/compress): Advanced compression successful (decoded and verified).
Info(prcgen): Final stats - text compressed to (in % of original size): 008.91%

Info(prcgen): The document identifier is: "XXX"
Info(prcgen): The file format version is V6
Info(prcgen): Saving MOBI file
Info(prcgen): MOBI File generated with WARNINGS!


Ciao,
Steffen

naisren
07-26-2011, 11:39 PM
I guess I used an old version of mobigen, which is similar to the current Mobipocket publisher version. The following version does not reproduce the warnings about the images.

***********************************************
* Amazon.com kindlegen(Windows) V1.0 build 85 *
* A command line e-book compiler *
* Copyright Amazon.com 2009 *
***********************************************

Thanks for your quick test and your understanding (XXX replacement).

As for the JavaScript part, I have always thought it is an excellent feature of this reader software, one that makes the Mobipocket reader more attractive. I checked many official Mobipocket dictionaries, old and new, and almost every one of them uses JavaScript.

JavaScript in Mobipocket isn't perfect yet, but it is a good option when a professional ebook, such as a medical one, needs simple table-based lookups. Sooner or later Kindle will feel obliged to bring JavaScript to its readers on the various platforms.

siebert
07-27-2011, 06:47 AM
As for the JavaScript part, I have always thought it is an excellent feature of this reader software, one that makes the Mobipocket reader more attractive. I checked many official Mobipocket dictionaries, old and new, and almost every one of them uses JavaScript.

AFAIK a dictionary has to use JavaScript to implement the search form to locate a specific entry.


JavaScript in Mobipocket isn't perfect yet, but it is a good option when a professional ebook, such as a medical one, needs simple table-based lookups. Sooner or later Kindle will feel obliged to bring JavaScript to its readers on the various platforms.

I doubt that Amazon will port JavaScript support to the Kindle and the Kindle apps, as it's unnecessary for dictionary lookup within a book, and apart from dictionaries there seem to be very few books using JavaScript.

I'm just wondering, if you want to use the dictionary with Mobipocket Reader, what is your reason to unpack and repack it?

I created the dictionary support for mobiunpack just because the dictionary I want to use in the Kindle app has very wasteful formatting, which means that in most cases the popup window doesn't contain the relevant information and I have to switch to the full dictionary display, which is rather inconvenient.

So I unpacked the dictionary, removed the javascript stuff and reformatted it and now most of the time the popup window content is sufficient.

Ciao,
Steffen

naisren
07-27-2011, 10:06 AM
AFAIK a dictionary has to use JavaScript to implement the search form to locate a specific entry.



I doubt that Amazon will port JavaScript support to the Kindle and the Kindle apps, as it's unnecessary for dictionary lookup within a book, and apart from dictionaries there seem to be very few books using JavaScript.

I'm just wondering, if you want to use the dictionary with Mobipocket Reader, what is your reason to unpack and repack it?

I created the dictionary support for mobiunpack just because the dictionary I want to use in the Kindle app has very wasteful formatting, which means that in most cases the popup window doesn't contain the relevant information and I have to switch to the full dictionary display, which is rather inconvenient.
Steffen

Thanks for your views; I'll use a picture to pull my thoughts together and answer your questions here. I use a Windows Mobile phone a lot, and I have tried the Kindle applications on different platforms, such as the Kindle 3, Android and iPad; I think Mobipocket on Windows Mobile suits me best. The screenshot is from my TP2.
Anyway, I am satisfied with the current decoder tool. Thanks to everyone for developing this beautiful tool.

karunaji
08-05-2011, 12:28 PM
Thank you for mobi unpacker. That is exactly what I need as I wanted to fix readability for my dictionary.

Unfortunately I got the message "Error: Dictionary contains multiple inflection index sections, which is not yet supported", so the recreated dictionary file does not contain inflections. Can it be implemented somehow?

siebert
08-05-2011, 01:49 PM
Unfortunately I got the message "Error: Dictionary contains multiple inflection index sections, which is not yet supported", so the recreated dictionary file does not contain inflections. Can it be implemented somehow?

Of course it can, but I was just too lazy, as the dictionary I was interested in didn't need it.

Ciao,
Steffen

pdurrant
08-05-2011, 01:52 PM
Thank you for mobi unpacker. That is exactly what I need as I wanted to fix readability for my dictionary.

Unfortunately I got the message "Error: Dictionary contains multiple inflection index sections, which is not yet supported", so the recreated dictionary file does not contain inflections. Can it be implemented somehow?

I have no interest in Mobipocket format dictionaries, so I won't be trying. Siebert is the one who reverse engineered the dictionary support. Although he claims not to be interested, you could try persuasion, or ask for help in implementing it yourself.

KevinH
08-05-2011, 03:34 PM
Hi,

FYI, I have attached a mobiunpack_v0.30.py in case anyone wants it. It has only slight changes from the official 0.29 version, all related to metadata.

Changes include:

Add <meta /> tags for:

- Cover ThumbNail Image
- Text to Speech Disabled Flag
- Font Signature encoded as hex
- Tamper Proof Keys encoded as hex
- All unknown metadata keys as hex strings

This allows someone to completely recreate any EXTH header region if they so needed and can help with further debugging/identification of unknown metadata keys.
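Round-tripping a binary EXTH value through a hex string is straightforward. A minimal sketch (these function names are made up for illustration; this is not the script's actual code):

```python
import binascii

def exth_value_to_hex(data):
    """Encode a raw EXTH record value as a hex string for an OPF <meta /> tag."""
    return binascii.hexlify(data).decode('ascii')

def hex_to_exth_value(text):
    """Decode the hex string back into the original bytes."""
    return binascii.unhexlify(text)

raw = b'\x00\x01\xff'
encoded = exth_value_to_hex(raw)   # '0001ff'
assert hex_to_exth_value(encoded) == raw
```

Since hex is lossless, any unknown metadata record can be reconstructed byte-for-byte later.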

karunaji
08-06-2011, 02:18 AM
Of course it can, but I was just too lazy, as the dictionary I was interested in didn't need it.

Ciao,
Steffen

Hi Steffen,

How hard would it be to figure this out? I tried to look at the format description, but I probably won't have that much time. I had the impression that it is not fully described.

I want it because I have bought the PONS DE>EN dictionary and I use it to practice reading German texts. This dictionary provides word pronunciation in square brackets [], but in the Kindle pop-up window it appears empty. I have to press ENTER, which is rather inconvenient. I decoded the dictionary and I can see that the pronunciation is composed of small images. But the Kindle font actually contains extended IPA characters, so it should be trivial to replace the images with characters and repack the dictionary.
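That replacement could be sketched with a regex over the unpacked HTML. The filename-to-IPA mapping below is entirely hypothetical; the real image names and characters would have to be read off the decoded dictionary first:

```python
import re

# Hypothetical mapping from pronunciation image filenames to IPA characters;
# the real names must be taken from the unpacked dictionary HTML.
IPA_IMAGES = {
    'image00017.jpeg': '\u0259',  # schwa
    'image00018.jpeg': '\u02c8',  # primary stress mark
}

def replace_ipa_images(html):
    """Swap <img> tags whose src is a known pronunciation image for the
    matching IPA character; leave all other images untouched."""
    def repl(match):
        return IPA_IMAGES.get(match.group(1), match.group(0))
    return re.sub(r'<img[^>]*?src="images/([^"]+)"[^>]*?/?>', repl, html)

print(replace_ipa_images('[b<img src="images/image00017.jpeg"/>t]'))  # prints [bət]
```

After the substitution the HTML can be repacked with kindlegen as usual.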

I just tried to decode another dictionary DE>RU and it also contains multiple inflection index sections.

siebert
08-06-2011, 04:51 AM
How hard would it be to figure this out? I tried to look at the format description, but I probably won't have that much time.


I think you'll learn what you need better from the code than from the format description.

I don't know how fluent your Python is, but the multiple index section support shouldn't be very hard. The most important information you have to figure out is whether the additional sections use their own different tag tables or not. If they do, you have to hold multiple tag tables in addition to multiple sections, but that's also doable, just a bit more code.

You can run mobiunpack with WRITE_RAW_DATA set to True to dump the index sections as individual files and use a hex editor to analyse the data.
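For poking at those dumped sections without a dedicated hex editor, a throwaway helper like this works (generic Python, not part of mobiunpack):

```python
def hexdump(data, width=16):
    """Return an offset / hex bytes / ASCII view of a block of binary data."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hexpart = ' '.join('%02x' % b for b in chunk)
        text = ''.join(chr(b) if 32 <= b < 127 else '.' for b in chunk)
        lines.append('%08x  %-*s  %s' % (offset, width * 3 - 1, hexpart, text))
    return '\n'.join(lines)

# e.g. on the first bytes of a dumped index section:
print(hexdump(b'INDX\x00\x00\x00\xc0'))
```

Run it over each raw index file mobiunpack writes out and compare the sections side by side.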

But I'm afraid that in your case (german-english dictionary) you will also face an issue with the inflection rules, as the current implementation doesn't handle words with special characters like german umlauts properly. That might be just a simple text encoding issue, but it could also be something which needs additional reverse engineering effort.


I want it because I have bought the PONS DE>EN dictionary and I use it to practice reading German texts. This dictionary provides word pronunciation in square brackets [], but in the Kindle pop-up window it appears empty. I have to press ENTER, which is rather inconvenient. I decoded the dictionary and I can see that the pronunciation is composed of small images. But the Kindle font actually contains extended IPA characters, so it should be trivial to replace the images with characters and repack the dictionary.


Do I understand correctly that on the Kindle device the popup window doesn't support images, so you can't read the pronunciation?

In the Kindle app the pronunciation is displayed fine, but I had to reformat my dictionary because the pronunciation and other information take up so much space that the popup window doesn't contain the actual translation for most words. So I removed the pronunciation and unnecessary whitespace from the formatting to get a usable dictionary for the Kindle app.


I just tried to decode another dictionary DE>RU and it also contains multiple inflection index sections.

The German language seems to require many more inflection rules than, for example, English. So I would assume that most if not all German-to-anything dictionaries will contain multiple inflection index sections.

If you want to test more dictionaries, mobipocket.com provides free sample downloads for (all?) the dictionaries they sell; the samples are also DRM-free, so you can just run mobiunpack on them.

Ciao,
Steffen

karunaji
08-07-2011, 01:39 AM
Thanks Steffen, for your detailed explanations.

I am not sure if I have enough patience to do this but I can always hope.

But I'm afraid that in your case (german-english dictionary) you will also face an issue with the inflection rules, as the current implementation doesn't handle words with special characters like german umlauts properly. That might be just a simple text encoding issue, but it could also be something which needs additional reverse engineering effort.

I noticed that with another dictionary. I suspect it is an encoding issue. That dictionary was in UTF-8, but it was expanded as if it were Windows-1252.

Do I understand correctly that on the Kindle device the popup window doesn't support images, so you can't read the pronunciation?

That is exactly how it is.

avid-e-reader
08-20-2011, 06:22 AM
Just found this thread and tool. Great stuff! Nice to see the HTML that kindlegen converts to! Helps to figure out what works and what doesn't.

However, I ran a little test book through kindlegen, then mobiunpack, and then kindlegen again, and somehow, the Kindle for PC won't display the cover page anymore.

I'll attach the .mobi that I unpacked, which should suffice for testing.

Also, a curiosity question: the Kindle claims to be able to play .mp3 files. Has anyone verified that, and does mobiunpack deal with such files? I have a background idea of adding some .mp3 to one of my ebooks, if it works. Kindle claims some video support too, but I'm not interested in that.

pdurrant
08-20-2011, 06:38 AM
Just found this thread and tool. Great stuff! Nice to see the HTML that kindlegen converts to! Helps to figure out what works and what doesn't.

However, I ran a little test book through kindlegen, then mobiunpack, and then kindlegen again, and somehow, the Kindle for PC won't display the cover page anymore.

I'll attach the .mobi that I unpacked, which should suffice for testing.


Testing on Mac with KindleGen 1.2 and Mobiunpack 0.29, it works fine for me. Copying the books to Windows XP, your sample book crashes my copy of Kindle for PC 1.5, but the recompiled version loads fine, and Go/Cover works too.

I think you need to update your copy of KindleGen.

avid-e-reader
08-20-2011, 06:49 AM
I tried mentally comparing my original .opf with the generated one, and there was no entry in the manifest for the cover image file. After adding one, it regenerates fine. Would that indicate a problem in kindlegen? I guess there is a newer one... but it seems to be an omission from the .opf -- unless the newer one doesn't need the manifest entry.

avid-e-reader
08-20-2011, 06:59 AM
OK, upgrading from Kindlegen 1.1.99 to 1.2 means the manifest entry is no longer needed, and the syntax that I didn't recognize in the <x-metadata> tag covers the need. Thanks for the help.

pdurrant
08-20-2011, 07:04 AM
I tried mentally comparing my original .opf with the generated one, and there was no entry in the manifest for the cover image file. After adding one, it regenerates fine. Would that indicate a problem in kindlegen? I guess there is a newer one... but it seems to be an omission from the .opf -- unless the newer one doesn't need the manifest entry.

As I said, you seem to be using an old version of Kindlegen. I'd suggest using the latest version, which I think is 1.2.

The opf from mobiunpack does not include the cover image in the manifest, but it does include it in the EmbeddedCover tag.

As far as I know, the cover image shouldn't be necessary in the manifest. In opf files for Mobipocket Creator it wasn't needed, and I think it caused problems if it was there. Similarly, images linked from the html weren't included in the manifest (& aren't included in the manifest by mobiunpack).

Since it works fine with Kindlegen 1.2, and we're trying to create input for Kindlegen, not valid ePub source, I don't intend to change this behaviour. You are free to do so yourself, of course.

[I see you're sorted. Message Lag.]

avid-e-reader
08-26-2011, 06:56 PM
Also, a curiosity question: the Kindle claims to be able to play .mp3 files. Has anyone verified that, and does mobiunpack deal with such files? I have a background idea of adding some .mp3 to one of my ebooks, if it works. Kindle claims some video support too, but I'm not interested in that.

It's still a question whether the mobi decoder works with .mp3 files embedded in .mobi files. I'm about ready to start playing with adding such files to one of my ebooks.

pdurrant
08-27-2011, 05:45 AM
It's still a question whether the mobi decoder works with .mp3 files embedded in .mobi files. I'm about ready to start playing with adding such files to one of my ebooks.

I've no idea how kindlegen does mp3 files. I rather doubt that the decoder will output them sensibly. If you have a sample Kindle file with an MP3 in it, it might be interesting to take a look at it.

tomsem
08-28-2011, 07:58 PM
It's still a question whether the mobi decoder works with .mp3 files embedded in .mobi files. I'm about ready to start playing with adding such files to one of my ebooks.

At present, I think only the iOS Kindle app can play embedded audio/video. On the Kindle, you just see 'There is video content at this place that can be played only on supported Kindle devices and reader applications.'

The Kindle's mp3 support is limited to background playing or (if in the audible folder) the audiobook player.

My guess is that the decoder will not extract audio/video files, but I haven't played with that. Let us know what you discover.

pdurrant
09-01-2011, 05:22 AM
I've just uploaded version 31 to my post near the start of this thread (http://www.mobileread.com/forums/showpost.php?p=774836&postcount=5).

This has no changes for Kindle/Mobipocket ebooks, but adds initial support for Kindle/Print Replica ebooks, outputting a PDF file, and also several mysterious data sections.

The sample Print Replica ebook I have contains only one PDF. It seems from the data stored that it's possible to have more than one PDF stored in a Print Replica ebook. I have tried to guess how this will work, but it's entirely possible that I've guessed wrong, and that the code will need tweaking once someone comes across such an ebook.

As with Kindle/Mobipocket books, this script only works on non-DRM Print Replica ebooks, which at the moment seems to be only the samples that can be downloaded from Amazon.

Feedback very welcome indeed.

pdurrant
09-01-2011, 06:27 PM
Feedback very welcome indeed.

There was a typo that caused it to reject .azw4 file extensions. Now fixed, but version number not incremented.

DaleDe
09-01-2011, 07:18 PM
Great job, there is an AZW4 wiki page now if you want to document anything.

Dale

pdurrant
09-02-2011, 05:11 AM
Great job, there is an AZW4 wiki page now if you want to document anything.

Thanks. I've added in what I know, which isn't much. Just enough to know how much I don't know!

pdurrant
09-02-2011, 02:51 PM
I've now updated the Mobipocket Updater AppleScript to handle .azw4 (Print Replica) ebooks as well.

avid-e-reader
09-05-2011, 02:27 AM
What about .ncx files? It seems mobiunpack doesn't generate one, even if one was used during creation of the book.

pdurrant
09-05-2011, 03:44 AM
What about .ncx files? It seems mobiunpack doesn't generate one, even if one was used during creation of the book.

A good point. And that data must be in there somewhere, as it can be used to navigate in a Kindle file.

All that's necessary is for someone to spend the time to work out where and in what format it's stored, document that, and then for someone to take that info and write code to decode the info and create an ncx file from it.

Good luck!

avid-e-reader
09-05-2011, 03:55 AM
So mobiunpack actually creates this kindlegensrc.zip file somewhere along the line. I just found a (slightly tweaked) version of my .ncx file in there, in the "misc" directory. The tweaks are a reflection of the directory structure.

So it looks like the changes necessary would be threefold:
1) extract the .ncx file out of the kindlegensrc.zip and place it in a (newly created) misc directory, alongside the Images directory
2) reference the .ncx file in the <manifest> tag of the .opf file.
3) add a toc="toc" attribute to the <spine> tag of the .opf file.

I haven't looked at the source for mobiunpack, to know how hard or easy this would be, nor do I know what sort of source control is being used, or where it is located.
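Steps 2) and 3) amount to a small text patch on the generated .opf. A sketch with naive string patching, assuming the simple opf layout mobiunpack emits (untested against real output):

```python
def add_ncx_to_opf(opf_text, ncx_href='misc/toc.ncx'):
    """Add an ncx item to the manifest and point the spine at it.
    Naive string patching; assumes a plain <manifest>...</manifest>
    and an unattributed <spine> tag."""
    item = ('  <item href="%s" id="toc" '
            'media-type="application/x-dtbncx+xml" />\n' % ncx_href)
    opf_text = opf_text.replace('</manifest>', item + '</manifest>')
    return opf_text.replace('<spine>', '<spine toc="toc">')

opf = '<manifest>\n</manifest>\n<spine>\n</spine>\n'
print(add_ncx_to_opf(opf))
```

A real implementation would parse the opf as XML rather than patch strings, but this shows the two changes involved.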

pdurrant
09-05-2011, 04:01 AM
So mobiunpack actually creates this kindlegensrc.zip file somewhere along the line. I just found a (slightly tweaked) version of my .ncx file in there, in the "misc" directory. The tweaks are a reflection of the directory structure.

So it looks like the changes necessary would be threefold:
1) extract the .ncx file out of the kindlegensrc.zip and place it in a (newly created) misc directory, alongside the Images directory
2) reference the .ncx file in the <manifest> tag of the .opf file.
3) add a toc="toc" attribute to the <spine> tag of the .opf file.

I haven't looked at the source for mobiunpack, to know how hard or easy this would be, nor do I know what sort of source control is being used, or where it is located.

The ncx in the kindlegensrc isn't the information in the Mobipocket file. The source files are just stored at the end of the file by kindlegen for mysterious purposes. They can be stripped from the Kindle ebook without losing the ncx navigation info in the file.

Perhaps MobiUnpack should be tweaked to also export any unknown binary data in the original file. Hmm...

avid-e-reader
09-05-2011, 04:03 AM
A little more info: the <manifest> tag should probably look like:

<item href="misc/toc.ncx" id="toc" media-type="application/x-dtbncx+xml" />

but with the file name possibly different in different cases, based on the actual .ncx file name found? But here's something I don't understand: there is also a content.opf in the kindlegensrc.zip file, but it doesn't seem to match the one generated by mobiunpack.

avid-e-reader
09-05-2011, 04:11 AM
(we crossed paths in the bitstream)
So while the kindlegensrc.zip supposedly holds the source files, they are not the exact source files: the directory structure is modified, and the files are tweaked to reflect the changed directory structure.

Maybe exporting any unknown binary data into files would at least make disassembling it a bit easier. I have no clue where the .ncx file goes, or what format it gets placed in (maybe Kovid does), but without any obvious way of looking at it, it is pretty hard to figure that out. It seems reasonably likely that one (or more) of the binary pieces mobiunpack presently ignores is the .ncx data.

And maybe others would be the .mp3 files I was asking about earlier, although sadly adding .mp3 files is something that keeps slipping further out in my project list.

siebert
09-05-2011, 04:12 AM
But here's something I don't understand: there is also a content.opf in the kindlegensrc.zip file, but it doesn't seem to match the one generated by mobiunpack.

The kindlegensrc.zip contains the (slightly modified) sources used by kindlegen to create the mobi file. The content of kindlegensrc.zip should be sufficient to recreate the mobi file (with the exception of some fields which are not created by kindlegen based on the source).

If you have a kindlegensrc.zip, you can just ignore the remaining output of mobiunpack.

Unfortunately, most mobi files don't contain the record which holds the kindlegensrc.zip, so using its content to improve the mobiunpack output won't help in most cases.
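When the record is present, pulling the sources out is trivial with the standard library, since kindlegensrc.zip is an ordinary zip archive (sketch; the function name is made up):

```python
import io
import zipfile

def extract_kindlegen_sources(zip_bytes, dest='kindlegen_src'):
    """Unpack an embedded kindlegensrc.zip (given as raw bytes) into dest/
    and return the list of member names."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

Usage would be along the lines of `extract_kindlegen_sources(open('kindlegensrc.zip', 'rb').read())` on the file mobiunpack writes out.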

But at least the new mobiwriter in calibre should handle ncx files, so the calibre source should show how the ncx content is encoded in the mobi file.

Ciao,
Steffen

pdurrant
09-05-2011, 04:17 AM
A little more info: the <manifest> tag should probably look like:

<item href="misc/toc.ncx" id="toc" media-type="application/x-dtbncx+xml" />

but with the file name possibly different in different cases, based on the actual .ncx file name found? But here's something I don't understand: there is also a content.opf in the kindlegensrc.zip file, but it doesn't seem to match the one generated by mobiunpack.

The kindlegensrc.zip file is just extracted from the penultimate record in the Kindle ebook. It's put in there by Kindlegen, but is not actually used by any rendering software.

All the other files generated by MobiUnpack are generated by decoding the info in the Kindle ebook. In particular, the opf file is put together from bit of info in the header, EXTH records, and even from the HTML. It should have most of the info in the original opf file, but not all that info will actually be contained in the Kindle ebook.

avid-e-reader
09-05-2011, 04:21 AM
And regarding .mp3 files, here's a sample .mobi with .mp3 that I got from somewhere, maybe with the Kindlegen documentation?

pdurrant
09-05-2011, 04:22 AM
But at least the new mobiwriter in calibre should handle ncx files, so the calibre source should show how the ncx content is encoded in the mobi file.


Oooo... I wonder if the developer of that has documented it in the wiki? That would make life easier. Hmm.. apparently not. When I have some spare time I'll check the calibre sources.

siebert
09-05-2011, 04:27 AM
Oooo... I wonder if the developer of that has documented it in the wiki? That would make life easier. Hmm.. apparently not. When I have some spare time I'll check the calibre sources.

As calibre can also decode a mobi, there might even exist some python code in calibre which creates the ncx file from an existing mobi.

Ciao,
Steffen

siebert
09-05-2011, 04:30 AM
And regarding .mp3 files, here's a sample .mobi with .mp3 that I got from somewhere, maybe with the Kindlegen documentation?

While reverse engineering should be possible with only the mobi, it would be much easier if you could provide the sources for an example book which contains an mp3 file, as someone could then build two mobi files from the source (one with mp3 and one without) and analyse the differences to learn how mp3 support is encoded.

Ciao,
Steffen

avid-e-reader
09-05-2011, 04:44 AM
The source was just an .html file and a .mp3 file (in a subdirectory named multimedia). Attached as a .zip.

DaleDe
09-05-2011, 10:57 AM
A little more info: the <manifest> tag should probably look like:

<item href="misc/toc.ncx" id="toc" media-type="application/x-dtbncx+xml" />

but with the file name possibly different in different cases, based on the actual .ncx file name found? But here's something I don't understand: there is also a content.opf in the kindlegensrc.zip file, but it doesn't seem to match the one generated by mobiunpack.

Of course it does not match. The kindlegensrc is likely an ePub source file, while mobiunpack generates a mobi source file. These are not the same thing and were not even created with the same version of the IDPF standard. Perhaps you do not realize that there was an earlier version of the eBook standards that was originally used by eBook readers as a source format. The Mobi, Lit and eBookwise IMP formats were all derived from that earlier standard. See our wiki under Open eBook for more details.

Hitch
09-05-2011, 04:53 PM
And regarding .mp3 files, here's a sample .mobi with .mp3 that I got from somewhere, maybe with the Kindlegen documentation?

The Jabberwocky mobi rather notoriously does not work. I'd say, therefore, that it's a skosh useless as an exemplar.

Hitch

siebert
09-06-2011, 06:28 AM
The source was just an .html file and a .mp3 file (in a subdirectory named multimedia). Attached as a .zip.

I've modified this sample to add also a video file and ran mobiunpack on it.

The handling of audio and video files is almost identical to that of image files (surprise :)

The only difference is that a 12 byte header is prepended to the original audio/video file, starting with "AUDI" or "VIDE" followed by 2 integers of unknown value.

Quite similar to the image handling, the source attributes of the html tags are replaced with the record numbers:

src="file.mp3" -> mediarecindex="00002"
poster="file.jpg" -> recindex="00003"

So it should be easy to add support for audio/video to mobiunpack.
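Based on that description, splitting such a record could be sketched like this (big-endian 32-bit integers are an assumption here, in line with the rest of the Palm database format; the two header values remain unexplained):

```python
import struct

def split_media_record(record):
    """Split a MOBI audio/video record into its 4-byte tag, the two
    unknown 32-bit header integers, and the raw media payload."""
    tag, unknown1, unknown2 = struct.unpack('>4sLL', record[:12])
    if tag not in (b'AUDI', b'VIDE'):
        raise ValueError('not a media record: %r' % tag)
    return tag, (unknown1, unknown2), record[12:]

# The payload after the 12-byte header is the original mp3/video file.
tag, unknowns, payload = split_media_record(b'AUDI' + b'\x00' * 8 + b'ID3')
assert tag == b'AUDI' and payload == b'ID3'
```

Writing the payload to an .mp3 or video file named after the record index would mirror what mobiunpack already does for images.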

But is audio/video support really used in the wild?

My understanding is that only very few Kindle platforms support them (is there a list of the supported platforms?)

Ciao,
Steffen

pdurrant
09-06-2011, 09:46 AM
Well, I took a quick look at where the ncx file might be being stored, and it turns out that when an ncx file is added to the sources, you get three extra records added to the Mobipocket file.

Here's the source NCX file, along with the three added sections of the Mobipocket file (separated out into individual files).

I don't have time to properly decode the binary formats, but if anyone fancies a puzzle, here they are. The task is to work out how to reconstruct (as best as possible) the source ncx file from the compiled binary files.

DiapDealer
09-06-2011, 10:22 AM
I don't have time to properly decode the binary formats, but if anyone fancies a puzzle, here they are. The task is to work out how to reconstruct (as best as possible) the source ncx file from the compiled binary files.
For those who may be looking for insight into the ncx reconstruction from calibre source-code, I'd start with calibre/ebooks/mobi/input.py. Which will lead you to calibre/ebooks/mobi/reader.py... specifically the MobiReader class and its extract_contents function.

I can't get my head around it all quite yet, but maybe someday! ;)

kaizoku
09-10-2011, 11:07 AM
Getting some unknown Metadata error with this sample file. Rename the file to .azw4.

pdurrant
09-10-2011, 01:01 PM
Getting some unknown Metadata error with this sample file. Rename the file to .azw4.

I think that unknown Metadata should only show as a warning. There is almost always some unknown metadata, as the Mobipocket/Print Replica file format is undocumented.

MobiUnpack used to ignore it, now it mentions it. You can ignore it.

siebert
09-10-2011, 04:04 PM
I think that unknown Metadata should only show as a warning. There is almost always some unknown metadata, as the Mobipocket/Print Replica file format is undocumented.


Hi,

In my version I've already introduced a list of "known unknown" metadata (meaning that we know these values exist, but we don't know their meaning), and mobiunpack complains only if an unknown value isn't in this list.
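The idea is simple enough to sketch; the id values below are illustrative (404 is the one from the log earlier in the thread), not siebert's actual list:

```python
# EXTH ids we know exist but whose meaning is still unknown (illustrative).
KNOWN_UNKNOWN_METADATA_IDS = {404, 405}

def metadata_warning(meta_id):
    """Return a warning string only for EXTH metadata ids never seen before."""
    if meta_id in KNOWN_UNKNOWN_METADATA_IDS:
        return None  # known to exist, meaning unknown: stay quiet
    return 'Warning: Unknown metadata with id %d found' % meta_id
```

That way the warnings flag only genuinely new ids worth investigating.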

I hope I'll find time to release my version soon :)

Ciao,
Steffen

Anjelous
09-10-2011, 04:11 PM
Edited! Ok to delete this post as I found another thread that better answers my question :)

fandrieu
09-12-2011, 09:05 AM
Hello everybody.

First I'd like to thank the community for all the good work; without the homebrew tools my experience with the mobi file format and the Kindle as a whole wouldn't have been nearly as nice!

Back on topic, I first came here to ask if somebody's maintaining mobiunpack.py / accepting patches, but reading the last few posts it would seem that both pdurrant and siebert are working on a branch, am I right?
If so, could I contribute?

...

Also in the last few posts there was talk about extracting the NCX from mobi files; it just so happens that's the very feature I've been toying with this weekend, and what made me come here today :)
At this time I have some (pretty awful) proof-of-concept code that can extract a flat "chapter only" NCX. I got the necessary clues from the "writer" part of the calibre mobi module; I could elaborate on that if somebody's interested...

Apart from that I made some corrections (like the encoding header in the html, which appears to be in siebert's branch) and also have an alternate "Adding anchors..." code that reconstructs all anchors, even when they're not referenced, and should avoid adding anchors in the <head> (a bug I encountered with some files).

I was also interested in re-factoring the code to be more readable / workable (this also appears to be in siebert's plans :)).

I started with the (pdurrant's ?) version @ http://code.google.com/, but wouldn't mind switching...

KevinH
09-12-2011, 09:16 AM
Hi,

Great! The more the merrier. If you look through this topic you will find links to later versions than what we (pdurrant and I) hosted on code.google.com - we have not bothered to update that site lately. Yes, you are right siebert has added support for Dictionaries and made some major speed improvements. I have added code to spit out more of the metadata so that the tool can be used to investigate more about what each metadata means (for example we recently found what we think is the expiration date), and pdurrant has added support for non-drm versions of the .azw4 format.

Simply walk through this thread and grab the very latest version of mobiunpack.py that you see and use that as your starting point. I believe you want mobiunpack.py version 0.31 posted by pdurrant a few days ago to this thread. siebert may have an even newer version but I don't think he has posted it yet. Let me know if you can't find it and I will post it again for you.

KevinH

pdurrant
09-12-2011, 09:16 AM
I'm afraid the google code version is very much out of date. The current version is 0.31, which can be found in this thread here (http://www.mobileread.com/forums/showpost.php?p=774836&postcount=5).

While we really ought to be using version control software in a clever shared manner, at present we seem to be posting updates here, which I copy back to the fifth post in this thread.

Some ncx generation code would be welcome. I posted a sample of the binary data representing an ncx, along with the source ncx file, here (http://www.mobileread.com/forums/showpost.php?p=1732188&postcount=147).

Any other changes would be interesting to see too.

DiapDealer
09-12-2011, 09:17 AM
I'd be very interested in seeing the NCX extraction code you've come up with. I use calibre to convert epubs to mobi, and then feed the output of mobiunpack to kindlegen... so not having to rebuild the NCX by hand each time would be very welcome indeed.

siebert
09-12-2011, 11:54 AM
I use calibre to convert epubs to mobi, and then feed the output of mobiunpack to kindlegen...

Calibre had a debug option --kindlegen which used the kindlegen binary to build the mobi file.

Kovid removed that option a few days ago because he doesn't like me and my request to make that option selectable via the gui (though I'm obviously not the only person liking that feature), but if you are willing to use either an older or a modified version of calibre you don't need the mobiunpack step.

Ciao,
Steffen

fandrieu
09-12-2011, 12:14 PM
Thank you very much for all your quick replies !

I just downloaded v31 from the link you provided and finished retro-fitting the modifications I had made to v23.

I'll try to explain why / how I did those changes later but first, as code speaks louder than words, here's the file.

....

Just a few words:
* I just finished merging, it's not tested :(

* The NCX part is really a proof of concept, it does however produce an acceptable output on my test files with flat NCX.
It consists of:
- a code block with 3 methods just before unpackBook
- a main code block in unpackBook, enclosed by "#TEST NCX"
- a small mod to the OPF code, to add a ref to the NCX

* Other than that, there are some "empirical" changes I made while testing some files:
- FILEPOS_ON_ALL_ANCHORS: an option to use an alternate code that processes all empty anchors instead of focusing on existing links...
- replaced a " " by "\s+" in the "Insert hrefs into html" rx...
- alternate way to set the html file encoding

EDIT: sorry, the file I uploaded contained several fatal errors I failed to spot.
EDIT: the new file should work at least with calibre-generated mobis...

EDIT2: added a text file describing what I gathered of the NCX equivalent in MOBI

EDIT3: basic fixes to the code...

DiapDealer
09-12-2011, 12:48 PM
Calibre had a debug option --kindlegen which used the kindlegen binary to build the mobi file.

Kovid removed that option a few days ago because he doesn't like me and my request to make that option selectable via the gui (though I'm obviously not the only person liking that feature), but if you are willing to use either an older or a modified version of calibre you don't need the mobiunpack step.
How would that be any different than feeding the epub directly to kindlegen? My reason for using calibre as an intermediate step is because calibre does a much better job of translating/flattening an ePub's CSS into a mobi that more accurately reflects the original (visibly) than kindlegen currently does. Kindlegen can then take the mobiunpack output and create the final mobi (with the approved tools). Or am I missing something?

KevinH
09-12-2011, 01:58 PM
Hi,

Thanks for posting! I grabbed it and tried it on a bunch of mobis I had and unfortunately, the anchors for many of the internal links in the document no longer work. I tested it with mobiunpack version 31 without your changes and all internal links worked.

So somehow your changes have broken some of the internal links.
I will try to track this down.

I did get some form of NCX file but it was incomplete and there were error messages:

Write html
ERROR: last byte not 0x80
ERROR: text not found 1354424
Wite ncx
Write opf

I will keep playing with it to see if I can get the internal links working again.

Thanks for getting this ncx stuff going!

KevinH




pdurrant
09-12-2011, 02:03 PM
How would that be any different than feeding the epub directly to kindlegen? My reason for using calibre as an intermediate step is because calibre does a much better job of translating/flattening an ePub's CSS into a mobi that more accurately reflects the original (visibly) than kindlegen currently does. Kindlegen can then take the mobiunpack output and create the final mobi (with the approved tools). Or am I missing something?

I believe the --kindlegen option did the usual conversion to mobipocket-specific HTML, but then used kindlegen to compile it into an actual mobipocket file rather than calibre's own mobipocket file generation code.

DiapDealer
09-12-2011, 02:29 PM
I believe the --kindlegen option did the usual conversion to mobipocket-specific HTML, but them used kidnlegen to compile it into an actual mobipocket file rather than calibre's own mobipocket file generation code.
Ahhh, ok, that makes sense. Thanks.

KevinH
09-12-2011, 02:32 PM
Hi,

Your link_pattern, used when FILEPOS_ON_ALL_ANCHORS is True, seems to be a bit broken:

For example: here is what the rawml says for one link:

<a filepos=0000006414 >M<span><font size="2">APS</font></span></a>

but this link is never properly detected or processed by your link pattern:

link_pattern = re.compile(r'''<a\s*(></a>|/>)''', re.IGNORECASE)

So you might want to take another look at your link patterns to make sure rawml of this type gets processed properly.
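For illustration only (this is just a sketch, not the pattern the project ended up using), a broader regex that would also catch opening anchor tags carrying a filepos attribute might look like this:

```python
import re

# Illustrative sketch: match any <a ...> opening tag that carries a
# filepos attribute, capturing the ten-digit offset so an id could be
# inserted at that position later.
link_pattern = re.compile(r'''<a[^>]*\sfilepos=['"]?(\d{10})[^>]*>''', re.IGNORECASE)

sample = '<a filepos=0000006414 >M<span><font size="2">APS</font></span></a>'
match = link_pattern.search(sample)
# match.group(1) now holds '0000006414'
```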

Hope this helps,

KevinH





KevinH
09-12-2011, 04:50 PM
Hi,

Okay I looked more at this index material. It appears the "type" information is key to understanding how to read in the indx information.

For example:

To correctly parse the indx entries, I had to do something like the following:

if type == 0x1f:
    # handle next two variable width unknowns
    pos, unk1 = getVariableWidthValue(navdata, offset)
    offset += pos
    print "unknown 1 is ", unk1
    pos, unk2 = getVariableWidthValue(navdata, offset)
    offset += pos
    print "unknown 2 is ", unk2
if type == 0xdf:
    # handle next four variable width unknowns
    pos, unk1 = getVariableWidthValue(navdata, offset)
    offset += pos
    print "unknown 1 is ", unk1
    pos, unk2 = getVariableWidthValue(navdata, offset)
    offset += pos
    print "unknown 2 is ", unk2
    pos, unk3 = getVariableWidthValue(navdata, offset)
    offset += pos
    print "unknown 3 is ", unk3
    pos, unk4 = getVariableWidthValue(navdata, offset)
    offset += pos
    print "unknown 4 is ", unk4
if type == 0x3f:
    # handle next three variable width unknowns
    pos, unk1 = getVariableWidthValue(navdata, offset)
    offset += pos
    print "unknown 1 is ", unk1
    pos, unk2 = getVariableWidthValue(navdata, offset)
    offset += pos
    print "unknown 2 is ", unk2
    pos, unk3 = getVariableWidthValue(navdata, offset)
    offset += pos
    print "unknown 3 is ", unk3

and then there is no need to look for or skip 0x80 values.
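For anyone following along who doesn't have the script handy, a sketch of what a getVariableWidthValue helper like the one used above might look like (an assumption based on the MOBI forward-encoded variable-width integer format: seven bits per byte, most significant bits first, with the 0x80 bit set on the final byte):

```python
def get_variable_width_value(data, offset):
    """Decode a forward-encoded variable-width integer starting at
    offset in a bytes object. Each byte contributes its low 7 bits,
    most significant first; the byte with the high bit (0x80) set
    terminates the value. Returns (bytes consumed, value)."""
    value = 0
    consumed = 0
    while True:
        byte = data[offset + consumed]
        value = (value << 7) | (byte & 0x7F)
        consumed += 1
        if byte & 0x80:  # high bit marks the final byte
            break
    return consumed, value

# e.g. the single byte 0x81 encodes the value 1 in one byte
```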

Also the count is not the same as the number of entries in the CTOC.

From my set of ebooks, the CTOC data always ends with '\0\0' double null bytes and it has variable length.
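If it helps anyone probing the CTOC dumps, here is a rough sketch of pulling one label out of the CTOC data, under the assumption that each entry is a length prefix followed by that many bytes of text (the one-byte length only holds for short labels; longer labels would need a variable-width length):

```python
def read_ctoc_label(ctoc, offset):
    """Read one CTOC entry at the given offset: a one-byte length
    prefix (assumed; labels of 128 bytes or more would use a
    variable-width length) followed by that many bytes of text."""
    length = ctoc[offset]
    start = offset + 1
    return ctoc[start:start + length].decode('utf-8')

# e.g. in a CTOC blob b'\x05cover\x07chapter\x00\x00' the entry at
# offset 0 is 'cover' and the entry at offset 6 is 'chapter'
```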

So I have attached a mobiunpack_test.py program that modifies things to work with a real amazon mobi ebook (as opposed to calibre generated ones).

Perhaps this might help others trying to track things down.

I am going to try and figure out what each of these unknowns actually means.

Hope this helps,

KevinH

siebert
09-12-2011, 05:36 PM
Okay I looked more at this index material. It appears the "type" information is key to understanding how to read in the indx information.


The various indexes seem to be very similar in mobi, so the ncx handling code should be able to reuse a lot of my code for the inflection index.

INDX0 is the meta index and the TAGX section can be parsed with readTagSection(). INDX1 is the actual index data, and the CTOC data is like the inflNameData.
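For anyone reading along, a rough sketch of a parser for such a TAGX section, assuming the layout documented on the wiki (a b'TAGX' magic, a big-endian table length and control byte count, then four-byte tag entries; names here are illustrative, not the actual readTagSection code):

```python
import struct

def read_tagx_section(data, offset):
    """Sketch of a TAGX parser: b'TAGX' magic, big-endian 32-bit table
    length and control byte count, then four-byte entries of
    (tag, values per entry, bitmask, end-of-control-byte flag)."""
    if data[offset:offset + 4] != b'TAGX':
        raise ValueError('no TAGX section at this offset')
    table_length, control_byte_count = struct.unpack(
        '>II', data[offset + 4:offset + 12])
    tags = []
    for pos in range(offset + 12, offset + table_length, 4):
        tag, num_values, bitmask, eof = data[pos:pos + 4]
        tags.append((tag, num_values, bitmask, eof))
    return control_byte_count, tags
```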

Ciao,
Steffen

KevinH
09-12-2011, 05:40 PM
Hi,

Okay, here is what the extra variable-width values mean in the INDX:

The first unknown is actually the heading level,
with 0 being top level, 1 indented one level, etc.

The second unknown is actually an offset into the CTOC that describes the kind of entry. For my book this pointed to "cover", "other", "titlepage", "copyright", "part", "chapter", etc

If type = 0x3f
unknown 1 = heading level (this seems to be a 1)
unknown 2 = kind of entry (offset into CTOC)
unknown 3 = offset into index data which this entry should be listed under (ie. what it is a sub entry to)

if type = 0xdf
unknown 1 = heading level
unknown 2 = kind of entry (offset into CTOC) (in this case a "part")
unknown 3 = first indx entry included under this part
unknown 4 = last indx entry included under this part


For my ebook each "part" was a "Book 1", "Book 2", etc. and under it were the individual "chapters" that belong to that "part".

Again, hope this helps. I will examine some more books with complex toc's and try to figure out more.

KevinH

fandrieu
09-12-2011, 06:13 PM
I grabbed it and tried it on a bunch of mobis I had

Thanks very much for the effort you put into testing this.

The code is indeed very rough and there's more chance of getting an exception than an ncx ;)

About the "anchor" bit it was really just a wild experiment that worked for the book a was working on...

I haven't had much time to read your posts yet, but you are right, the first missing thing is to take the INDX entry type into account, and not just assume 0x0F=chapter as I did to try this out.

BTW i updated the zip i posted with some "essential" stuff that was missing:
* bail if the INDX entry type is not 0xF (todo...)
* correctly handle end of CTOC (nul terminated)
* basic checking of the indx_header (wrong in some test files)
* ...

KevinH
09-12-2011, 07:46 PM
Hi fandrieu,

I updated my mobiunpack_test.py to handle all of my mobi indx entries in my test set of mobis (quite limited actually!)

It simply documents things and prints out everything while trying to decipher INDX1. I did nothing with generating the ncx, just some debug output to help with the multi-level ncx stuff when you get around to working on it.

So hopefully you will find this useful when incorporating your fixes and things into a real version.

Take care,

Kevin




PS: I tried this on a few other mobis and it barfed. It seems the record format is not even fully determined by the record type. It appears that the heading level determines whether the parent field is there or not, only specific record types have "kind" information, and the order of the fields in each record seems to vary by type and heading level.

Arrgghhh! What a mess! Perhaps the Mobi version number might be useful in determining which fields are present for each record type!

So right now I have to read in the record type and heading level to figure out which fields are stored, and even that seems to vary from older mobis to newer ones.

So my mobiunpack_test.py will only work for very specific cases.



PPS:

I added a newer mobiunpack_test.py (mobiunpack_test3.zip) that seems to work on a wider variety of mobi ebooks when deciphering the INDX1 information.

Again feel free to pick and use as you see fit. Hope this helps!

KevinH


kaizoku
09-13-2011, 11:03 AM
Someone should merge this into one release rather than having so many.

DiapDealer
09-13-2011, 11:23 AM
Someone should merge this into one release rather than having so many.
There is only one release: v0.31, found on this post (http://www.mobileread.com/forums/showpost.php?p=774836&postcount=5) (ok, the previous version 0.30 is there as well, but you get my point). The rest of the files posted on this page are strictly experimental at this point.

KevinH
09-13-2011, 11:35 AM
Hi,

Yes, none of my "test" versions, nor even the 31_fand versions, are close to being ready for prime time. They are only a concise way to pass information back and forth (via reading the code) and are for those programmers who might be interested in helping figure out the indx sections to generate an ncx.

If and when a stable working version exists, it will be clearly marked as such and posted as version 32 or later.

If you want to play around with things, grab my very latest mobiunpack_test.py (see above) and try it on a non-drm mobi ebook (preferably one with a large multi-level table of contents). When it barfs (and it will!), post back here a zip containing a log of the program output along with the generated ctoc.dat, indx1.dat and indx0.dat files, and we can use your error messages to help figure out how our interpretation of the indx1 record structure is broken and hopefully fix it to be more robust.

KevinH

kaizoku
09-13-2011, 03:17 PM
Warning: Unknown metadata with id 125 found
Warning: Unknown metadata with id 405 found
Warning: Unknown metadata with id 406 found
Warning: Unknown metadata with id 407 found
Warning: Unknown metadata with id 403 found

Hopefully this helps... it only generated one .dat file and lots of .data files.

KevinH
09-13-2011, 03:28 PM
Is there any tutorial on how to run the test mobiunpack on a Mac without AppleScript support? I think the new Lion comes with Python installed.

Hi,

On a Mac you can use Terminal.app

In a new folder called "test" on your Desktop, copy in mobiunpack_test.py, and a non-drm .mobi ebook file of your choice. Then inside of that "test" folder create an output folder called "out".

Now double-click to run Terminal.app and enter the following commands
(replacing YOUR_MOBI_EBOOK.mobi with the name of the ebook you copied into the test folder):

cd ~/Desktop/test
python ./mobiunpack_test.py YOUR_MOBI_EBOOK.mobi out/ > debug.log


A few different files will be created:
indx0.dat
indx1.dat
ctoc.dat
debug.log

and inside of the out/ directory you should find the ncx, opf, html, etc

KevinH
09-13-2011, 04:26 PM
Hi,

Some more info. If you look in calibre-0.8.18 source tar gzip archive inside calibre/src/calibre/ebooks/mobi/writer2/ at indexer.py you can see the code that creates the index entries.

Just as siebert suggested, we should be parsing the TAGX entry in INDX0 and deciphering it to find out the various fields that are actually present for each index type.

I am going to study the TAGX object and more specifically the BITMASKS and how they are used to encode values that represent the fields available for each record type for that particular ebook.

class TAGX(object): # {{{

    BITMASKS = {11:0b1}
    BITMASKS.update({x:(1 << i) for i, x in enumerate([1, 2, 3, 4, 5, 21, 22, 23])})
    BITMASKS.update({x:(1 << i) for i, x in enumerate([69, 70, 71, 72, 73])})

    NUM_VALUES = defaultdict(lambda :1)
    NUM_VALUES[11] = 3
    NUM_VALUES[0] = 0

    def __init__(self):
        self.byts = bytearray()

    def add_tag(self, tag):
        buf = self.byts
        buf.append(tag)
        buf.append(self.NUM_VALUES[tag])
        # bitmask
        buf.append(self.BITMASKS[tag] if tag else 0)
        # eof
        buf.append(0 if tag else 1)

    def header(self, control_byte_count):
        header = b'TAGX'
        # table length, control byte count
        header += pack(b'>II', 12+len(self.byts), control_byte_count)
        return header

    @property
    def periodical(self):
        '''
        TAGX block for the Primary index header of a periodical
        '''
        list(map(self.add_tag, (1, 2, 3, 4, 5, 21, 22, 23, 0, 69, 70, 71, 72,
                                73, 0)))
        return self.header(2) + bytes(self.byts)

    @property
    def secondary(self):
        '''
        TAGX block for the secondary index header of a periodical
        '''
        list(map(self.add_tag, (11, 0)))
        return self.header(1) + bytes(self.byts)

    @property
    def flat_book(self):
        '''
        TAGX block for the primary index header of a flat book
        '''
        list(map(self.add_tag, (1, 2, 3, 4, 0)))
        return self.header(1) + bytes(self.byts)

    ...


The class IndexEntry in that file lists many of the same things I was able to pull out in my example cases:


class IndexEntry(object):

    TAG_VALUES = {
        'offset': 1,
        'size': 2,
        'label_offset': 3,
        'depth': 4,
        'class_offset': 5,
        'secondary': 11,
        'parent_index': 21,
        'first_child_index': 22,
        'last_child_index': 23,
        'image_index': 69,
        'desc_offset': 70,
        'author_offset': 73,
    }
    RTAG_MAP = {v:k for k, v in TAG_VALUES.iteritems()}


And there are 3 routines that look particularly interesting in that file:


    @property
    def tag_nums(self):
        for i in range(1, 5):
            yield i
        for attr in ('class_offset', 'parent_index', 'first_child_index',
                     'last_child_index'):
            if getattr(self, attr) is not None:
                yield self.TAG_VALUES[attr]

    @property
    def entry_type(self):
        ans = 0
        for tag in self.tag_nums:
            ans |= TAGX.BITMASKS[tag]
        return ans

    def attr_for_tag(self, tag):
        return self.RTAG_MAP[tag]


This is probably old hat to siebert but is all new to me so if anyone has any ideas how to properly decipher the "type" value to map it to the fields that are stored there, it would certainly help. Once we have that, it is relatively painless to write a recursive routine to process the parent / child relationships and convert it to a nice level and sorted list for output as a multilevel ncx.
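As a sketch of that last step (all field names here are illustrative, not the actual mobiunpack variables), turning flat entries carrying a parent reference into a tree is indeed fairly painless:

```python
def nest_entries(entries):
    """Group flat index entries into a tree using their 'parent' field
    (an index into the same list, or None for a top-level entry).
    Field names are illustrative only."""
    roots = []
    for entry in entries:
        entry['children'] = []
    for entry in entries:
        if entry['parent'] is None:
            roots.append(entry)
        else:
            entries[entry['parent']]['children'].append(entry)
    return roots

entries = [
    {'label': 'Book 1', 'depth': 0, 'parent': None},
    {'label': 'Chapter 1', 'depth': 1, 'parent': 0},
    {'label': 'Chapter 2', 'depth': 1, 'parent': 0},
]
toc = nest_entries(entries)
```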

My 2 cents,

KevinH

siebert
09-13-2011, 06:13 PM
This is probably old hat to siebert but is all new to me so if anyone has any ideas how to properly decipher the "type" value to map it to the fields that are stored there, it would certainly help.

First of all you have to decode the TAGX section for your index. I've documented that in the Wiki (http://wiki.mobileread.com/wiki/MOBI#TAGX_section).

Then you can decode the index entries with the tag table.

Each entry starts with the control byte(s) (the control byte count is defined in the meta index). Using the bit masks from the tag table you can decode which tags are in that index entry and how many entries of each tag.

A bit mask could theoretically contain more than two bits, but I've seen so far only one and two bit masks. If a two-bit mask is all set to 1, it doesn't mean 4 entries of that tag, but after the control byte(s) is another value defining how many entries of that tag are in the entry.

So the control bytes encodes 0, 1, 2, 3 or many entries.

The tag table also defines, how many values each tag has.

With that information you can get all values from an index entry. If you know the meaning of the tag, you can use the values to get the necessary information.

Example:

Control byte count is 1. The tag table has three entries:
0x08, 0x01, 0x03, 0x00 (tag 0x08 has one value and the bitmask 0b11)
0x0a, 0x02, 0x04, 0x00 (tag 0x0a has two values and the bitmask 0b100)
0x00, 0x00, 0x00, 0x01 (end of control byte indicator)

If the first byte of an index entry is 0b00000111, we do an AND operation with the first bitmask and see that the result is 0b11, meaning we must read the next byte to get the actual count of tag 0x08 entries. Let this value be 0x05.

Now we do an AND operation with the next mask and get the result 0b1, so we know that there is one 0x0a entry.

So we've already processed the first two bytes and must now read 5 variable length values for the 5 0x08 tags and 2 variable length values for the one 0x0a tag (as each 0x0a entry contains two values).

If the control byte is 0b00000010, we must read two variable length values for two 0x08 tags.

That's all :)

I hope it's now clear how to decode an index entry and that I didn't make any mistakes in my description.
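In rough Python, the control-byte decoding just described might look like this (a sketch only; the function names are made up, and the "fully set mask means the count follows" case is only flagged, not read):

```python
def count_set_bits(mask):
    return bin(mask).count('1')

def decode_control_byte(control, tag_table):
    """Sketch of decoding one control byte against a tag table of
    (tag, values per entry, bitmask, end flag) rows. Returns a list of
    (tag, count) pairs; a count of None means the mask was fully set,
    so the real count follows the control byte(s) as a separate value."""
    result = []
    for tag, num_values, mask, end_flag in tag_table:
        if end_flag:
            break
        value = control & mask
        if value == 0:
            continue                      # this tag is absent
        while not mask & 1:               # shift the masked bits down
            mask >>= 1
            value >>= 1
        if value == mask and count_set_bits(mask) > 1:
            result.append((tag, None))    # actual count stored separately
        else:
            result.append((tag, value))
    return result
```

With the example tag table above, the control byte 0b00000111 yields tag 0x08 with a separately-stored count plus one 0x0a entry, and 0b00000010 yields two 0x08 entries.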

As I've said before, the code for this handling is already available in mobiunpack and should be reusable for the ncx index handling.

Ciao,
Steffen

KevinH
09-13-2011, 09:41 PM
Hi seibert,

Thanks! That helps. I can now decipher the TAGX and find the bitmasks that are used to encode the record type information. I can guess at what each tag byte means, but that is only a guess. Is there any place that documents the meaning of each tag value, or did you have to reverse engineer them from the kindlegen program?



For the record, here is what we know/guess based on the work done so far:


Tag Decimal Meaning
0x01 01 position in the file for the link destination
0x02 02 length / size
0x03 03 title/label offset into CTOC
0x04 04 depth/level of heading (0 = toplevel, 1 = one level down, etc)
0x05 05 class/kind offset into CTOC
0x15 21 parent record number
0x16 22 first child record number
0x17 23 last child record number


which maps exactly to what calibre uses in its indexer.py:


class IndexEntry(object):

    TAG_VALUES = {
        'offset': 1,
        'size': 2,
        'label_offset': 3,
        'depth': 4,
        'class_offset': 5,
        'secondary': 11,
        'parent_index': 21,
        'first_child_index': 22,
        'last_child_index': 23,
        'image_index': 69,
        'desc_offset': 70,
        'author_offset': 73,
    }



So I guess we will have to work with that. We can try to modify the code to use your TAGX parsing routine to get the tag values and bit masks and then use those to decipher the "type" entry.

Thanks,

Kevin

DaleDe
09-13-2011, 09:48 PM
This is great interaction and development. I wonder if the dev hub available here would be better for the purpose.

pdurrant
09-14-2011, 03:50 AM
An interesting idea. I haven't really explored the dev hub.

siebert
09-14-2011, 04:41 AM
I can guess at what each tag byte means but that is only a guess. Is there any place that documents the meaning of each tag value or did you have to reverse engineer them from the kindlegen program?


The meaning has to be reverse engineered, but that should be easy compared to reverse engineering the index entry structure I described; it took me weeks to figure that out... now that I know it, it no longer appears that difficult, though ;)

Ciao,
Steffen

fandrieu
09-14-2011, 05:52 AM
Hi all,

Lots of new discoveries...still a lot of reverse engineering to do...
As siebert pointed out, the TAGX data indeed seems mandatory to correctly make sense of the INDX.

In the meantime I worked a little on the code, playing with multi-level toc books.
(In the first version I only tried flat tocs; I still haven't touched periodicals...)

* On the "making sense of the data" front, I started with the work done by KevinH in his test version and the only meaningful thing I added is the handling of type 0x7f entries.

They appear to be like 0x1f entries, but of "intermediary" level.

* Other than that I reformatted my ugly "TEST NCX" block.
It's now separated from the rest in a method called "parseINDX": it's more readable and easier to call elsewhere in unpackBook.
It also makes it possible to put a bunch of "if error return false" checks in a row instead of nesting ifs and ifs...

* I also added a DEBUG_NCX global option: it prints a lot of debug output and does nothing more than parse the NCX

* Finally, now that multi-level tocs are somehow parsed, I rewrote the "write the ncx file" code to support that.
A new "sortINDX" method re-orders the raw data in the same "flow" as in the NCX, keeping the "hlvl" info instead of forcing 1 as before...



Here is a zip file containing the new file.

I also included the source & mobi of the test book I worked on to find out about the 0x7f entries; it's just a python-generated dummy book (so no copyright problems) where I can set the toc depth.
The included file has 4 levels with 2 entries in each one; it was compiled with kindlegen and then stripped...

Nice work, and good luck with what's left to do...

siebert
09-14-2011, 07:54 AM
An interesting idea. I haven't really explored the dev hub.

It seems to support only subversion (*shudder*)...

As I've said before, I pushed my git repository to github and any fellow developer should feel free to create forks for their own development which can be merged if a feature is ready: https://github.com/siebert/mobiunpack

Ciao,
Steffen

KevinH
09-14-2011, 09:27 AM
Hi fandrieu,

Great work!

I will take a shot at combining your latest version with one that uses siebert's readTag routine to parse the TAGX (found in the indx0 section) and extract the field bitmasks for each tag. That way we can forget about all of the if type == 0x1f lines and just use the correct bitmasks to decipher which fields are present and then read them.

Thanks!

KevinH

KevinH
09-14-2011, 09:38 AM
Hi DaleDe,

This is great interaction and development. I wonder if the dev hub available here would be better for the purpose.

Interesting idea. Paul and I hosted a code.google.com site for mobiunpack.py but we received almost no contributions or input over the years. Siebert was the first new developer to come on and he found the source on this site (not our code.google.com dev site) and after his extensive changes he added his own git site.

Based on similar experiences from other small (couple of files only) dev projects, it appears to me that using development-specific hosting, with its own hurdle of a version control tool (git vs svn vs mercurial vs cvs vs rcs, etc.) and the lack of visits by users who might have an "itch to scratch", simply lowers contributions.

I think the same thing happens with users of both Sigil and Calibre. They are constantly pointed to other official sites but most of the impetus for change is done or initiated via MR.

So unless we are disrupting things with our posts, I would prefer to keep things here just to maximize our exposure to new users (and hopefully potential developers) who might want to contribute a new feature or quick fix.

My 2 cents ...

KevinH

fandrieu
09-14-2011, 10:07 AM
KevinH, sorry to flood the thread with zips, but here's a new version :)

I tried the NCX code against all the mobis I could lay my hands on...

The only "real" error I got was with really fat ebooks (technical books with more than a thousand entries), the INDX1 is splitted across more than one section !

I first added a few checks to prevent exceptions, but more importantly found out that the actual number of "data" INDX sections is stored in the INDX0.

So I modified the code to take this into account and parse multiple INDXx.
In the zip file you'll find a file for this test case, a dummy book with 4000 entries on 5 levels (that's a 600kb ncx...)

While I was at it, as suggested by siebert, I used his tagx code to parse the rest of INDX0, but it still doesn't do anything with the data.

Please use this version instead of the previous one if you plan on integrating the changes.

Thanks, fand.

PS: I also included the (simplistic) script I used to test the code on all my books, if anyone is interested...

DiapDealer
09-14-2011, 12:39 PM
Hi, the above (mobiunpack_testncx2.zip) test script isn't recognizing the ncx in most of my mobis. The multi-level stuff seems to be off by one. Any of my mobis that have a strictly flat ncx (one level), the script mistakenly reports as having "No ncx." And with a mobi that has a two-level ncx, the script builds a one-level (flat) ncx file... ignoring the parent level if an entry has a parent.

I may be wrong, but I seem to remember something about calibre flattening the ncx regardless. I'm not sure the Kindle properly handles a multi-level ncx file. Something about only the parent levels (and not the children) showing on the progress bar as "jump points" (which is the only useful function the ncx provides on a Kindle). I could be completely mistaken about all that, though... I'll have to do some testing.

fandrieu
09-14-2011, 01:28 PM
Hi, the above (mobiunpack_testncx2.zip) test script isn't recognizing the ncx in most of my mobis. The multi-level stuff seems to be off by one. Any of my mobis that have a strictly flat ncx (one level), the script mistakenly reports as having "No ncx." And with a mobi that has a two-level ncx, the script builds a one-level (flat) ncx file... ignoring the parent level if an entry has a parent.

I wouldn't be surprised if it's off by one; quite the contrary, I don't expect the code to be correct at this stage ;)
But for now I couldn't find a book to reproduce the problem; that's pretty weird, I'll look into it further...

I may be wrong, but I seem to remember something about calibre flattening the ncx regardless. I'm not sure the Kindle properly handles a multi-level ncx file. Something about only the parent levels (and not the children) showing on the progress bar as "jump points" (which is the only thing useful function the ncx provides on a Kindle). I could be completely mistaken about all that, though... I'll have to do some testing.

As far as I know you're right, and I completely agree that all this makes a multi-level NCX pretty useless for now.
But kindlegen does produce this kind of file, and my goal with this code was to extract as much from the mobi as possible, so that you can re-compile the files from mobiunpack into an as-identical-as-possible new mobi...

KevinH
09-14-2011, 03:22 PM
Hi All,

Okay, I took fandrieu's latest, modified it to pass the tagx info to the readINDX1 routine, and fixed an off-by-one in the code that sorts the NCX.

I think this should now be close.

PS: Actually I still think sortINDX has an off-by-one issue and my change may not be the correct one! My change fixed my problem but will probably fail for some other case. Recursion is so fun!

Either way it needs to be worked on and fixed. We should also re-factor things into classes and maybe even separate it into files that encapsulate the various functions in some smarter way.

DiapDealer
09-14-2011, 05:08 PM
I'm getting good results with these latest scripts. I'm still trying to find something in one of my books that breaks it, but I'm not having much luck. ;)

Either way it needs to be worked on and fixed. We should also re-factor things into classes and maybe even separate it into files that encapsulate the various functions in some smarter way.
I'm all for class-ifying, but if given a vote, I would rather that mobiunpack remain one self-contained script.

fandrieu
09-14-2011, 05:28 PM
I'm getting good results with these latest scripts. I'm still trying to find something in one of my books that breaks it, but I'm not having much luck. ;)

I just found a book with the same kind of problem:
calibre fetched a scheduled feed just as I was testing some files, so I tried the resulting "periodical" mobi and that was it :)

It seems the problem is with the INDX parsing. I got this output:

parsed INDX header:
len 192 nul1 0 type 1 gen 0 start 1256 count 54 code 4294967295 lng 4294967295 total 0 ordt 0 ligt 0
contextual data @ xB
DF 0 -1 1 6
contextual data @ x98
2 2 E2 -1 -1
contextual data @ x127
46 2 E2 -1 -1

which shows that from the second entry onwards everything is mangled.
There's actually an extra VWI in the first "DF" entry, so the rest is shifted.

I guess the right way to fix this is to use the TAGX data to know reliably what to expect in the entries.
In this particular case our current "type-based" rules might work if we took into account the differences between book & periodical style indexes... but I've yet to fiddle with that...

EDIT:
I missed KevinH's last post...
Thanks for the tagx code, I'll look into it.
And yes, there were some errors in the sortINDX code :( I actually (silently, out of shame ;)) reuploaded the zip earlier with >= replaced by > in the first test, plus other fixes :)

EDIT2:
tagx: pretty impressive, many thanks for quickly implementing this tagx bit I had skipped altogether :)
sortINDX: you got the second ">0" error but missed the one I mentioned above ;)
refactor: I was toying with the OOP approach before but didn't do it, to keep in sync with other versions; I have a mobiunpack_ootest.py somewhere...

pdurrant
09-14-2011, 06:04 PM
Bear in mind that calibre-generated Mobipocket files might not be valid in all instances, since the code was written with reverse-engineered info, not with documentation of the format.

KevinH
09-14-2011, 07:02 PM
Hi All,

Okay I merged the fixes that fandrieu made to his version (fixes to sortINDX, other changes) and added in a few other typo fixes and now I think we have a version we can use as the basis for public testing and as a basis for refactoring into classes while trying to keep to just one file.

Very nice work fandrieu!

mobiunpack_fand_updated2.zip is attached.

KevinH

DiapDealer
09-14-2011, 07:58 PM
The above script is slightly broken for MOBI's that have no NCX (when DEBUG_NCX is set to False). In that circumstance, the outncx variable is referenced before it's assigned in the unpackBook function. The <spine> element is also incorrect in the opf for a MOBI with no ncx file.

I made two small changes to the unpackBook function that make it work for MOBI's with no NCX. A quick diff will reveal the simple changes.

I'm having quite a bit of success with unpacking various books and rebuilding them with Kindlegen. :cool:

KevinH
09-14-2011, 08:06 PM
Hi DiapDealer,

Nice catch! I never actually tested it on a book without an NCX.
If your version seems to work for everyone, then we have one to release before we attempt the refactoring/adding of classes.

Thanks,

KevinH

The above script is slightly broken for MOBI's that have no NCX (when DEBUG_NCX is set to False). In that circumstance, the outncx variable is referenced before it's assigned in the unpackBook function. The <spine> element is also incorrect in the opf for a MOBI with no ncx file.

I made two small changes to the unpackBook function that make it work for MOBI's with no NCX. A quick diff will reveal the simple changes.

I'm having quite a bit of success with unpacking various books and rebuilding them with Kindlegen. :cool:

fandrieu
09-14-2011, 09:54 PM
Hehe, I didn't take the time to check your latest fixes (pretty late here), but you seem to have spotted the misplaced outncx=False line ;)

I just wanted to add another bit that troubled me:
I merged the (hopefully fixed) sortINDX & buildNCX functions, removing an "evolutionary" crutch, with the added bonus of correct indenting (though I didn't take much time to test it...)

siebert
09-15-2011, 10:42 AM
Hi,

I've looked into the latest source provided by fandrieu and the handling seems to take some shortcuts. I assume that the ncx index also contains an IDXT section; why don't you use it to find the start and end position of each entry, so you can verify that you've decoded all bytes?

The tag handling code will work only if all bitmasks are single bits. Is this always the case? I would then at least add an assertion which will fail for non-single bitmasks.

Ciao,
Steffen

KevinH
09-15-2011, 12:41 PM
Hi,

I've looked into the latest source provided by fandrieu and the handling seems to take some shortcuts. I assume that the ncx index also contains an IDXT section; why don't you use it to find the start and end position of each entry, so you can verify that you've decoded all bytes?

The tag handling code will work only if all bitmasks are single bits. Is this always the case? I would then at least add an assertion which will fail for non-single bitmasks.

Ciao,
Steffen

Hi Steffen,

If type & mask == mask, it should work whatever the bitmask is, assuming it is truly a mask (i.e. that all bits set in the mask are also set in the type value), since & is a bitwise operator.

If the tagx bitmask has more than one bit set, then that is captured by the mask.

What am I missing?

siebert
09-15-2011, 12:50 PM
If type & mask == mask, it should work whatever the bitmask is, assuming it is truly a mask (i.e. that all bits set in the mask are also set in the type value), since & is a bitwise operator.

[...]

What am I missing?

You're only testing whether all bits are set or not. For a 1-bit mask this is ok, as it can only encode the values 0 and 1.

With a two-bit mask the encoded values are 0, 1, 2 and 3 (but 3 means more than 2 and the real value is stored in a separate byte). The current code doesn't decode these values.

Ciao,
Steffen

KevinH
09-15-2011, 04:49 PM
You're only testing whether all bits are set or not. For a 1-bit mask this is ok, as it can only encode the values 0 and 1.

With a two-bit mask the encoded values are 0, 1, 2 and 3 (but 3 means more than 2 and the real value is stored in a separate byte). The current code doesn't decode these values.

Ciao,
Steffen

Hi Steffen,

I am not sure I understand

If I assume the following (note tag1 has more than 1 bit set in its mask)

tag 1 has bitmask 0x03 and requires 1 value be read in as field 1
tag 2 has bitmask 0x01 and requires 1 value be read in as field 2
tag 3 has bitmask 0x02 and requires 1 value be read in as field 3
tag 4 has bitmask 0x08 and requires 1 value to be read in as field 4

And if type == 0x07:

I would read in the first value as field 1, next value as field 2, next value as field 3 and no further values would be read in for this particular entry since the bitmask & type != bitmask for tag 4.


I think you are saying this is wrong ...

I think you are saying that a bitmask with two bits set means something different from how I am interpreting it?

If so, via a concrete example, could you explain how a bitmask with more than 1 bit set should be interpreted if it is not as I had assumed above.

Thanks!

Kevin

siebert
09-15-2011, 05:20 PM
Hi Steffen,

I am not sure I understand

If I assume the following (note tag1 has more than 1 bit set in its mask)

tag 1 has bitmask 0x03 and requires 1 value be read in as field 1
tag 2 has bitmask 0x01 and requires 1 value be read in as field 2
tag 3 has bitmask 0x02 and requires 1 value be read in as field 3
tag 4 has bitmask 0x08 and requires 1 value to be read in as field 4

And if type == 0x07:

I would read in the first value as field 1, next value as field 2, next value as field 3 and no further values would be read in for this particular entry since the bitmask & type != bitmask for tag 4.


0x07 = 0b00000111
Tag 1:
0x03 = 0b00000011
0x07 AND 0x03 = 0x03
This would mean that we have 3 values of tag 1. But as I've said, for multi-bit masks a result of all ones (like in this case) means the real number can be anything > 2, and you have to read one byte (or a multibyte value, I don't remember which) to get the real number of tag 1 values.

If type would be 0x06 instead of 0x07:
0x06 AND 0x03 = 0x02
This would mean we have 2 values of tag 1

Tag2:
I'm confused. The mask 0x01 collides with the mask 0x03. It's not possible to have a tag with mask 0x01 and another with mask 0x03 in the same control byte.

The control byte works as follows. You have one byte (8 bits) and want to encode the number of tag values for several tags with these 8 bits.

If a tag can occur only once, you need one bit. If it can occur several times, you need more bits. All masks I've seen so far had a maximum of 2 bits.

Let's say you have 3 tags with one bit and one tag with two bits; then you should see the following masks:

0b00000001 = 0x01 for tag1
0b00000010 = 0x02 for tag2
0b00000100 = 0x04 for tag3
0b00011000 = 0x18 for tag4

A control byte of 0x15 would then decode as:
0b00010101

1 * tag 1
0 * tag 2
1 * tag 3
2 * tag 4

I hope it's now clear what I mean.

Ciao,
Steffen
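Steffen's control-byte scheme can be sketched in a few lines. This is illustrative code written for this discussion, not mobiunpack's actual implementation; the tag/mask table is the hypothetical one from his example.

```python
# Decode a TAGX control byte as described above: each tag owns a
# contiguous group of bits in the byte, and the value in that group
# is the number of times the tag occurs in the entry.

def decode_control_byte(ctrl, tags):
    """tags: list of (tag_id, bitmask); mask bits must be contiguous."""
    counts = {}
    for tag_id, mask in tags:
        shift = (mask & -mask).bit_length() - 1  # position of lowest set bit
        value = (ctrl & mask) >> shift
        # Caveat: an all-ones value in a multi-bit mask means the real
        # count follows separately (not handled in this sketch).
        counts[tag_id] = value
    return counts

tags = [(1, 0x01), (2, 0x02), (3, 0x04), (4, 0x18)]
counts = decode_control_byte(0x15, tags)  # 0x15 = 0b00010101
# counts == {1: 1, 2: 0, 3: 1, 4: 2}, matching the worked example
```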

KevinH
09-15-2011, 06:52 PM
Hi Steffen,

Yes, thanks! That is much clearer.

An NCX entry can never have more than one parent, and can have only one position, one class, and one length; and although it can have many children, the children are actually indicated by two values that form a range: the first is the record number of the first child of this ncx entry, and the second is the record number of the last child.

So it appears that 1 bit is only ever needed. I would guess your inflection dictionaries are much much more complicated than the ncx entries.

So because of the structure of the fields, I believe that multi-bit masks as you describe below are never used.

And I agree we should at least run a test and warn if multi-bit masks are ever found in the NCX code.

Thanks!

Kevin


Let's say you have 3 tags with one bit and one tag with two bits than you should see the following masks:

0b00000001 = 0x01 for tag1
0b00000010 = 0x02 for tag2
0b00000100 = 0x04 for tag3
0b00011000 = 0x18 for tag4

A control byte of 0x15 would then decode as:
0b00010101

1 * tag 1
0 * tag 2
1 * tag 3
2 * tag 4

I hope it's now clear what I mean.
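As an aside, flat entries of the shape Kevin describes (each carrying at most one parent reference) are straightforward to turn back into a tree. The sketch below is hypothetical illustration for this discussion using only the parent field, not code from mobiunpack, and the entry dicts are invented.

```python
# Nest flat NCX-style entries using their parent references.

def nest_entries(entries):
    """entries: list of dicts; 'parent' is an index into the list."""
    for entry in entries:
        entry['children'] = []
    roots = []
    for entry in entries:
        parent = entry.get('parent')
        if parent is None:
            roots.append(entry)
        else:
            entries[parent]['children'].append(entry)
    return roots

flat = [
    {'label': 'Part I'},
    {'label': 'Chapter 1', 'parent': 0},
    {'label': 'Chapter 2', 'parent': 0},
]
tree = nest_entries(flat)
# tree[0]['children'] lists both chapters, in order
```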

DiapDealer
09-15-2011, 07:22 PM
Hehe, i didn't take the time to check your latest fixes (pretty late here), but you seem to have spotted the misplaced outncx=False line ;)

I just wanted to add another bit that troubled me:
I merged the (hopefully fixed) sortINDX & buildNCX functions, removing an "evolutionary" clutch with the added bonus of correct indenting (but didn't take much time to test it though...)
Your latest (mobiunpack_testncx_onemore.zip) still has a bit of a bug if the mobi doesn't have an ncx. You're building the opf so that the spine always indicates the toc="ncx" bit. Like so (line 1540):
if outncx:
    outncxbasename = os.path.basename(outncx)
    data += '<item id="ncx" media-type="application/x-dtbncx+xml" href="'+outncxbasename+'"></item>\n'
data.append('</manifest>\n<spine toc="ncx">\n<itemref idref="item1"/>\n</spine>\n<tours>\n</tours>\n')

What it needs to be is:
if outncx:
    outncxbasename = os.path.basename(outncx)
    data += '<item id="ncx" media-type="application/x-dtbncx+xml" href="'+outncxbasename+'"></item>\n'
    data.append('</manifest>\n<spine toc="ncx">\n<itemref idref="item1"/>\n</spine>\n<tours>\n</tours>\n')
else:
    data.append('</manifest>\n<spine>\n<itemref idref="item1"/>\n</spine>\n<tours>\n</tours>\n')

KevinH
09-15-2011, 08:37 PM
Hi All,

To prevent duplication of effort ...

Does anyone here want to take a shot at refactoring and/or adding classes? I would assume class-wise we could have an NCX-related class, an OPF-related class, and Dictionary-related classes (and the NCX and Dictionary could share TAGX and INDX classes if need be), and try to clean up the code, shrink it wherever possible, and make sure the routines that obviously belong to a class get encapsulated by that class.

Any Takers?

KevinH

siebert
09-16-2011, 01:32 AM
Hi Steffen,

Yes, thanks! That is much clearer.

[...]

So it appears that 1 bit is only ever needed. I would guess your inflection dictionaries are much much more complicated than the ncx entries.


I think we should follow the DRY (don't repeat yourself) principle.

The getTagMap() function should already do everything needed to decode an index entry (including multi-bit masks), so I suggest using it also for ncx index handling.

Ciao,
Steffen

fandrieu
09-16-2011, 06:03 AM
Bear in mind that calibre-generated Mobipocket files might not be valid in all instances, since the code was written with reverse-engineered info, not with documentation of the format.

Absolutely. By the way, does anyone know a way to spot calibre-generated books / identify the book generator?


I assume that the ncx index also contains an IDXT section; why don't you use it to find the start and end position of each entry, so you can verify that you've decoded all bytes?

I was thinking about what's left to parse, and so the IDXT section at the end of INDX0 came to mind.

I didn't know what was in there when I started this; if I understand you right, it contains the positions of the actual index entries in INDX1, is that it?

If someone hasn't already done it, I'll look into it...


Your latest (mobiunpack_testncx_onemore.zip) still has a bit of a bug if the mobi doesn't have an ncx. You're building the opf so that the spine always indicates the toc="ncx" bit.

You're absolutely right, and the worst is I knew about it but had never fixed it :(
That's what happens when you release proof-of-concept code in haste :)

...

About the TAGX / mask stuff thanks for all your input and shaping this code into something correct, I knew too little about the mobi format to figure that out quickly...

About the refactor bit: go ahead KevinH if you feel like it; perhaps you could share here a skeleton of the class structure, so that others can comment on / improve it?

KevinH
09-16-2011, 08:48 AM
Hi,


About the refactor bit: go ahead KevinH if you feel like it; perhaps you could share here a skeleton of the class structure, so that others can comment on / improve it?

Actually classes have restarted, and my teaching/research takes up most of my free time now, so I was actually fishing for someone else to take over that duty!

KevinH

fandrieu
09-16-2011, 09:11 AM
As siebert suggested, I modified the code to use the IDXT data in INDX1 to find the entries.

Before, we relied exclusively on the TAGX data and assumed that after parsing an entry we would be correctly positioned at the start of the next one.

Now the offsets found in the IDXT are used to (I hope) accurately find each entry: there should be no more "positioning" bugs.

(Note that to do this I chose to pass the whole INDX section to parseINDX1 (before, it was only the "navdata" part) so that the offsets in IDXT can be used verbatim...)
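Roughly, the idea looks like this. It is a hedged sketch based on the reverse-engineered INDX layout discussed in this thread (an 'IDXT' magic followed by big-endian 16-bit entry offsets); the function names and exact field offsets are assumptions, not mobiunpack's actual code.

```python
import struct

# Read the IDXT table at the end of an INDX section and pair
# consecutive offsets so each index entry gets an exact (start, end)
# span; parsing then no longer depends on the previous entry's
# parsed length. Some files append a spurious null offset, so the
# header's entry count decides how many offsets to read.

def idxt_offsets(indx, idxt_start, entry_count):
    assert indx[idxt_start:idxt_start + 4] == b'IDXT'
    pos = idxt_start + 4
    offsets = []
    for _ in range(entry_count):
        offset, = struct.unpack_from('>H', indx, pos)
        offsets.append(offset)
        pos += 2
    return offsets

def entry_spans(indx, idxt_start, entry_count):
    # the IDXT table itself marks the end of the last entry
    offs = idxt_offsets(indx, idxt_start, entry_count) + [idxt_start]
    return list(zip(offs, offs[1:]))
```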

....

I also fixed "buildNCX" to correctly (I guess) set "playOrder" and "dtb:depth".

....

Also, I didn't mention it before, but in my previous version I introduced some code to include the "filepos" values found in the INDX in the "anchor" algorithm.
Without that, if a link is found only in the NCX and not in the html itself, the corresponding anchor would be missing.

...

@KevinH: I'll try to look into it over the week end if possible....

kovidgoyal
09-16-2011, 12:36 PM
I just came across this thread. Some tips:

1) Do not use the INDX entries to build an NCX. INDX entries can have a maximum depth of two for books and 3 for periodicals. This is a limitation of the MOBI format. Instead parse the inline TOC, calculate the left indents and reconstruct the NCX from that. See code in mob/reader.py in calibre to do that.

2) If you still want to decompile indx entries and are looking at calibre code to figure out index entries:

a) note that currently indexer.py does not generate depth 2 indx entries for books, primarily because I got tired of figuring out the TBS indexing for depth two book nodes.
b) you should look at the code in mobi/debug.py which is designed to decompile arbitrary MOBI files including the index and TBS information. You can run that code with calibre-debug --inspect-mobi filename.mobi

DiapDealer
09-16-2011, 01:34 PM
Hi fandrieu,

I'm getting consistent results with many, many books with the latest script (from post #206), so thanks for your efforts!

I wanted to point out one potential problem area, though. Line 1036:
entry = re.sub('^', indent, entry, 0, re.M)
The above code will only work with python 2.7. If you want to include 2.6 and 2.5 users, consider replacing that line with this compatible code:
entry = re.sub(re.compile('^', re.M), indent, entry, 0)
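For anyone curious why this works, a quick self-contained check of the portable form (plain illustrative Python, not script code):

```python
import re

# re.sub() only grew its flags argument in Python 2.7, so the portable
# spelling bakes re.M into a pre-compiled pattern. With re.M, '^'
# matches at the start of every line, prepending the indent to each.
indent = '    '
entry = 'first line\nsecond line'
result = re.sub(re.compile('^', re.M), indent, entry)
assert result == '    first line\n    second line'
```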

KevinH
09-16-2011, 02:17 PM
Hi,

And for my own personal sanity please change one other little thing:

print "Wite NCX"

to

print "Write NCX"

;-)

Thanks!

Kevin

Hi fandrieu,

I'm getting consistent results with many, many books with the latest script (from post #206), so thanks for your efforts!

I wanted to point out one potential problem area, though. Line 1036:
entry = re.sub('^', indent, entry, 0, re.M)
The above code will only work with python 2.7. If you want to include 2.6 and 2.5 users, consider replacing that line with this compatible code:
entry = re.sub(re.compile('^', re.M), indent, entry, 0)

KevinH
09-16-2011, 02:25 PM
Hi Kovid,

I just came across this thread. Some tips:
1) Do not use the INDX entries to build an NCX. INDX entries can have a maximum depth of two for books and 3 for periodicals. This is a limitation of the MOBI format. Instead parse the inline TOC, calculate the left indents and reconstruct the NCX from that. See code in mob/reader.py in calibre to do that.


I think the idea is to eventually use "mobiunpack.py" as a way for people to take mobis generated by KindleGen, unpack them making the fewest changes possible, allow the user to make whatever changes they want, and then pass the whole thing back through KindleGen to get back a mobi.

So I think the idea is to generate the NCX that is stored inside the mobi and pass it back in so that it gets regenerated in exactly the same way.

Thus the idea to look at the internal ncx rather than parsing the content to create one of our own.


2) If you still want to decompile indx entries and are looking at calibre code to figure out index entries:

a) note that currently indexer.py does not generate depth 2 indx entries for books, primarily because I got tired figuring out the TBS indexing for depth to book nodes.


I used your indexer.py code to verify what the tag values are and what they mean (parent, first_child, last_child, class, etc.). Our code already handles reading in depth 2 for ebooks (tested with books from Kindlegen, etc.). But I have not tried it with a periodical at all.


b) you should look at the code in mobi/debug.py which is designed to decompile arbitrary MOBI files including the index and TBS information. You can run that code with calibre-debug --inspect-mobi filename.mobi


Great tool. Will do.

Thanks,

KevinH

kovidgoyal
09-16-2011, 02:29 PM
Ah, you want to only allow editing of the content, not the TOC...then yes, decompiling the NCX from the INDX records is fine.

siebert
09-16-2011, 03:29 PM
Absolutely, btw does anyone knows a way to spot calibre-generated books / identify the book generator ?


The EXTH value 204 (creator software) is set by the official tools (kindlegen, mobigen, mobipocket creator) with distinct values, but unfortunately the latest calibre versions also set this value, pretending to be kindlegen.

I asked Kovid to remove that "feature", but he refused to do so.

So the only way to find out whether a mobi file was created by calibre is to search the contributor EXTH record (value 108) for the string "calibre".

Ciao,
Steffen
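A sketch of that check follows. The EXTH record layout here (magic, total length, record count, then records of 4-byte type, 4-byte length including the 8-byte record header, and data) follows the reverse-engineered format notes, so treat the offsets as assumptions; the sample record values are invented.

```python
import struct

# Walk an EXTH block and report whether any contributor record
# (type 108) mentions calibre.

def made_by_calibre(exth):
    assert exth[:4] == b'EXTH'
    _length, count = struct.unpack_from('>LL', exth, 4)
    pos = 12
    for _ in range(count):
        rec_type, rec_len = struct.unpack_from('>LL', exth, pos)
        data = exth[pos + 8:pos + rec_len]
        if rec_type == 108 and b'calibre' in data:
            return True
        pos += rec_len
    return False

def exth_block(records):
    """Build a minimal EXTH block from (type, data) pairs, for testing."""
    body = b''.join(struct.pack('>LL', t, 8 + len(d)) + d
                    for t, d in records)
    return b'EXTH' + struct.pack('>LL', 12 + len(body), len(records)) + body

# invented sample values
assert made_by_calibre(exth_block([(108, b'calibre 0.8.x')]))
assert not made_by_calibre(exth_block([(204, b'kindlegen')]))
```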

DiapDealer
09-17-2011, 02:05 AM
Hey fandrieu,

I can't find where it's happening for the life of me, but something in your code is doubling the html output. It doesn't happen with books that have no ncx, but if a book does have ncx data, the html file produced is the original html x 2: the whole book, and then the whole book all over again. ;)

Maybe you'll have better luck spotting it.

fandrieu
09-17-2011, 10:09 AM
@kovid: thanks very much for your great app!

@siebert: thanks for the info, I'll look into it.

I can't find where it's happening for the life of me, but something in your code is doubling the html output.

I found a file to debug that :)
Here's a fixed version including the typos mentioned earlier.

...

The bug was a weird side effect of the "ncx filepos anchor injection" code.
The duplication occurred because an entry had -1 as its filepos, meaning an extra-large "dataList".

The cause was the IDXT parsing: some files have an extra null entry at the end.

I added two checks to prevent both problems.

...

REUP: after posting that I realized it'd be better to use header['count'] to determine when to stop parsing IDXT...

kaizoku
10-17-2011, 07:31 PM
I am getting these errors with mobi:

Unpacking Book...
Mobipocket version 6
Warning: Unknown metadata with id 405 found
Warning: Unknown metadata with id 406 found
Warning: Unknown metadata with id 407 found
Warning: Unknown metadata with id 403 found
Write ncx
Palmdoc compression
Unpack raw html
Decode images
Find link anchors
Insert data into html
Insert hrefs into html
Remove empty anchors from html
Insert image references into html
Write html
Write opf
Completed

But no dat was created, just the ncx, opf and html.

pdurrant
10-18-2011, 03:34 AM
I am getting these errors with mobi:

Unpacking Book...
Mobipocket version 6
Warning: Unknown metadata with id 405 found
Warning: Unknown metadata with id 406 found
Warning: Unknown metadata with id 407 found
Warning: Unknown metadata with id 403 found

[snip]

But no dat was created, just the ncx, opf and html.

Those aren't errors, just warnings. The data in those EXTH sections will be in the opf inside some comments, as far as I remember.

Doitsu
10-24-2011, 06:16 PM
While the tool works great with regular books, it seems to have problems with some dictionaries. When I unpacked my simple Swedish-English dictionary (http://www.mobileread.com/forums/showthread.php?t=133797), which I uploaded some time ago, I noticed that the <DictionaryInLanguage> value was not correctly recovered. My original .opf file contained the following entries:

<DictionaryInLanguage>sv</DictionaryInLanguage>
<DictionaryOutLanguage>en-us</DictionaryOutLanguage>

However, the reverse engineered .opf file contained the following entries:

<DictionaryInLanguage>en</DictionaryInLanguage>
<DictionaryOutLanguage>en-us</DictionaryOutLanguage>

Since the <DictionaryInLanguage> value is used by the Kindle for automatic dictionary selection, a wrong value will cause problems.

I also got a lot of "Delete operation of inflection rule failed" error messages, but I remember having read somewhere that there are still problems with inflections.

pdurrant
10-25-2011, 02:57 AM
While the tool works great with regular books, it seems to have problems with some dictionaries. When I unpacked my simple Swedish-English dictionary (http://www.mobileread.com/forums/showthread.php?t=133797), which I uploaded some time ago, I noticed that the <DictionaryInLanguage> value was not correctly recovered. My original .opf file contained the following entries:

<DictionaryInLanguage>sv</DictionaryInLanguage>
<DictionaryOutLanguage>en-us</DictionaryOutLanguage>

However, the reverse engineered .opf file contained the following entries:

<DictionaryInLanguage>en</DictionaryInLanguage>
<DictionaryOutLanguage>en-us</DictionaryOutLanguage>

Since the <DictionaryInLanguage> value is used by the Kindle for automatic dictionary selection, a wrong value will cause problems.

I also got a lot of "Delete operation of inflection rule failed" error messages, but I remember having read somewhere that there are still problems with inflections.

Thanks for the bug report. Hopefully the guys working on the dictionary support can fix it up.

pdurrant
10-27-2011, 05:46 AM
Thanks for the bug report. Hopefully the guys working on the dictionary support can fix it up.

Well, actually I took a quick look and added a language entry to fix that problem. There's still some work to be done on decoding the inflection rules. I'm sure that "Error: Delete operation of inflection rule failed" needs to be fixed. Perhaps someone with the source to a dictionary with inflections could have a go?

Anyway, I've uploaded version 0.32 to the fifth post in this thread. It includes some refactoring by DiapDealer, which will hopefully make maintenance easier.

siebert
10-27-2011, 06:01 AM
There's still some work to be done on decoding the inflection rules. I'm sure that "Error: Delete operation of inflection rule failed" needs to be fixed. Perhaps someone with the source to a dictionary with inflections could have a go?

The delete rule contains the letter which should be deleted. The error is given if that letter doesn't match the current letter of the word at the position where the deletion should be performed.

Normally this shouldn't happen, but for non-ascii letters the letters don't match. I assume that the text encodings of the rules and the actual text are different. But I have no idea what encoding is used in the rules.

I could provide the source of a minimal dictionary which shows the error if needed.

Ciao,
Steffen

pdurrant
10-27-2011, 07:00 AM
I could provide the source of a minimal dictionary which shows the error if needed.


If you could do that (source and compiled dictionary, if possible), I'll take a look. Thanks.

siebert
10-27-2011, 05:36 PM
Ok, here is the promised sample dictionary. The first entry creates the error.

78235

Ciao,
Steffen

pdurrant
10-28-2011, 08:08 AM
Ok, here is the promised sample dictionary. The first entry creates the error.

In one of the places in the sample dictionary, the German double-s ß is stored as 0x0573 instead of as 0xDF (the Windows Latin-1 encoding for ß).

A quick hack at the right place to substitute 0xDF back in for 0x0573 fixes things for this instance.

Unfortunately, I don't really understand why the error is happening, and it isn't a general fix: there are still problems with the Swedish dictionary mentioned above.

Perhaps with access to the source for the swedish dictionary, it might be possible to work out what's going on.

Doitsu
10-29-2011, 06:59 AM
Perhaps with access to the source for the swedish dictionary, it might be possible to work out what's going on.

I did some tests and found out that the script seems to stumble over inflection entries with both a hyphen and an umlaut in them. For example:

<idx:infl><idx:iform value="abc-böckers"/></idx:infl>

Please find attached a small sample of the Swedish dictionary whose first entry will cause 4 Error: Delete operation of inflection rule failed messages when the .prc file is unpacked.

Unfortunately, there seem to be other serious issues with accented characters which you'll see when you look at the original and the reconstructed .html files.

Even though the reconstructed dictionary looks the same as the original when it's compiled, it no longer works as a dictionary.

siebert
10-29-2011, 07:14 AM
Even though the reconstructed dictionary looks the same as the original when it's compiled, it no longer works as a dictionary.

What exactly do you mean with "no longer works as a dictionary"?

In mobipocket reader, a dictionary uses javascript to implement dictionary search. This might indeed not work.

My focus was to use the recompiled dictionary in the kindle app (as the formatting of my original dictionary made it unsuitable for the popup dictionary window), which should work, as the kindle app only uses the dictionary index and doesn't support javascript (so I removed the javascript code before recompiling the dictionary).

Ciao,
Steffen

Doitsu
10-29-2011, 07:38 AM
What exactly do you mean with "no longer works as a dictionary"?
It no longer works as a lookup dictionary. I.e. it should work exactly as the original.

My focus was to use the recompiled dictionary in the kindle app [...]
I assumed the objective of the mobiunpack.py developers was to re-create the original source files as faithfully as possible, so that you could theoretically unpack a dictionary, correct an entry, and recompile it without any loss of functionality. Currently this doesn't seem to work.

I know that the Kindle app doesn't allow users to select user dictionaries anyway, but it is possible to patch the ASIN number of a user dictionary so that it matches the ASIN of one of the 5 official dictionaries.

IMHO, it doesn't make much sense to convert a dictionary to a plain Mobipocket ebook, because the user loses the dictionary functionality.

[...] so I removed the javascript code before recompiling the dictionary).
AFAIK, Javascript code is not required in Mobipocket dictionary .html source files and wasn't present in my dictionary .html source file before it was compiled.

Please have a look at the original .html source file and the one that the script re-creates and you'll see that they differ significantly and I'm not talking about whitespace characters and line-breaks.

siebert
10-29-2011, 08:27 AM
It no longer works as a lookup dictionary.


You still didn't reveal which application you use to view the dictionary...


I assumed the objective of the mobipunpack.py developers was to re-create the original source files as good as possible so that you could theoretically unpack a dictionary correct an entry and recompile it without any loss of functionality.


Yes, this is the final goal. We know that it's not reached yet. A version number < 1.0 might give the hint that we don't see the script as finished ;)


Currently this doesn't seem to work.


My dictionary itch has been scratched by the existing dictionary support I've implemented. I'm aware that there are several things left to be done, but as it works for me, it probably takes someone else to finish the support (but I'm willing to help as time permits).


I know that the Kindle app doesn't allow users to select user dictionaries anyway, but it is possible to patch the ASIN number of a user dictionary so that it matches the ASIN of one of the 5 official dictionaries.


Yep, that's correct and what I've done to use the optimized dictionary.

By the way, I was very surprised to see that the unmodified dictionary works great on my new Kindle 3 (keyboard); it seems that the kindle firmware removes unnecessary formatting when displaying a dictionary entry in the popup window, while the kindle app doesn't.


IMHO, it doesn't make much sense to convert a dictionary to a Mobipocket ebook because the user loses the dictionary functionality.


Converting from what?


Please have a look at the original .html source file and the one that the script re-creates and you'll see that they differ significantly and I'm not talking about whitespace characters and line-breaks.

I haven't done that yet. Can you give some examples of what's different? If you could find out what has to be fixed to make it work, someone (me?) might fix the mobiunpack script.

Ciao,
Steffen

Doitsu
10-29-2011, 10:08 AM
You still didn't reveal which application you use to view the dictionary...
Because it doesn't really matter. I use dictionaries primarily on my Kindle 3 and with Mobipocket Reader. I also use the Kindle app on my iPhone.

A version number < 1.0 might give the hint that we don't see the script as finished ;)
I'm well aware of the fact that reverse engineering takes time and never said that I expected a perfect script.

Converting from what?
The reverse engineered source files.

Can you give some examples of what's different?
I believe it would be much easier and faster if you simply had a look at the source files. Since my very simple proof-of-concept .html source file only contains 7 dictionary definitions, it shouldn't be too complicated.

Keep up the good work!

sourcejedi
12-10-2011, 03:56 PM
[If this should be a new thread, please do ask mods to move it]

This is not a support request. Just to let you know I noticed a round-trip failure using mobiunpack, kindlegen 1.2 for linux, and a Mobipocket edition of one of the Young Wizards books. I'm curious whether this is a known bug.

I unpacked it, edited the "HTML", and invoked Kindlegen on the OPF file. (That's generally expected to work, right?) No problem so far; FBReader seemed happy with the new MOBI file.

But then I tried to verify it by unpacking the new MOBI and checking for differences. This happened -
<p height="0pt" width="0pt" align="justify"><a filepos=0000008568 ><font color="blue"><u>Consultations</u></font></a></p>

i.e. a number of links are output as filepos= instead of href= - here's the original:
<p height="0pt" width="0pt" align="justify"><a href="#filepos8519"><font color="blue"><u>Consultations</u></font></a></p>

I double-checked the new MOBI in FBReader, and that specific link is working fine. mobiunpack does seem to find the matching anchor; it's just the links that have gone weird.
<mbp:pagebreak/></div><div><a id="filepos8568" /><a id="filepos8568" />
<p height="1em" width="0pt" align="center"><font size="5"><b><font color="red"> Consultations</font></b></font></p>

ISTR hearing that having multiple anchors next to each other in MobiPocket can be bad news... I think mobiunpack generates them because there are two links to the same location (from two different tables of contents)... but if that were the problem, I'd have thought it would show up as KindleGen dying, or a loss of functionality in the MOBI file, which hasn't happened...

FULL DISCLOSURE. The original MOBI also includes some "dead links" (href="../Text/#filepos6634"). After the round-trip, these appear as filepos=XXXXXXXX. So, it's possible these dead links are confusing mobiunpack, although I'm not sure how. [KindleGen warns "Warning(prcgen): Hyperlink not resolved", but continued anyway. I don't see any other warnings. Ideally mobiunpack would provide a similar warning during unpacking, so you can tell something odd has happened.]

Second disclosure. From the above evidence, I believe that the "original" MOBI has already gone through at least one MOBI->EPUB->MOBI conversion. (Presumably edited in Sigil in between). I have a copy of what I assume is the EPUB version. The EPUB also has "calibre" written all over it (class="calibre"). So it's quite possible the MOBI I started with was generated by Calibre's reverse-engineered code, as opposed to the official MobiPocket/Kindle conversion code.

pdurrant
12-11-2011, 11:30 AM
[If this should be a new thread, please do ask mods to move it]

This is not a support request. Just to let you know I noticed a round-trip failure using mobiunpack, kindlegen 1.2 for linux, and a Mobipocket edition of one of the Young Wizards books. I'm curious whether this is a known bug.


Whether this is a problem in MobiUnpack, KindleGen, or the original file will take quite a lot of detective work.

The first thing to do would be to enable the raw output in MobiUnpack, and see if the duplicate destination markers are present in that.

Looking at the raw output will also help to check whether the problem happens in Mobiunpack (in the conversion to HTML links) or in KindleGen.

When I have some spare time, I might take a look at this, but I can't at the moment. It sounds like you're a pretty good hand at this - why not continue the investigative work yourself?

Oh - and one thing to do would be to continue the Mobiunpack/KindleGen/Mobiunpack sequence a few times, and see if things keep on changing and getting worse.

sourcejedi
12-11-2011, 01:00 PM
Done. [Attached zip: mobiunpack.py for testers; patch for developers].

You probably couldn't see the problem in the html I posted even if you tried, because I foolishly neglected to use CODE tags. The real problem was an extra space character between "<a" and "filepos=".

mobiunpack doesn't say anything about "filepos=XXXXXXXX", so that must have come from KindleGen. (Although it could still be useful to warn about non-numeric filepos values).

DiapDealer
12-11-2011, 01:33 PM
I see you've patched the 0.29 version of mobiunpack.py. Is that the version you were using when you discovered the issue?

I only ask because v0.32 of mobiunpack.py (the latest can always be found in post #5 (http://www.mobileread.com/forums/showpost.php?p=774836&postcount=5) of this thread) seems to have an updated regex pattern that would seem to achieve the same result as the regex in your patch:

From v0.32
link_pattern = re.compile(r'''<[^<>]+filepos=['"]{0,1}(\d+)[^<>]*>''', re.IGNORECASE)

From your patch
link_pattern = re.compile(r'''<a[ ]+filepos=['"]{0,1}0*(\d+)['"]{0,1} *>''', re.IGNORECASE)

Have you tried v0.32 to see if this issue might be a non-starter?

sourcejedi
12-11-2011, 04:40 PM
Sorry, yes. 0.32 from this thread works correctly. I was using the version from Siebert's git repo which describes itself as 0.29.
Thanks for pointing it out.

I'm probably used to assuming 'git' means 'the latest version'. But that's not true in general, and I should have said where I got the program from.

[Nitpick: I think you quoted the wrong link_pattern - there's two of them, and the first appears unchanged. The relevant one has your name next to it in 0.32 :).

# Two different regex search and replace routines.
# Best results are with the second so far IMO (DiapDealer).

#link_pattern = re.compile(r'''<a filepos=['"]{0,1}0*(\d+)['"]{0,1} *>''', re.IGNORECASE)
link_pattern = re.compile(r'''<a\s+filepos=['"]{0,1}0*(\d+)['"]{0,1}(.*?)>''', re.IGNORECASE)
#srctext = link_pattern.sub(r'''<a href="#filepos\1">''', srctext)
srctext = link_pattern.sub(r'''<a href="#filepos\1"\2>''', srctext)
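
For anyone following along, here's a minimal standalone demo of the difference between the two patterns (the input line is hypothetical, modelled on the filepos output quoted earlier in the thread):

```python
import re

# Old pattern: requires exactly one literal space between "<a" and "filepos="
old_pattern = re.compile(r'''<a filepos=['"]{0,1}0*(\d+)['"]{0,1} *>''', re.IGNORECASE)
# v0.32 pattern: \s+ tolerates any run of whitespace, and group 2 preserves trailing attributes
new_pattern = re.compile(r'''<a\s+filepos=['"]{0,1}0*(\d+)['"]{0,1}(.*?)>''', re.IGNORECASE)

# Hypothetical input with an extra space after "<a"
src = '<a  filepos=0000008568 >Consultations</a>'

print(old_pattern.sub(r'<a href="#filepos\1">', src))    # left unchanged: no match
print(new_pattern.sub(r'<a href="#filepos\1"\2>', src))  # converted to an href link
```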

DiapDealer
12-11-2011, 05:29 PM
[Nitpick: I think you quoted the wrong link_pattern - there's two of them, and the first appears unchanged. The relevant one has your name next to it in 0.32 :)

That's not nitpicking... that's flat-out busting me for taking such a cursory glance at the code. :D

KevinH
12-16-2011, 10:18 AM
Hi All,

You should check out the following links to get copies of the new amazon k8 format files to play around with and test with:

http://www.the-digital-reader.com/2011/12/13/kindle-format-8-demo-now-available/

I grabbed the Jerome.mobi and tried unpacking it via mobiunpack.py with all DEBUG turned on.

It seems that Amazon have simply combined two different mobi ebooks into one palm doc container.

The one at the top is simply the normal mobi and mobiunpack works well on it but it generates extra raw pieces. You can find all of these extra raw pieces hidden away as image*.raw files inside the images folder. These include FONT and RESC files plus copies of each section in its own file until the end of the palm doc. So by examining these extra image*.raw files in a text editor we can see what each section of the palmdoc contains.

Immediately after the normal mobi ebook (in the very next section) you can find a whole section that appears to be nothing but the word "BOUNDARY" which seems to be the divider between the older .mobi file format and the new format.

It is followed by what looks like a new section 0 mobi header, and that is followed by all of the raw .xhtml in each section until the end (but unlike true image sections, these have been compressed, so we will need to uncompress them to see what the new xhtml looks like). So the old-format mobi is at the top of the palmdoc container, and immediately after the images and the FLIS and FCIS records (the images appear to be shared by both versions of the ebook) you can see the pieces that make up the new format.

So it appears we can look for things in the first mobi header that indicate that KF8-style data is included, and then parse those records using the new section 0, very much like we process the original mobi.
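
In rough Python, that BOUNDARY detection might be sketched like this (a hypothetical helper, relying only on the standard Palm database layout: a 2-byte big-endian record count at offset 76, followed by 8-byte record-info entries):

```python
import struct

def find_boundary_section(data):
    """Return the index of the section holding the BOUNDARY marker, or None.
    'data' is the raw bytes of a .mobi Palm database file. Sketch only."""
    num_sections, = struct.unpack_from('>H', data, 76)
    # Each record-info entry is 8 bytes: a 4-byte data offset plus attributes/unique ID.
    offsets = [struct.unpack_from('>L', data, 78 + i * 8)[0]
               for i in range(num_sections)]
    offsets.append(len(data))
    for i in range(num_sections):
        if data[offsets[i]:offsets[i + 1]].startswith(b'BOUNDARY'):
            return i
    return None
```

Everything after that section would then be parsed as the KF8 half, starting from its own section 0 header.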

So, anyone want to take a shot at modifying the latest mobiunpack to unpack both versions of the files for these new K8s?

Volunteers welcome!

DaleDe
12-16-2011, 11:07 AM
The second entry is the source file I believe, generally an ePub exactly duplicated. Or are you talking about some other data?

KevinH
12-16-2011, 11:17 AM
Hi,

No, there is a separate section for the source zip file as well. We are talking about the K8 version of the ebook, packed immediately after the normal mobi one in a single palmdoc container.

Grab version 0.32 of mobiunpack, edit it with a text editor to set DEBUG = True, run it on that K8 ebook, and examine the extra .raw sections written in debug mode inside the images folder to see what I am referring to.

KevinH
12-23-2011, 12:59 PM
Hi,

Just in case anyone wants to play around with the latest K8 .mobi files, I have attached a newest_mobi_unpack.zip

I made massive changes and reorganized everything and split it into many different files and then renamed it to mobi_unpack to prevent confusion.

This is very experimental and probably will not work for you.

But if you want to play around, download and unzip it. Copy the test Jerome.mobi (see earlier link) into that directory. Change to that directory and then run:

python ./mobi_unpack.py Jerome.mobi test/

(or whatever the windows equivalent is if you are on windows)

If it works, inside of test you should see the original mobi info, a K8 folder that has the new K8 xhtml files, and a Jerome.epub which is the epub created from the new K8 files.

You should also see a kindlegensrc.zip file which represents the original epub that was used to generate the Jerome.mobi which you can unzip and compare against the files in the K8 folder or the Jerome.epub.

Please report any difficulties so we can fix any bugs.

Happy Holidays!

KevinH

KevinH
12-24-2011, 09:46 PM
FYI:
DiapDealer found and fixed a number of bugs in the new mobi_unpack program for K8 files.
Thanks to DiapDealer!

So if anyone wants the updated version, check out my later posts in this thread to find the very latest version.


KevinH

lizcastro
01-12-2012, 12:25 PM
Thanks, Kevin! This is so helpful.

Can you confirm that the only thing mobi_unpack does is show what was in the mobi file? It doesn't generate anything, right?

When I convert an EPUB file to mobi with KindleGen2, and then unpack it with your latest version of mobi_unpack, I get a folder that contains a smaller version of the EPUB file than the original, an HTML file with what looks like the contents of the entire book, along with an ncx and opf file, and a folder with reduced size images.

Then, there's a K8 folder that contains a completely re-engineered set of files, all renamed, resized images, etc. of what was originally in my EPUB file.

And then there's a kindlegensrc.zip file, that when unzipped, contains my original unaltered files.

It all seems so excessive.

thanks,
Liz

pdurrant
01-12-2012, 12:45 PM
Thanks, Kevin! This is so helpful.

Can you confirm that the only thing mobi_unpack does is show what was in the mobi file? It doesn't generate anything, right?

When I convert an EPUB file to mobi with KindleGen2, and then unpack it with your latest version of mobi_unpack, I get a folder that contains a smaller version of the EPUB file than the original, an HTML file with what looks like the contents of the entire book, along with an ncx and opf file, and a folder with reduced size images.

Then, there's a K8 folder that contains a completely re-engineered set of files, all renamed, resized images, etc. of what was originally in my EPUB file.

And then there's a kindlegensrc.zip file, that when unzipped, contains my original unaltered files.

It all seems so excessive.

thanks,
Liz

The only new thing that Mobiunpack creates is the epub, which is generated from the K8 folder. The HTML, ncx, opf and folder of images are the mobipocket version, the K8 is the new Kindle Format 8 version and the kindlegensrc.zip are indeed your original files which are also in the Mobipocket file.

Yes, the output from the new KindleGen does contain the Mobipocket, KF8 and your source files, all wrapped up in one.

KevinH
01-12-2012, 12:48 PM
Hi Liz,

It does unpack and generate things so that the end user could edit the files and drop them back on kindlegen to recreate a modified mobi.

The new kindlegen creates mobis (palm database files) that actually have two completely different versions of the ebook inside them (and I am not referring to the kindlegensrc.zip, which may also be stored there).

The first is the original mobi-format ebook, and immediately after it is the new K8 mobi ebook, all stored in the same .mobi palm database file.

So older technology can read the .mobi file from the top and see it as a normal mobi. Newer technology can then detect that this is a compound mobi file and open the second half, which is the K8-formatted version (HTML5, basically a variation of an ePub), to get all of the new features.

Right now, mobi_unpack.py will create in the output folder the following:

1. from the old part of the .mobi it will create the source mobi markup (old html) and images that will allow the user to edit it any way they want and drop it back on kindlegen.

2. if the kindlegensrc.zip record is present it will unpack it so that the user can see the actual source ebook file (typically an epub) given to kindlegen. This record is typically removed by Amazon but is actually created by Kindlegen.

3. from the K8 version of the .mobi, it will create the K8 folder and inside it all of the images and fonts, and xhtml source files that were used to create it. A user who did not have access to the kindlegensrc.zip could edit this and then drop it on kindlegen to create a new/altered version of the ebook (fix typos, etc).

4. from the K8 pieces, it will actually build a complete epub, which is stored as well.

You can then compare the epub created from the K8 against the kindlegensrc.zip (typically an epub) to see what, if anything, the kindlegen processing changed.

All of this requires rebuilding and generation. The actual binary format inside the mobi file needs to be decoded to make something that is usable in some way. If you want to see what the actual raw files look like, you can use Notepad++ or any good text editor to change one line near the top of mobi_unpack.py so that it writes out all of the raw text pieces as well.

So it is not simply something that dumps sections from the palm database file. It does do that (the raw file), but then rebuilds everything to try to get back to the original source, so that authors and other people can more easily edit their books and recreate mobi output using Kindlegen.

It is also useful for understanding the internal format of the new .k8 mobis and what if any tags are created and used.

If you have any other questions just ask.

Take care,

Kevin





lizcastro
01-12-2012, 01:10 PM
Fascinating! Thanks so much for the info. And for mobi_unpack itself.

I find the fact that the mobi file contains a non-KF8 version, a KF8 version AND the original EPUB particularly interesting.

And I hate the way all the files get renamed! I assume that's KindleGen and not mobi_unpack.

Are either of you on Twitter? I'd love to follow you.

best,
Liz

KevinH
01-12-2012, 02:08 PM
Hi Liz,

Inside the .mobi there are no file names at all. Each font, image, etc. is just stored in a section of the database (with no name info) and referred to from the processed html (i.e. all links are converted to section numbers in the .mobi palm database).

So all "names" are created by us, either based on the title or simply numbered: img0001.jpg, font0002.ttf, part0004.xhtml, etc. We have no way of knowing what the original name was, or whether it was a chapter, or section, or ....

That is the main reason we need to re-generate things. Even in the older mobis, the mobi markup html that was input to kindlegen was processed to remove links, store images in sections, etc, and so we must reverse that to get back to something that can be edited by users.
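
As an illustration of that numbering (the magic-byte signatures and name patterns below are my own, not mobi_unpack's exact scheme), inventing a name for an anonymous resource section comes down to sniffing a few leading bytes and numbering the result:

```python
def name_resource(index, payload):
    """Invent a filename for a nameless Mobi resource section by sniffing
    magic bytes; illustrative only, not mobi_unpack's exact naming."""
    if payload.startswith(b'\xff\xd8'):    # JPEG signature
        ext = 'jpg'
    elif payload.startswith(b'\x89PNG'):   # PNG signature
        ext = 'png'
    elif payload.startswith(b'GIF8'):      # GIF signature
        ext = 'gif'
    else:                                  # unknown payload: keep it raw
        ext = 'raw'
    return 'image%05d.%s' % (index, ext)

print(name_resource(1, b'\xff\xd8\xff\xe0'))  # image00001.jpg
```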

As for twitter - I am too old to deal with anything new ;-)

But I am sure Paul, or DiapDealer or any of the other contributors from this forum topic (mobi_unpacker is really the joint effort of a lot of people) would be happy to answer any questions.

Take care,

KevinH




lizcastro
01-12-2012, 02:18 PM
Whoa. I didn't realize. I sort of knew that mobi was this big mass of data, but didn't realize to what extent. So, if I understand correctly, mobi_unpack reverse engineers the mobi and then generates what the individual files would look like if they were individual files?

So it's not KindleGen that renames them, it's mobi_unpack, but it does so because it has no other choice, since the names are lost in the conversion to mobi?

But the kindlegensrc.zip file actually comes from a real, existing EPUB that's sitting there in the mobi file created by KindleGen?

Going to set WRITE_RAW_DATA to True now to see what happens.

thanks!

Liz

lizcastro
01-12-2012, 02:31 PM
Why does mobi_unpack generate an EPUB file?

DiapDealer
01-12-2012, 02:45 PM
Why does mobi_unpack generate an EPUB file?
I will defer to Kevin for the final say on this question, but for myself... mobi_unpack generates an epub because the KF8 format itself is basically nothing more than a binary representation of an epub.

So since the original source won't be part of a commercially available, DRM-Free KF8 ebook, mobi_unpack decompiles the KF8 data into a familiar, standard, editable format that can be easily modified (or examined) with existing tools/programs and then fed right back to kindlegen.

KevinH
01-12-2012, 03:07 PM
Hi,

Yes, exactly as DiapDealer said!

It is nice to have the kindlegensrc.zip but ebooks downloaded from Amazon won't have that. Amazon strips it off (and if they keep it they could start selling epubs if they ever wanted to as well).

So mobi_unpack tries to recreate the original epub as closely as it can, based on the K8 information (which is xhtml-based with normal css, essentially an epub with the main bits merged into one file, links replaced, and a few other modifications).

Take a look at the _k8.raw file in a text editor to see what kindlegen actually stores inside. You can find the css info stored at the end (inline), with any svg moved there as well. You can see how they have replaced links with base-32 numbered references, added their own aid="", etc.

mobi_unpack figures out how to reverse all of that to get back as close to an epub as possible, since that is the input format for kindlegen.
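
For the curious, the base-32 numbers Kevin mentions use digits 0-9 followed by A-V. A minimal decoder sketch (the kindle:pos link syntax in the comment is from memory and may vary):

```python
def from_base32(text):
    """Decode a KF8-style base-32 number (digits 0-9, then letters A-V)."""
    digits = '0123456789ABCDEFGHIJKLMNOPQRSTUV'
    value = 0
    for ch in text.upper():
        value = value * 32 + digits.index(ch)
    return value

# e.g. the fid/off fields in a link like kindle:pos:fid:000A:off:0000000049
print(from_base32('000A'))  # 10
```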

Take care,

Kevin



lizcastro
01-12-2012, 03:11 PM
Hmm. I don't see the _k8.raw file. When I used WRITE_RAW_DATA=True, the only thing I got different was a .rawml file, but it looks a lot like the .html file on the non-kf8 side. Should I have modified some other setting?

KevinH
01-12-2012, 03:17 PM
Hi,

Look for a file inside the K8 directory that is named after the title of the book and ends with .rawml (I used to call it _k8.raw but then moved it inside the K8 folder so that it would not clash with the raw version from the older mobi part of the ebook).

You should find the css at the end, links changed, aid="" placed in tags to augment the original id="", etc.

For fun you can look at the .rawml version outside of the K8 directory. It is how the original mobi markup language got processed by kindlegen. Check out the links, how styles are inlined, etc.