KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 11

siebert · 09-10-2011, 04:04 PM

Quote:

Originally Posted by pdurrant

I think that unknown Metadata should only be showing as a warning. There is almost always some unknown metadata, as the Mobipocket/Print Replica file format is undocumented.

Hi,

In my version I've already introduced a list of "known unknown" metadata (meaning that we know these values exist, but not what they mean), and mobiunpack complains only if an unknown value isn't in this list.
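A minimal sketch of that "known unknown" idea, assuming metadata arrives as (record id, value) pairs. The function name and every record id below are illustrative placeholders, not values taken from mobiunpack or from the file format:

```python
# Placeholder tables: the ids are made up for illustration only.
KNOWN_METADATA = {100: "author", 101: "publisher"}   # ids whose meaning is understood
KNOWN_UNKNOWN_IDS = {204, 205, 206}                  # seen in files, meaning still unknown


def unknown_metadata_warnings(records):
    """Return one warning per record whose id is in neither table."""
    warnings = []
    for record_id, value in records:
        if record_id in KNOWN_METADATA or record_id in KNOWN_UNKNOWN_IDS:
            continue  # understood, or at least expected: stay quiet
        warnings.append("Warning: unknown metadata %d: %r" % (record_id, value))
    return warnings


if __name__ == "__main__":
    for line in unknown_metadata_warnings([(100, b"a"), (204, b"b"), (999, b"c")]):
        print(line)
```

Only the record with id 999 is reported here; the other two are either understood or on the "known unknown" list.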

I hope I'll find time to release my version soon.

Ciao,
Steffen

Anjelous · 09-10-2011, 04:11 PM

Edited! Ok to delete this post as I found another thread that better answers my question

fandrieu · 09-12-2011, 09:05 AM

Hello everybody.

First I'd like to thank the community for all the good work; without the homebrew tools, my experience with the mobi file format and the Kindle as a whole wouldn't have been nearly as nice!

Back on topic: I first came here to ask whether somebody is maintaining mobiunpack.py and accepting patches, but reading the last few posts it seems that both pdurrant and siebert are working on a branch. Am I right?
If so, could I contribute?

...

Also, in the last few posts there was talk about extracting the NCX from mobi files. It just so happens that's the very feature I've been toying with this weekend, and it's what made me come here today.

At this time I have some (pretty awful) proof-of-concept code that can extract a flat "chapter only" NCX. I got the necessary clues from the "writer" part of the calibre mobi module; I could elaborate on that if somebody's interested...

Apart from that, I made some corrections (like the encoding header in the html, which appears to be in siebert's branch) and I also have alternate "Adding anchors..." code that reconstructs all anchors, even when they're not referenced, and should avoid adding anchors in the <head> (a bug I encountered with some files).
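Roughly, that kind of anchor reconstruction could look like the sketch below. It is not the code from the attached file: `insert_anchors` is a hypothetical helper, the `fileposN` id scheme is an assumption, and "not in the <head>" is approximated by skipping every offset that falls before the first `<body` tag.

```python
def insert_anchors(rawml, positions):
    """Insert <a id="fileposN" /> at each byte offset N that lies in the body."""
    body_start = rawml.lower().find(b"<body")
    if body_start < 0:
        body_start = 0  # no <body> tag found: treat the whole text as body
    out = rawml
    # Walk from the highest offset down so that earlier offsets stay valid
    # while anchors are being spliced in.
    for pos in sorted(set(positions), reverse=True):
        if pos < body_start or pos > len(rawml):
            continue  # skip offsets in the head or past the end of the text
        anchor = b'<a id="filepos%d" />' % pos
        out = out[:pos] + anchor + out[pos:]
    return out


if __name__ == "__main__":
    doc = b"<html><head><title>t</title></head><body><p>one</p><p>two</p></body></html>"
    target = doc.find(b"<p>two")
    print(insert_anchors(doc, [5, target]).decode("ascii"))
```

Offset 5 falls inside the head and is dropped; only the offset of the second paragraph gets an anchor.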

I was also interested in refactoring the code to be more readable / workable (this also appears to be in siebert's plans).

I started with the (pdurrant's ?) version @ http://code.google.com/, but wouldn't mind switching...

KevinH · 09-12-2011, 09:16 AM

Hi,

Great! The more the merrier. If you look through this topic you will find links to later versions than what we (pdurrant and I) hosted on code.google.com; we have not bothered to update that site lately. Yes, you are right: siebert has added support for dictionaries and made some major speed improvements. I have added code to spit out more of the metadata so that the tool can be used to investigate what each metadata value means (for example, we recently found what we think is the expiration date), and pdurrant has added support for non-DRM versions of the .azw4 format.

Simply walk through this thread and grab the very latest version of mobiunpack.py that you see and use that as your starting point. I believe you want mobiunpack.py version 0.31 posted by pdurrant a few days ago to this thread. siebert may have an even newer version but I don't think he has posted it yet. Let me know if you can't find it and I will post it again for you.

KevinH

pdurrant · 09-12-2011, 09:16 AM

I'm afraid the Google Code version is very much out of date. The current version is 0.31, which can be found in this thread here.

While we really ought to be using version control software, in a clever shared manner, at present we seem to be posting updates here, which I copy back to the fifth post in this thread.

Some ncx generation code would be welcome. I posted a sample of the binary data representing an ncx, along with the source ncx file, here.

Any other changes would be interesting to see too.

DiapDealer · 09-12-2011, 09:17 AM

I'd be very interested in seeing the NCX extraction code you've come up with. I use calibre to convert epubs to mobi, and then feed the output of mobiunpack to kindlegen... so not having to rebuild the NCX by hand each time would be very welcome indeed.

siebert · 09-12-2011, 11:54 AM

Quote:

Originally Posted by DiapDealer

I use calibre to convert epubs to mobi, and then feed the output of mobiunpack to kindlegen...

Calibre had a debug option --kindlegen which used the kindlegen binary to build the mobi file.

Kovid removed that option a few days ago because he doesn't like me and my request to make that option selectable via the gui (though I'm obviously not the only person liking that feature), but if you are willing to use either an older or a modified version of calibre you don't need the mobiunpack step.

Ciao,
Steffen

fandrieu · 09-12-2011, 12:14 PM

Thank you very much for all your quick replies!

I just downloaded v31 from the link you provided and finished retro-fitting the modifications I had made to v23.

I'll try to explain why / how I did those changes later but first, as code speaks louder than words, here's the file.

....

Just a few words:
* I just finished merging, it's not tested

* The NCX part is really a proof of concept, it does however produce an acceptable output on my test files with flat NCX.
It consists of:
- a code block with 3 methods just before unpackBook
- a main code block in unpackBook, enclosed by "#TEST NCX"
- a small mod to the OPF code, to add a ref to the NCX

* Other than that, there are some "empirical" changes I made while testing some files:
- FILEPOS_ON_ALL_ANCHORS: an option to use alternate code that processes all empty anchors instead of focusing on existing links...
- replaced a " " by "\s+" in the "Insert hrefs into html" rx...
- alternate way to set the html file encoding
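To illustrate the " " versus "\s+" change in the list above: the two patterns below are simplified stand-ins, not the actual "Insert hrefs into html" expression from mobiunpack. They only show why a literal space misses tags that a whitespace class still catches.

```python
import re

# Simplified stand-in patterns (assumptions, not the real mobiunpack regex).
literal_space = re.compile(r"<a filepos=(\d+)", re.IGNORECASE)
any_whitespace = re.compile(r"<a\s+filepos=(\d+)", re.IGNORECASE)

samples = [
    "<a filepos=0000006414 >",    # single space
    "<a  filepos=0000006414 >",   # two spaces
    "<a\nfilepos=0000006414 >",   # line break inside the tag
]

hits_literal = [bool(literal_space.search(s)) for s in samples]
hits_whitespace = [bool(any_whitespace.search(s)) for s in samples]

if __name__ == "__main__":
    print(hits_literal)      # only the single-space sample matches
    print(hits_whitespace)   # all three samples match
```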

EDIT: sorry, the file I uploaded contained several fatal errors I failed to spot.
EDIT: the new file should work at least with calibre-generated mobis...

EDIT2: added a text file describing what I gathered of the NCX equivalent in MOBI

EDIT3: basic fixes to the code...

DiapDealer · 09-12-2011, 12:48 PM

Quote:

Originally Posted by siebert

Calibre had a debug option --kindlegen which used the kindlegen binary to build the mobi file.

Kovid removed that option a few days ago because he doesn't like me and my request to make that option selectable via the gui (though I'm obviously not the only person liking that feature), but if you are willing to use either an older or a modified version of calibre you don't need the mobiunpack step.

How would that be any different than feeding the epub directly to kindlegen? My reason for using calibre as an intermediate step is because calibre does a much better job of translating/flattening an ePub's CSS into a mobi that more accurately reflects the original (visibly) than kindlegen currently does. Kindlegen can then take the mobiunpack output and create the final mobi (with the approved tools). Or am I missing something?

KevinH · 09-12-2011, 01:58 PM

Hi,

Thanks for posting! I grabbed it and tried it on a bunch of mobis I had and, unfortunately, the anchors for many of the internal links in the document no longer work. I tested the same files with mobiunpack version 31 without your changes and all internal links worked.

So somehow your changes have broken some of the internal links.
I will try to track this down.

I did get some form of NCX file but it was incomplete and there were error messages:

Write html
ERROR: last byte not 0x80
ERROR: text not found 1354424
Wite ncx
Write opf

I will keep playing with it to see if I can get the internal links working again.

Thanks for getting this ncx stuff going!

KevinH

pdurrant · 09-12-2011, 02:03 PM

Quote:

Originally Posted by DiapDealer

How would that be any different than feeding the epub directly to kindlegen? My reason for using calibre as an intermediate step is because calibre does a much better job of translating/flattening an ePub's CSS into a mobi that more accurately reflects the original (visibly) than kindlegen currently does. Kindlegen can then take the mobiunpack output and create the final mobi (with the approved tools). Or am I missing something?

I believe the --kindlegen option did the usual conversion to mobipocket-specific HTML, but then used kindlegen to compile it into an actual mobipocket file rather than calibre's own mobipocket file generation code.

DiapDealer · 09-12-2011, 02:29 PM

Quote:

Originally Posted by pdurrant

I believe the --kindlegen option did the usual conversion to mobipocket-specific HTML, but then used kindlegen to compile it into an actual mobipocket file rather than calibre's own mobipocket file generation code.

Ahhh, ok, that makes sense. Thanks.

KevinH · 09-12-2011, 02:32 PM

Hi,

Your link_pattern used if FILEPOS_ON_ALL_ANCHORS is True seems to be a bit broken:

For example: here is what the rawml says for one link:

<a filepos=0000006414 >M<span><font size="2">APS</font></span></a>

but this link is never properly detected or processed by your link pattern:

link_pattern = re.compile(r'''<a\s*(></a>|/>)''', re.IGNORECASE)

So you might want to take another look at your link patterns to make sure rawml of this type gets processed properly.
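A quick check of that behaviour against the rawml line quoted above. The second pattern, `filepos_link`, is only a hypothetical alternative shown for comparison; it is not taken from any posted version of mobiunpack.

```python
import re

rawml = '<a filepos=0000006414 >M<span><font size="2">APS</font></span></a>'

# Pattern quoted in the post: it only matches empty, attribute-less anchors.
empty_anchor = re.compile(r'''<a\s*(></a>|/>)''', re.IGNORECASE)

# Hypothetical alternative: capture the filepos value of any <a ...> tag.
filepos_link = re.compile(r'''<a\s+[^>]*?filepos=['"]?(\d+)['"]?[^>]*>''', re.IGNORECASE)

missed = empty_anchor.search(rawml)   # None: the link is never detected
found = filepos_link.search(rawml)
target = int(found.group(1))          # leading zeros dropped by int()

if __name__ == "__main__":
    print(missed)
    print(target)
```

The quoted pattern returns no match for this link, while the looser pattern recovers the target offset 6414.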

Hope this helps,

KevinH

KevinH · 09-12-2011, 04:50 PM

Hi,

Okay I looked more at this index material. It appears the "type" information is key to understanding how to read in the indx information.

For example:

To correctly parse the indx entries, I had to do something like the following:

    if type == 0x1f:
        # handle the next two variable-width unknowns
        pos, unk1 = getVariableWidthValue(navdata, offset)
        offset += pos
        print "unknown 1 is ", unk1
        pos, unk2 = getVariableWidthValue(navdata, offset)
        offset += pos
        print "unknown 2 is ", unk2
    elif type == 0xdf:
        # handle the next four variable-width unknowns
        pos, unk1 = getVariableWidthValue(navdata, offset)
        offset += pos
        print "unknown 1 is ", unk1
        pos, unk2 = getVariableWidthValue(navdata, offset)
        offset += pos
        print "unknown 2 is ", unk2
        pos, unk3 = getVariableWidthValue(navdata, offset)
        offset += pos
        print "unknown 3 is ", unk3
        pos, unk4 = getVariableWidthValue(navdata, offset)
        offset += pos
        print "unknown 4 is ", unk4
    elif type == 0x3f:
        # handle the next three variable-width unknowns
        pos, unk1 = getVariableWidthValue(navdata, offset)
        offset += pos
        print "unknown 1 is ", unk1
        pos, unk2 = getVariableWidthValue(navdata, offset)
        offset += pos
        print "unknown 2 is ", unk2
        pos, unk3 = getVariableWidthValue(navdata, offset)
        offset += pos
        print "unknown 3 is ", unk3

and then there is no need to look for or skip 0x80 values.
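The same per-type handling can be written as a small table-driven sketch. The counts (two, four and three values) come straight from the snippet above. Everything else is an assumption: getVariableWidthValue itself is not shown in the thread, so `read_varwidth` is a stand-in that assumes the forward variable-width encoding (seven payload bits per byte, with the high bit set on the last byte of each value), and `read_unknowns` is a hypothetical helper name.

```python
def read_varwidth(data, offset):
    """Decode one forward-encoded variable-width value (assumed encoding).

    Returns (bytes consumed, value).
    """
    value = 0
    consumed = 0
    while True:
        byte = data[offset + consumed]
        consumed += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:  # high bit marks the final byte of this value
            return consumed, value


# Number of variable-width unknowns that follow each entry type,
# as observed in the post above.
UNKNOWN_COUNTS = {0x1F: 2, 0xDF: 4, 0x3F: 3}


def read_unknowns(entry_type, data, offset):
    """Read the unknowns for one entry; returns (new offset, list of values)."""
    values = []
    for _ in range(UNKNOWN_COUNTS.get(entry_type, 0)):
        consumed, value = read_varwidth(data, offset)
        offset += consumed
        values.append(value)
    return offset, values


if __name__ == "__main__":
    sample = bytes([0x81, 0x01, 0x80, 0x85])
    print(read_unknowns(0x3F, sample, 0))
```

Under that encoding the terminating high bit is consumed as part of each value, which would be consistent with not having to look for or skip 0x80 bytes separately.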

Also the count is not the same as the number of entries in the CTOC.

From my set of ebooks, the CTOC data always ends with '\0\0' double null bytes and it has variable length.

So I have attached a mobiunpack_test.py program that modifies things to work with a real amazon mobi ebook (as opposed to calibre generated ones).

Perhaps this might help others trying to track things down.

I am going to try and figure out what each of these unknowns actually means.

Hope this helps,

KevinH

siebert · 09-12-2011, 05:36 PM

Quote:

Originally Posted by KevinH

Okay I looked more at this index material. It appears the "type" information is key to understanding how to read in the indx information.

The various indexes seem to be very similar in mobi, so the ncx handling code should be able to reuse a lot of my code for the inflection index.

INDX0 is the meta index and the TAGX section can be parsed with readTagSection(). INDX1 is the actual index data, and the CTOC data is like the inflNameData.

Ciao,
Steffen
