KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 16

Doitsu · 10-29-2011, 08:38 AM

Quote:

Originally Posted by siebert

What exactly do you mean by "no longer works as a dictionary"?

It no longer works as a lookup dictionary. It should work exactly like the original, but it doesn't.

Quote:

Originally Posted by siebert

My focus was to use the recompiled dictionary in the kindle app [...]

I assumed the objective of the mobiunpack.py developers was to re-create the original source files as faithfully as possible, so that you could theoretically unpack a dictionary, correct an entry, and recompile it without any loss of functionality. Currently this doesn't seem to work.

I know that the Kindle app doesn't allow users to select user dictionaries anyway, but it is possible to patch the ASIN number of a user dictionary so that it matches the ASIN of one of the 5 official dictionaries.

IMHO, it doesn't make much sense to convert a dictionary to a Mobipocket ebook because the user loses the dictionary functionality.

Quote:

Originally Posted by siebert

[...]so I removed the javascript code before recompiling the dictionary).

AFAIK, Javascript code is not required in Mobipocket dictionary .html source files and wasn't present in my dictionary .html source file before it was compiled.

Please have a look at the original .html source file and the one that the script re-creates and you'll see that they differ significantly and I'm not talking about whitespace characters and line-breaks.

siebert · 10-29-2011, 09:27 AM

Quote:

Originally Posted by Doitsu

It no longer works as a lookup dictionary.

You still didn't reveal which application you use to view the dictionary...

Quote:

I assumed the objective of the mobiunpack.py developers was to re-create the original source files as faithfully as possible, so that you could theoretically unpack a dictionary, correct an entry, and recompile it without any loss of functionality.

Yes, this is the final goal. We know that it's not reached yet. A version number < 1.0 might give the hint that we don't see the script as finished.

Quote:

Currently this doesn't seem to work.

My dictionary itch has been scratched by the existing dictionary support I've implemented. I'm aware that there are several things left to be done, but as it works for me, it probably takes someone else to finish the support (but I'm willing to help as time permits).

Quote:

I know that the Kindle app doesn't allow users to select user dictionaries anyway, but it is possible to patch the ASIN number of a user dictionary so that it matches the ASIN of one of the 5 official dictionaries.

Yep, that's correct and what I've done to use the optimized dictionary.

By the way, I was very surprised to see that the unmodified dictionary works great on my new Kindle 3 (keyboard). It seems that the Kindle firmware removes unnecessary formatting when displaying a dictionary entry in the popup window, while the Kindle app doesn't.

Quote:

IMHO, it doesn't make much sense to convert a dictionary to a Mobipocket ebook because the user loses the dictionary functionality.

Converting from what?

Quote:

Please have a look at the original .html source file and the one that the script re-creates and you'll see that they differ significantly and I'm not talking about whitespace characters and line-breaks.

I haven't done that yet. Can you give some examples of what's different? If you could find out what has to be fixed to make it work, someone (me?) might fix the mobiunpack script.

Ciao,
Steffen

Doitsu · 10-29-2011, 11:08 AM

Quote:

Originally Posted by siebert

You still didn't reveal which application you use to view the dictionary...

Because it doesn't really matter. I use dictionaries primarily on my Kindle 3 and with Mobipocket Reader. I also use the Kindle app on my iPhone.

Quote:

Originally Posted by siebert

A version number < 1.0 might give the hint that we don't see the script as finished

I'm well aware of the fact that reverse engineering takes time and never said that I expected a perfect script.

Quote:

Originally Posted by siebert

Converting from what?

The reverse engineered source files.

Quote:

Originally Posted by siebert

Can you give some examples of what's different?

I believe it would be much easier and faster if you simply had a look at the source files. Since my very simple proof-of-concept .html source file only contains 7 dictionary definitions, it shouldn't be too complicated.

Keep up the good work!

sourcejedi · 12-10-2011, 04:56 PM

[If this should be a new thread, please do ask mods to move it]

This is not a support request. Just to let you know I noticed a round-trip failure using mobiunpack, kindlegen 1.2 for linux, and a Mobipocket edition of one of the Young Wizards books. I'm curious whether this is a known bug.

I unpacked it, edited the "HTML", and invoked KindleGen on the OPF file. (That's generally expected to work, right?) No problem so far; FBReader seemed happy with the new MOBI file.

But then I tried to verify it by unpacking the new MOBI and checking for differences. This happened -

Code:

<p height="0pt" width="0pt" align="justify"><a  filepos=0000008568 ><font color="blue"><u>Consultations</u></font></a></p>

i.e. a number of links are output as filepos= instead of href= - here's the original:

Code:

<p height="0pt" width="0pt" align="justify"><a href="#filepos8519"><font color="blue"><u>Consultations</u></font></a></p>

I double-checked the new MOBI in FBReader, and that specific link is working fine. mobiunpack does seem to find the matching anchor; it's just the links that have gone weird.

Code:

<mbp:pagebreak/></div><div><a id="filepos8568" /><a id="filepos8568" /> 
<p height="1em" width="0pt" align="center"><font size="5"><b><font color="red"> Consultations</font></b></font></p>

ISTR hearing that having multiple anchors next to each other in MobiPocket can be bad news... I think mobiunpack generates them because there are two links to the same location (from two different tables of contents)... but if that were the problem, I'd have thought it would show up as KindleGen dying, or a loss of functionality in the MOBI file, which hasn't happened...
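
To check how widespread the doubled anchors are, a short script can list every filepos anchor id that occurs more than once in the unpacked HTML. This is only a sketch: the helper name and the pattern are made up for illustration and are not taken from mobiunpack, and the pattern assumes the anchors look like the ones quoted above.

```python
import re
from collections import Counter

# Assumes anchors are written as <a id="fileposN" />, as in the snippet above.
ANCHOR_ID = re.compile(r'''<a\s+id=["'](filepos\d+)["']''', re.IGNORECASE)

def duplicate_anchor_ids(html):
    """Return the anchor ids that appear more than once (hypothetical helper)."""
    counts = Counter(ANCHOR_ID.findall(html))
    return sorted(name for name, n in counts.items() if n > 1)

sample = '<mbp:pagebreak/></div><div><a id="filepos8568" /><a id="filepos8568" />'
print(duplicate_anchor_ids(sample))
```

An empty list would mean no id is repeated in that file.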

FULL DISCLOSURE. The original MOBI also includes some "dead links" (href="../Text/#filepos6634"). After the round-trip, these appear as filepos=XXXXXXXX. So, it's possible these dead links are confusing mobiunpack, although I'm not sure how. [KindleGen warns "Warning(prcgen): Hyperlink not resolved", but continued anyway. I don't see any other warnings. Ideally mobiunpack would provide a similar warning during unpacking, so you can tell something odd has happened.]

Second disclosure. From the above evidence, I believe that the "original" MOBI has already gone through at least one MOBI->EPUB->MOBI conversion. (Presumably edited in Sigil in between). I have a copy of what I assume is the EPUB version. The EPUB also has "calibre" written all over it (class="calibre"). So it's quite possible the MOBI I started with was generated by Calibre's reverse-engineered code, as opposed to the official MobiPocket/Kindle conversion code.

pdurrant · 12-11-2011, 12:30 PM

Quote:

Originally Posted by sourcejedi

[If this should be a new thread, please do ask mods to move it]

This is not a support request. Just to let you know I noticed a round-trip failure using mobiunpack, kindlegen 1.2 for linux, and a Mobipocket edition of one of the Young Wizards books. I'm curious whether this is a known bug.

Whether this is a problem in MobiUnpack, KindleGen, or the original file will take quite a lot of detective work.

The first thing to do would be to enable the raw output in MobiUnpack, and see if the duplicate destination markers are present in that.

Looking at the raw output will also help to check whether the problem happens in Mobiunpack (in the conversion to HTML links) or in KindleGen.

When I have some spare time, I might take a look at this, but I can't at the moment. It sounds like you're a pretty good hand at this - why not continue the investigative work yourself?

Oh - and one thing to do would be to continue the Mobiunpack/KindleGen/Mobiunpack sequence a few times, and see if things keep on changing and getting worse.
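
One way to see whether repeated round trips keep changing the file is to diff the unpacked HTML of each pass against the previous one. The sketch below only does the comparison step with Python's difflib; running mobiunpack and KindleGen is left out, the function names are invented here, and the sample "passes" are made-up strings built from the links quoted earlier in the thread.

```python
import difflib

def changed_lines(before, after):
    """Return removed ('-') and added ('+') lines between two HTML versions."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.extend("-" + line for line in a[i1:i2])
            changes.extend("+" + line for line in b[j1:j2])
    return changes

def report_round_trips(versions):
    """Print and return the number of changed lines between successive passes."""
    counts = []
    for i in range(1, len(versions)):
        counts.append(len(changed_lines(versions[i - 1], versions[i])))
        print("pass %d -> pass %d: %d changed lines" % (i - 1, i, counts[-1]))
    return counts

# Hypothetical data: pass 0 is the first unpack, passes 1 and 2 are later round trips.
pass0 = '<p>Contents</p>\n<a href="#filepos8519">Consultations</a>'
pass1 = '<p>Contents</p>\n<a  filepos=0000008568 >Consultations</a>'
pass2 = pass1
counts = report_round_trips([pass0, pass1, pass2])
```

If the counts drop to zero after the first pass, the output has at least stabilised; counts that stay above zero would suggest each round trip is still altering something.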

sourcejedi · 12-11-2011, 02:00 PM

Done. [Attached zip: mobiunpack.py for testers; patch for developers].

You probably couldn't see the problem in the html I posted even if you tried, because I foolishly neglected to use CODE tags. The real problem was an extra space character between "<a" and "filepos=".

mobiunpack doesn't say anything about "filepos=XXXXXXXX", so that must have come from KindleGen. (Although it could still be useful to warn about non-numeric filepos values).
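
A warning of that kind could look roughly like the following. This is a sketch of the idea only, not code from mobiunpack; the pattern and the helper name are assumptions made for the example.

```python
import re

# Captures whatever follows filepos=, quoted or not (pattern is an assumption).
FILEPOS_VALUE = re.compile(r'''filepos=['"]?([^\s'">]+)''', re.IGNORECASE)

def find_bad_filepos(text):
    """Return filepos values that are not purely numeric (hypothetical helper)."""
    return [value for value in FILEPOS_VALUE.findall(text) if not value.isdigit()]

sample = '<a filepos=0000008568 >ok</a> <a filepos=XXXXXXXX >dead</a>'
for value in find_bad_filepos(sample):
    print("Warning: non-numeric filepos value %r" % value)
```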

DiapDealer · 12-11-2011, 02:33 PM

I see you've patched the 0.29 version of mobiunpack.py. Is that the version you were using when you discovered the issue?

I only ask because v0.32 of mobiunpack.py (the latest can always be found in post #5 of this thread) seems to have an updated regex pattern that would seem to achieve the same result as the regex in your patch:

From v0.32

Code:

link_pattern = re.compile(r'''<[^<>]+filepos=['"]{0,1}(\d+)[^<>]*>''', re.IGNORECASE)

From your patch

Code:

link_pattern = re.compile(r'''<a[ ]+filepos=['"]{0,1}0*(\d+)['"]{0,1} *>''', re.IGNORECASE)

Have you tried v0.32 to see if this issue might be a non-starter?

sourcejedi · 12-11-2011, 05:40 PM

Sorry, yes. 0.32 from this thread works correctly. I was using the version from Siebert's git repo which describes itself as 0.29.
Thanks for pointing it out.

I'm probably used to assuming 'git' means 'the latest version'. But that's not true in general, and I should have said where I got the program from.

[Nitpick: I think you quoted the wrong link_pattern - there's two of them, and the first appears unchanged. The relevant one has your name next to it in 0.32.]

Code:

# Two different regex search and replace routines.
# Best results are with the second so far IMO (DiapDealer).

#link_pattern = re.compile(r'''<a filepos=['"]{0,1}0*(\d+)['"]{0,1} *>''', re.IGNORECASE)
link_pattern = re.compile(r'''<a\s+filepos=['"]{0,1}0*(\d+)['"]{0,1}(.*?)>''', re.IGNORECASE)
#srctext = link_pattern.sub(r'''<a href="#filepos\1">''', srctext)
srctext = link_pattern.sub(r'''<a href="#filepos\1"\2>''', srctext)
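
For anyone who wants to see the difference between the two routines, here is a small stand-alone test that runs both patterns from the snippet above against the link from the failed round trip (the one with two spaces after "<a"). The wrapper function and variable names are made up for this example; only the two patterns and their replacement strings come from the quoted code.

```python
import re

# Link copied from the round-trip output quoted earlier (two spaces after "<a").
SAMPLE = '<a  filepos=0000008568 ><font color="blue"><u>Consultations</u></font></a>'

# Commented-out pattern: expects exactly one space after "<a".
OLD_PATTERN = re.compile(r'''<a filepos=['"]{0,1}0*(\d+)['"]{0,1} *>''', re.IGNORECASE)
# Active pattern: any run of whitespace; the rest of the tag is kept in group 2.
NEW_PATTERN = re.compile(r'''<a\s+filepos=['"]{0,1}0*(\d+)['"]{0,1}(.*?)>''', re.IGNORECASE)

def convert_links(text, pattern, replacement):
    """Rewrite filepos links as href links (hypothetical wrapper for the demo)."""
    return pattern.sub(replacement, text)

old_result = convert_links(SAMPLE, OLD_PATTERN, r'<a href="#filepos\1">')
new_result = convert_links(SAMPLE, NEW_PATTERN, r'<a href="#filepos\1"\2>')

print(old_result == SAMPLE)  # True: the old pattern does not match, link left as-is
print(new_result)
```

The second pattern also strips the leading zeros, because `0*` consumes them before the capture group starts.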

DiapDealer · 12-11-2011, 06:29 PM

Quote:

Originally Posted by sourcejedi

[Nitpick: I think you quoted the wrong link_pattern - there's two of them, and the first appears unchanged. The relevant one has your name next to it in 0.32

That's not nitpicking... that's flat-out busting me for taking such a cursory glance at the code.

KevinH · 12-16-2011, 11:18 AM

Hi All,

You should check out the following link to get copies of the new Amazon K8 format files to play around and test with:

http://www.the-digital-reader.com/20...now-available/

I grabbed the Jerome.mobi and tried unpacking it via mobiunpack.py with all DEBUG turned on.

It seems that Amazon have simply combined two different mobi ebooks into one palm doc container.

The one at the top is simply the normal mobi and mobiunpack works well on it but it generates extra raw pieces. You can find all of these extra raw pieces hidden away as image*.raw files inside the images folder. These include FONT and RESC files plus copies of each section in its own file until the end of the palm doc. So by examining these extra image*.raw files in a text editor we can see what each section of the palmdoc contains.

Immediately after the normal mobi ebook (in the very next section) you can find a whole section that appears to be nothing but the word "BOUNDARY" which seems to be the divider between the older .mobi file format and the new format.

It is followed by what looks like a new section 0 mobi header, and that is followed by all of the raw .xhtml in each section until the end (but unlike true image sections these have been compressed, so we will need to uncompress them to see what the new xhtml looks like). So the old format mobi is at the top of the palmdoc container, and immediately after the images and FLIS, FCIS (the images appear to be shared by both versions of the ebook) you can see the pieces that make up the new format.

So it appears we can look for things in the first mobi header that indicate that KF8-style data is included, and then parse those records using the new section 0, very much like we process the original mobi.
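
As a starting point, the divider can be located by walking the Palm database section table and looking for a section whose content is just "BOUNDARY". The sketch below takes that simpler route instead of reading a flag from the first mobi header, since which header field signals KF8 is not settled here. It relies on the standard Palm database layout (section count at offset 76, 8-byte table entries from offset 78); the exact-match test, the function names, and the synthetic container are assumptions for the demo, not real ebook data.

```python
import struct

def read_sections(data):
    """Split a Palm database image into its raw sections."""
    (num_sections,) = struct.unpack_from(">H", data, 76)
    offsets = [struct.unpack_from(">L", data, 78 + 8 * i)[0]
               for i in range(num_sections)]
    offsets.append(len(data))  # the last section runs to the end of the file
    return [data[offsets[i]:offsets[i + 1]] for i in range(num_sections)]

def find_boundary(sections):
    """Return the index of the first section equal to b'BOUNDARY', else None."""
    for index, section in enumerate(sections):
        if section == b"BOUNDARY":
            return index
    return None

def build_container(sections):
    """Build a tiny synthetic Palm database for the demo (not a real ebook)."""
    header = bytearray(78)
    struct.pack_into(">H", header, 76, len(sections))
    offset = 78 + 8 * len(sections)
    table = b""
    for number, section in enumerate(sections):
        # 4-byte offset, then 1 attribute byte and a 3-byte unique id.
        table += struct.pack(">LL", offset, number)
        offset += len(section)
    return bytes(header) + table + b"".join(sections)

demo = build_container([b"old MOBI header", b"old text", b"BOUNDARY",
                        b"KF8 header", b"KF8 text"])
sections = read_sections(demo)
boundary = find_boundary(sections)
print(boundary)                # index of the divider section
print(sections[boundary + 1])  # what would be the new "section 0"
```

Everything after the divider could then be handed to the existing header-parsing code with the section numbers shifted by `boundary + 1`.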

So, does anyone want to take a shot at modifying the latest mobiunpack to unpack both versions of the files for these new K8s?

Volunteers welcome!

DaleDe · 12-16-2011, 12:07 PM

The second entry is the source file I believe, generally an ePub exactly duplicated. Or are you talking about some other data?

KevinH · 12-16-2011, 12:17 PM

Hi,

No, there is a separate section for the source zip file as well. We are talking about the K8 version of the ebook, packed immediately after the normal mobi one in one palmdoc container.

Grab version 0.32 of mobiunpack, edit it with a text editor to set DEBUG = True and run it on that K8 ebook and examine the extra .raw sections stored under debug mode inside of the image folder to see what I am referring to.

KevinH · 12-23-2011, 01:59 PM

Hi,

Just in case anyone wants to play around with the latest K8 .mobi files, I have attached a newest_mobi_unpack.zip

I made massive changes and reorganized everything and split it into many different files and then renamed it to mobi_unpack to prevent confusion.

This is very experimental and probably will not work for you.

But if you want to play around, download and unzip it. Copy the test Jerome.mobi (see earlier link) into that directory. Change to that directory and then run:

python ./mobi_unpack.py Jerome.mobi test/

(or whatever the windows equivalent is if you are on windows)

If it works, inside of test you should see the original mobi info, a K8 folder that has the new K8 xhtml files, and a Jerome.epub which is the epub created from the new K8 files.

You should also see a kindlegensrc.zip file which represents the original epub that was used to generate the Jerome.mobi which you can unzip and compare against the files in the K8 folder or the Jerome.epub.

Please report any difficulties so we can fix any bugs.

Happy Holidays!

KevinH

KevinH · 12-24-2011, 10:46 PM

FYI:
DiapDealer found and fixed a number of bugs in the new mobi_unpack program for K8 files.
Thanks to DiapDealer!

So if anyone wants the updated version, check out my later posts in this thread to find the very latest version.

KevinH

lizcastro · 01-12-2012, 01:25 PM

Thanks, Kevin! This is so helpful.

Can you confirm that the only thing mobi_unpack does is show what was in the mobi file? It doesn't generate anything, right?

When I convert an EPUB file to mobi with KindleGen2, and then unpack it with your latest version of mobi_unpack, I get a folder that contains a smaller version of the EPUB file than the original, an HTML file with what looks like the contents of the entire book, along with an ncx and opf file, and a folder with reduced size images.

Then, there's a K8 folder that contains a completely re-engineered set of files, all renamed, resized images, etc. of what was originally in my EPUB file.

And then there's a kindlegensrc.zip file, that when unzipped, contains my original unaltered files.

It all seems so excessive.

thanks,
Liz
