KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 67

pdurrant · 09-14-2014, 04:02 AM

I have updated the first post and the AppleScript.

DiapDealer · 09-15-2014, 05:12 PM

Out of curiosity, is the media-type "text/x-oeb1-document" found in a resource record within the MOBI when generating the content.opf file for a MOBI-only (non-KF8) kindlebook, or is it hardcoded in the KindleUnpack code? If the latter, is there a compelling reason for keeping it that way and not updating to an "application/xhtml+xml" media-type? I realize the markup file being produced isn't really xhtml, but "text/x-oeb1-document" is deprecated in the latest 2.x OPF package we appear to be building. Is kindlegen even still accepting these unpacked old-style mobi-markup files as input anymore?

KevinH · 09-15-2014, 09:13 PM

Hi Doug,

In mobi_opf.py, in the part that builds the manifest for the opf, there is this media-map that determines things. The KF8 part unpacks to .xhtml file extensions while the older mobi part unpacks to .html and so gets that strange media-type.

Code:

media_map = {
                '.jpg'  : 'image/jpeg',
                '.jpeg' : 'image/jpeg',
                '.png'  : 'image/png',
                '.gif'  : 'image/gif',
                '.svg'  : 'image/svg+xml',
                '.xhtml': 'application/xhtml+xml',
                '.html' : 'text/x-oeb1-document', # for mobi7
                '.pdf'  : 'application/pdf', # for azw4(print replica textbook)
                '.ttf'  : 'application/x-font-ttf',
                '.otf'  : 'application/x-font-opentype', # replaced?
                #'.otf' : 'application/vnd.ms-opentype', # [OpenType] OpenType fonts
                #'.woff' : 'application/font-woff', # [WOFF] WOFF fonts
                #'.smil' : 'application/smil+xml', # [MediaOverlays301] EPUB Media Overlay documents
                #'.pls' : 'application/pls+xml', # [PLS] Text-to-Speech (TTS) Pronunciation lexicons
                #'.mp3'  : 'audio/mpeg',
                #'.mp4'  : 'audio/mp4',
                #'.js'   : 'text/javascript', # not supported in K8
                '.css'  : 'text/css'
                }

So it would be easy to change in KindleUnpack. That said, I passed a content.opf from an old mobi through kindlegen 2.9 and it generated a lot of warnings and built a KF8 part that would never pass any epub check.
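As a rough sketch of how a map like this can drive the manifest: the helper name `manifest_item` and the fallback media-type below are assumptions for illustration, not the actual mobi_opf.py code, and the map is trimmed to a few entries.

```python
import os

# Trimmed copy of the media_map quoted above; only a few entries are kept.
media_map = {
    '.jpg': 'image/jpeg',
    '.xhtml': 'application/xhtml+xml',
    '.html': 'text/x-oeb1-document',  # for mobi7
    '.css': 'text/css',
}


def manifest_item(item_id, href, default='application/octet-stream'):
    """Hypothetical helper: build one OPF manifest entry for a file."""
    # The media-type is chosen purely from the (lower-cased) file extension.
    ext = os.path.splitext(href)[1].lower()
    media_type = media_map.get(ext, default)
    return '<item id="%s" media-type="%s" href="%s" />' % (item_id, media_type, href)


print(manifest_item('item1', 'book.html'))
```

Under these assumptions, switching the mobi7 output to "application/xhtml+xml" would be a one-line change to the '.html' entry.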

So it looks like even Kindlegen is requiring a valid epub as input, otherwise it generates junk for the KF8 part. I thought that unpacking an old mobi and then passing it back through kindlegen might be an easy way to convert from html 3 to true xhtml. No such luck.

I frankly think we should use the old mobiml2xhtml.py codebase (actually its newer cousin from your KindleImport) and try and create at least a basic, valid epub-like structure from the old mobi part. Kindlegen seems to be much more adept at taking valid epub xhtml and making old html 3 than doing the reverse.

If others agree, I would be happy to incorporate it into the next KindleUnpack release.

Take care,

Kevin

DiapDealer · 09-16-2014, 07:23 AM

I'd be OK with that. I do think we need to retain the ability to produce/examine the mobiml file, though: if only for testing and for seeing what mobiml code is actually being produced by certain xhtml/epub input.

tkeo · 09-17-2014, 08:38 AM

Hi,

I have no reason to oppose implementing a new feature.
But I'd like to ask what is mobiml?

Thanks,

DiapDealer · 09-17-2014, 08:54 AM

Quote:

Originally Posted by tkeo

Hi,

I have no reason to oppose implementing a new feature.
But I'd like to ask what is mobiml?

Thanks,

Sorry. It's just a shortcut to what I (and others) would call the mobi markup language. It's what's in the *.html file in the Mobi 7 folder. The (nearly) raw output of the mobi-only portion of a kindlebook (image references and the like are rebuilt). Very similar to HTML 3 with a few additions (and plenty of garbage).

There's currently some work going on in another project to upgrade a semi-retired mobiml2html script to take that mobi markup and spit out something as close to xhtml as possible (while maintaining the formatting of the original book). It's made more difficult by the sheer amount of junk that can sometimes be found in that mobi markup (inline elements that cross block-level element boundaries, improperly nested and/or mismatched tags, as well as opf and ncx markup in the headers and bodies). Not to mention tags that are invalid/deprecated in xhtml.

KevinH · 10-03-2014, 05:02 PM

Hi All,

I have an experimental version of KindleUnpack that will run on both Python 2.7 and Python 3.4 from a single code base. The conversion took much longer than I expected because of massively ass-backwards decisions by the developers of Python 3. If interested please see the following issue:

http://bugs.python.org/issue22549

The problem happens even if you use an iterator to access the bytes in a bytes string. The only way to get the actual characters in a bytestring is to use a slice.
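To make the behaviour concrete, here is a minimal sketch. The sample bytes are made up, and the results noted in the comments describe Python 3; on Python 2 indexing and iteration return one-character strings instead.

```python
data = b'MOBI'

# Indexing a bytes object in Python 3 returns an int, not a length-1 bytes.
first = data[0]

# Slicing returns bytes on both Python 2 and Python 3.
first_slice = data[0:1]

# Iterating behaves like indexing: Python 3 yields ints.
items = [c for c in data]

# A portable way to walk the data one "character" at a time is to slice.
chunks = [data[i:i + 1] for i in range(len(data))]

print(first, first_slice, items, chunks)
```

Code that compares `data[0]` with a one-byte string such as `b'M'` keeps working on Python 2 but silently compares unequal on Python 3, which fits the description above of a problem that is hard to detect.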

This hit KindleUnpack horribly in the mobi_uncompress.py, mobi_dict.py, mobi_header.py, and mobi_index.py and the problem was hard to detect and not immediately obvious at all.

I think I now understand why Kovid is so reluctant to move to Python 3 or even Python 2 / Python 3 joint compatibility. It literally took me two entire days and evenings to make the initial conversion. Calibre is much much much larger, and has to manipulate bytestrings in places for binary format files even more than KindleUnpack does. Converting Calibre would take a herculean effort!

If there are any testers with both python 3.4 and python 2.7 installed who would like to play around with this experimental KindleUnpack via the command-line, please let me know and I will post it.

Otherwise, once I get the bugs ironed out, I will make an official release and we will attempt to keep future versions of KindleUnpack able to run on both platforms.

KevinH

JSWolf · 10-03-2014, 08:49 PM

WOW! How could the developers of Python 3 actually defend such a crappy way of doing things?

tkeo · 10-03-2014, 09:01 PM

Hi Kevin,

Thank you for your hard work.

I have thought moving to python 3 is necessary, but hesitated because it would be hard to unpack bytes and to handle utf-8.

Quote:

Originally Posted by KevinH

If there are any testers with both python 3.4 and python 2.7 installed who would like to play around with this experimental KindleUnpack via the command-line, please let me know and I will post it.

I'm willing to test it.
I have ActivePython 3.3.4.1 (now uninstalled however). Does it work with 3.3?

Thanks,

KevinH · 10-04-2014, 09:46 AM

Hi tkeo,
Thanks. I still have a few more tests to run to exercise the epub3 code and dictionary code, then I will post it tonight for you.

Take care,

Kevin

KevinH · 10-04-2014, 12:12 PM

Hi tkeo,

Attached is my experimental conversion of kindleunpack (command line only - no gui yet) to run on both Python 2.7 and Python 3.4. It should run on Python 3.3 as well but I only have Python 3.4.1 on my machine to test it with.

Please note:

1. This has only been tested by unpacking one large kindlegen-generated mobi, and a diff -urN showed that nothing was amiss when compared to standard KindleUnpack.

2. It has not been tested with a Japanese ebook nor a fixed layout or anything complex so there still will be lots of bugs to iron out. No font encryption/decryption has been tested nor have any dictionaries. I expect there still to be many bugs in those sections of code.

3. Because even one small piece of bytestring vs unicode can mess things up in python 3, care must be taken when making any changes ...

Right now, and after much trial and error -

I keep the actual html files and the processing of them (building RawML in mobi_header.py, mobi_k8proc.py and mobi_html.py) working only in bytestrings, since byte offsets are needed for link targets and for inserting fragments. Any conversion to unicode would throw off all of the byte offsets horribly and must be avoided at all costs until all position / byte offsets have been processed.

The same holds true for processing the binary index data.
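A small sketch of why decoding early breaks the offsets. The sample markup and the inserted `<b/>` fragment are invented for illustration and are not taken from KindleUnpack.

```python
# UTF-8 bytes containing one non-ASCII character (e-acute takes 2 bytes).
raw = u'caf\u00e9 <a id="x"/>'.encode('utf-8')

# Offsets stored in the mobi refer to positions in the byte stream.
byte_pos = raw.find(b'<a')

# After decoding, the same tag sits at a smaller character offset.
text = raw.decode('utf-8')
char_pos = text.find(u'<a')

# Inserting at the byte offset while still in bytes lands in the right place.
patched = (raw[:byte_pos] + b'<b/>' + raw[byte_pos:]).decode('utf-8')

# Mixing the two kinds of offset puts the fragment one position too early.
misplaced = (raw[:char_pos] + b'<b/>' + raw[char_pos:]).decode('utf-8')

print(byte_pos, char_pos)
print(patched)
print(misplaced)
```

Every multi-byte character before a target shifts the character offset further from the byte offset, so in a real book the drift grows through the file.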

I convert to and use unicode for mobi_ncx.py, mobi_nav.py, mobi_opf.py, and for all metadata (in mobi_header.py).

In mobi_k8resc.py and mobi_pagemap.py I start processing with bytestrings until the resc data is extracted and then convert to operating in full unicode.

Unfortunately, kindleunpack.py and mobi_utils are a mix of bytestring and full unicode since they have to deal with all of this nonsense coming from different directions.

4. In other words ... this port is very temperamental and very fragile. More work will need to be done to stabilize it and revisit how soon we can convert to unicode in the html processing.

Manipulating bytes in Python 3 is limited and fraught with inconsistencies ...

- no use of % to fold ascii or utf-8 strings into binary data (there is a pep on this)

- struct.unpack will only work with bytestring formats in python all the way up to and including python 2.7.5 and possibly later

- issues with iterating bytes and extracting single bytes from bytestrings, and there is a pep on this as well (pep 467) but nothing definite yet

- issues with "re" requiring byte patterns to work on bytestrings and vice versa

- lots of inconsistencies with many other things I have tried to take care of with the compatibility_utils.py code I have collected from all over the net
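Two of the points above can be sketched in a few lines. The record contents are invented for illustration, and wrapping the struct format in `str()` is only one possible workaround (it yields the native string type on both Python 2 and Python 3, which matters when `unicode_literals` is in effect on older 2.7 releases); the `mixed_ok` result describes Python 3.

```python
import re
import struct

# Made-up record: a 4-byte big-endian integer followed by some markup.
record = b'\x00\x00\x00\x2a<a filepos=0000001234>'

# str('>L') is the native str type on both Python 2 and Python 3.
(value,) = struct.unpack(str('>L'), record[:4])

# Bytes data needs a bytes pattern; the match groups are bytes as well.
m = re.search(br'filepos=(\d+)', record)
filepos = int(m.group(1).decode('ascii'))

# On Python 3 a str pattern applied to bytes raises TypeError.
try:
    re.search(r'filepos=(\d+)', record)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(value, filepos, mixed_ok)
```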

It is clear that the official python programmers have never had to work close to the metal, nor worked with packed binary data, otherwise they would never have given bytestrings such a second-rate, inferior implementation. In fact, it was not until the recent Python 3.3 and 3.4 releases that I would have even tried to use Python 3, as bytes support was just too horribly broken in python 3.0, 3.1, and 3.2 to contemplate.

Instead of breaking backwards compatibility going from 2 to 3, all they had to do was start aggressively deprecating auto-conversion of bytestrings to unicode, forcing the developer to slowly track down and change things before removing the support.

Now they seem to be stuck defending their initial stupidity and holding firm to their "ideals of 'unicode or die' - kill all bytestrings use for text" even though it is killing the uptake of Python 3.

Oh well, please post any bugs here so that I can try and get them fixed. I would like this to eventually become the future codebase of kindleunpack moving forward.

Take care,

KevinH

kovidgoyal · 10-04-2014, 01:55 PM

@KevinH: I see you've started discovering the joys of Python 3

Be glad you don't have to port any C extension modules. In Python 2 strings are internally always UTF-16 (except on linux) which is great because all external libraries (the windows API, ICU, etc.) use UTF-16. As of python 3.3 a python string can be any of ascii, UCS2 or UCS4, depending on its contents. So now every time you call any external API function with a python string, you have to inspect and convert it. Joy, joy, joy.

And if you thought that dealing with binary file formats was bad, think about all the network facing code -- all network protocols are binary. I really don't know what the python 3 devs were smoking. Thank heavens python is open source and I can continue using python 2 for a long, long time. Hopefully, I can retire before it becomes necessary to port calibre from python 2.

JSWolf · 10-04-2014, 02:04 PM

Since KindleUnpack works with Python 2, why bother to make it also work with Python 3? Most people that use KindleUnpack also use other eBooks tools that are just for Python 2 and would not have a need for a version that works on Python 3.

KevinH · 10-04-2014, 02:41 PM

Hi,

One code base will now work for both, and if Sigil does package a python in the near future, it will most likely be python 3. So making KindleUnpack work on both python 2 and 3 maximizes its future usefulness to both calibre and sigil.

This will also provide an example for Sigil plugin developers who want their plugins to work on both Python 2 and Python 3. And it hedges our work just in case python 2's serious bugs never get fixed. Effectively it future-proofs our code.

KevinH

Quote:

Originally Posted by JSWolf

Since KindleUnpack works with Python 2, why bother to make it also work with Python 3? Most people that use KindleUnpack also use other eBooks tools that are just for Python 2 and would not have a need for a version that works on Python 3.

JSWolf · 10-04-2014, 03:53 PM

Quote:

Originally Posted by KevinH

Hi,

One code base will now work for both, and if Sigil does package a python in the near future, it will most likely be python 3. So making KindleUnpack work on both python 2 and 3 maximizes its future usefulness to both calibre and sigil.

This will also provide an example for Sigil plugin developers who want their plugins to work on both Python 2 and Python 3. And it hedges our work just in case python 2's serious bugs never get fixed. Effectively it future-proofs our code.

KevinH

I would think it would be wiser for Sigil to bundle Python 2 since there is a lot more code out there in Python 2 than Python 3. Many people dislike Python 3 and are sticking to Python 2. Plus, porting over Python 2 code is easier than porting over Python 2 code to run on Python 3. Add to that the fact that people who program in Python 2 would then have a learning curve moving to Python 3.

To be honest, it's best to bundle Python 2 and forget Python 3 exists.



