MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

KevinH · 10-04-2014, 12:12 PM

Hi tkeo,

Attached is my experimental conversion of kindleunpack (command line only - no gui yet) to run on both Python 2.7 and Python 3.4. It should run on Python 3.3 as well but I only have Python 3.4.1 on my machine to test it with.

Please note:

1. This has only been tested by unpacking one large kindlegen-generated mobi; a diff -urN against the output of standard KindleUnpack showed that nothing was amiss.

2. It has not been tested with a Japanese ebook, a fixed-layout book, or anything complex, so there will still be lots of bugs to iron out. Neither font encryption/decryption nor dictionaries have been tested, and I expect many bugs remain in those sections of code.

3. Because even one small piece of bytestring vs unicode can mess things up in python 3, care must be taken when making any changes ...

Right now, and after much trial and error -

I keep the actual html files and all processing of them (building the RawML in mobi_header.py, mobi_k8proc.py and mobi_html.py) working only in bytestrings, since byte offsets are needed for link targets and for inserting fragments. Any conversion to unicode would throw off all of the byte offsets horribly, because a single character can occupy more than one byte in utf-8, and must be avoided at all costs until all position / byte offsets have been processed.
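To illustrate why the offsets break, here is a minimal sketch with made-up data (the `rawml` content and the `target` id are hypothetical, not taken from a real book). It only shows that an offset found in the bytes no longer points at the same place once the data has been decoded:

```python
# -*- coding: utf-8 -*-
# Hypothetical utf-8 "rawml" with one multi-byte character before the target.
rawml = u'<p>caf\u00e9</p><p id="target">here</p>'.encode('utf-8')
marker = b'<p id="target">'

# A link target expressed as a byte offset into the encoded data.
byte_offset = rawml.find(marker)

# The same search after decoding gives a character offset instead.
text = rawml.decode('utf-8')
char_offset = text.find(marker.decode('utf-8'))

# The accented character takes two bytes in utf-8, so the two offsets
# differ by one here; every further multi-byte character adds to the drift.
drift = byte_offset - char_offset

# Used on the bytes, the byte offset lands exactly on the tag ...
on_bytes = rawml[byte_offset:].startswith(marker)
# ... but used on the decoded text it lands one character too far.
on_text = text[byte_offset:].startswith(marker.decode('utf-8'))
```

With Japanese text, where nearly every character is three bytes in utf-8, the drift grows very quickly, which is why decoding has to wait until the offsets have been used.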

The same holds true for processing the binary index data.

I convert to and use unicode for mobi_ncx.py, mobi_nav.py, mobi_opf.py, and for all metadata (in mobi_header.py).

In mobi_k8resc.py and mobi_pagemap.py I start processing with bytestrings until the resc data is extracted and then convert to operating in full unicode.

Unfortunately, kindleunpack.py and mobi_utils are a mix of bytestring and full unicode, since they have to deal with all of this nonsense coming from different directions.

4. In other words ... this port is very temperamental and very fragile. More work will need to be done to stabilize it and revisit how soon we can convert to unicode in the html processing.

Manipulating bytes in Python3 is limited and fraught with inconsistencies ...

- no use of % formatting to fold ascii or utf-8 strings into binary data (there is a PEP on this, PEP 461)

- struct.unpack will only accept a bytestring (not unicode) as its format string in Python 2, all the way up to and including Python 2.7.5 and possibly later

- issues with iterating over bytes and extracting single bytes from bytestrings, and there is a PEP on this as well (PEP 467) but nothing definite yet

- issues with "re" requiring byte patterns to work on bytestrings and vice versa

- lots of inconsistencies in many other areas, which I have tried to take care of with the compatibility_utils.py code I have collected from all over the net
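A few of the points above can be shown in a short sketch. The data and variable names are made up for illustration, and this is not the code from compatibility_utils.py; it only shows spellings that I would expect to behave the same on Python 2.7 and Python 3:

```python
import re
import struct

# Hypothetical packed data: a 4-byte big-endian integer followed by markup.
data = b'\x00\x00\x01\x02<a href="x">'

# Indexing bytes gives an int on Python 3 but a 1-char str on Python 2.
# Slicing gives a bytestring on both, so use a 1-byte slice instead.
one_byte = data[3:4]

# Iterating over a bytearray gives ints on both Python 2 and Python 3.
byte_values = list(bytearray(data[:4]))

# A plain (native str) format literal is accepted by struct on both versions,
# as long as unicode_literals is not in effect on Python 2.
(value,) = struct.unpack('>L', data[:4])

# "re" needs a bytes pattern to search bytes; br'' is valid on 2.7 and 3.x.
match = re.search(br'href="([^"]*)"', data)
href = match.group(1)

# On Python 3, mixing a str pattern with bytes data raises TypeError.
try:
    re.search(u'href', data)
    mixed_raised = False
except TypeError:
    mixed_raised = True

# Without % formatting for bytes, build binary output by joining pieces.
name = u'frag1'.encode('utf-8')
tag = b''.join([b'<a id="', name, b'"></a>'])
```

The 1-byte slice and the bytearray are the two workarounds I would reach for first, since neither depends on which Python version is running.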

It is clear that the official python programmers have never had to work close to the metal, nor worked with packed binary data, otherwise they would never have given bytestrings such a second-rate, inferior implementation. In fact, it was not until the recent Python 3.3 and 3.4 releases that I would even have tried to use Python 3, as bytes support was just too horribly broken in Python 3.0, 3.1, and 3.2 to contemplate.

Instead of breaking backwards compatibility going from 2 to 3, all they had to do was start aggressively deprecating auto-conversion of bytestrings to unicode, forcing the developer to slowly track down and change things before removing the support.

Now they seem to be stuck defending their initial stupidity and holding firm to their ideals of "unicode or die - kill all bytestring use for text" even though it is killing the uptake of Python 3.

Oh well, please post any bugs here so that I can try and get them fixed. I would like this to eventually become the future codebase of kindleunpack moving forward.

Take care,

KevinH

Post #1001 · Subject: experimental command-line version of kindleunpack for both Python 2 and Python 3 · Last edited by KevinH; 10-05-2014 at 11:45 AM.