MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

10-05-2014, 10:40 AM	#1010
tkeo Connoisseur Posts: 94 Karma: 10 Join Date: Feb 2014 Location: Japan Device: Kindle PaperWhite, Kobo Aura HD

Hi Kevin,

I have tested a few ebooks with the experimental code and got errors. Yes, it has bugs (quite a lot, I think).

The test environment is as follows:

Python versions are 2.7.6 and 3.3.4.1 for 32-bit Windows.
The Windows codepage is cp932.
PYTHONIOENCODING=utf-8 is set.

I got the following errors:

1. HDimage_test.mobi (an epub3 fixed layout ebook which I posted before)

Unpacked successfully with Python 2, but with Python 3 I got an error message:

Spoiler:

    Unpacking Book...
    Palm DB type: BOOKMOBI, 38 sections.
    Unpacking a Combination M8/KF8 book...
    Processing Mobipocket 5 section of book...
    Mobi Version: 5
    Codec: utf-8
    Title: b'HD Content test'
    Palmdoc compression
    Unpacking images, resources, fonts, etc
    Extracting image: image00003.jpeg from section 3
    Extracting image: image00004.jpeg from section 4
    Extracting image: image00005.jpeg from section 5
    Extracting image: image00006.jpeg from section 6
    Extracting image: image00007.jpeg from section 7
    Extracting image: cover00008.jpeg from section 8
    Extracting image: image00010.jpeg from section 10
    File contains kindlegen source archive, extracting as kindlegensrc.zip
    File contains kindlegen build log, extracting as kindlegenbuild.log
    Unpacking raw markup language
    Write ncx
    Find link anchors
    Insert data into html
    Insert hrefs into html
    Remove empty anchors from html
    Insert image references into html
    Building an opf for mobi7/azw4.
    Processing K8 section of book...
    Mobi Version: 8
    Codec: utf-8
    Title: b'HD Content test'
    Palmdoc compression
    Unpacking images, resources, fonts, etc
    Extracting HD image: HDimage00029.jpeg from section 29
    Extracting HD image: HDimage00030.jpeg from section 30
    Extracting HD image: HDimage00031.jpeg from section 31
    Extracting HD image: HDimage00032.jpeg from section 32
    Extracting HD image: HDimage00034.jpeg from section 34
    Unpacking raw markup language
    Warning: There are unprocessed index bytes left: b'0000'
    Processing ncx / toc
    Building an epub-like structure
    Building proper xhtml for each file
    Traceback (most recent call last):
      File "kindleunpack.py", line 1008, in <module>
        sys.exit(main())
      File "kindleunpack.py", line 996, in main
        unpackBook(infile, outdir, apnxfile, epubver, use_hd)
      File "kindleunpack.py", line 910, in unpackBook
        process_all_mobi_headers(files, apnxfile, sect, mhlst, K8Boundary, False, epubver, use_hd)
      File "kindleunpack.py", line 827, in process_all_mobi_headers
        processMobi8(mh, metadata, sect, files, imgnames, pagemapproc, k8resc, obfuscate_data, apnxfile, epubver)
      File "kindleunpack.py", line 523, in processMobi8
        usedmap = htmlproc.buildXHTML()
      File "mobi_html.py", line 367, in buildXHTML
        replacement = b'%s%s%s'%(osep, b'../Images/' + imageName, csep)
    TypeError: can't concat bytes to str

2. test2.awz3 (an epub2 reflowable ebook in English with several images)

Got errors with both versions.

With Python 2:

Spoiler:

    Unpacking Book...
    Palm DB type: BOOKMOBI, 190 sections.
    Warning: Bad key, size, value combination detected in EXTH 406 16 0000000000000000
    Unpacking a KF8 book...
    Processing K8 section of book...
    Mobi Version: 8
    Codec: utf-8
    Title: XXXXXXXX
    EXTH Title: XXXXXXXX
    Huffdic compression
    Unpacking images, resources, fonts, etc
    Extracting image: image00172.jpeg from section 172
    Extracting image: image00173.jpeg from section 173
    Extracting image: image00174.gif from section 174
    Extracting image: image00175.gif from section 175
    Extracting image: image00176.jpeg from section 176
    Extracting image: image00177.gif from section 177
    Extracting image: image00178.gif from section 178
    Extracting image: cover00179.jpeg from section 179
    Extracting image: image00180.jpeg from section 180
    Extracting image: image00181.jpeg from section 181
    Extracting image: image00183.jpeg from section 183
    Unpacking raw markup language
    Error: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
    Traceback (most recent call last):
      File "kindleunpack.py", line 996, in main
        unpackBook(infile, outdir, apnxfile, epubver, use_hd)
      File "kindleunpack.py", line 910, in unpackBook
        process_all_mobi_headers(files, apnxfile, sect, mhlst, K8Boundary, False, epubver, use_hd)
      File "kindleunpack.py", line 827, in process_all_mobi_headers
        processMobi8(mh, metadata, sect, files, imgnames, pagemapproc, k8resc, obfuscate_data, apnxfile, epubver)
      File "kindleunpack.py", line 456, in processMobi8
        rawML = mh.getRawML()
      File "mobi_header.py", line 785, in getRawML
        dataList.append(self.unpack(data))
      File "mobi_uncompress.py", line 131, in unpack
        slice = self.unpack(slice)
      File "mobi_uncompress.py", line 133, in unpack
        s += slice
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

With Python 3:

Spoiler:

    Unpacking Book...
    Palm DB type: BOOKMOBI, 190 sections.
    Traceback (most recent call last):
      File "kindleunpack.py", line 1008, in <module>
        sys.exit(main())
      File "kindleunpack.py", line 996, in main
        unpackBook(infile, outdir, apnxfile, epubver, use_hd)
      File "kindleunpack.py", line 869, in unpackBook
        mh = MobiHeader(sect,0)
      File "mobi_header.py", line 484, in __init__
        reader.loadCdic(self.sect.loadSection(huffoff+i))
      File "mobi_uncompress.py", line 97, in loadCdic
        self.dictionary += lmap(getslice, struct.unpack_from(b'>%dH' % n, cdic, 16))
    TypeError: unsupported operand type(s) for %: 'bytes' and 'int'

3. kokoro.mobi (an epub3 rtl reflowable ebook in Japanese)

Unpacked as an epub2 ebook instead of epub3 with both versions.

I will look at the code and debug it, if possible, after tomorrow.

Take care,

Last edited by tkeo; 10-05-2014 at 10:49 AM.
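
The two Python 3 failures above look like bytes/str mixing rather than anything specific to these books. Here is a minimal sketch of one way around them. It is not the KindleUnpack code: `join_bytes` and `cdic_format` are hypothetical helper names, and UTF-8 is an assumed encoding for image names. It avoids both `bytes + str` concatenation and `bytes % int`. Python 3 supports the latter only from 3.5 onward, so it fails on the 3.3 interpreter used here.

```python
import struct


def join_bytes(*parts):
    """Concatenate parts into one bytes object.

    Hypothetical helper: any part that is not already bytes is assumed
    to be text and is encoded as UTF-8 before joining.
    """
    out = []
    for part in parts:
        if not isinstance(part, bytes):
            part = part.encode('utf-8')
        out.append(part)
    return b''.join(out)


def cdic_format(n):
    """Return a struct format for n big-endian unsigned 16-bit values.

    Hypothetical helper: the format is built as a native str, so no
    bytes % int formatting is needed.
    """
    return '>%dH' % n


# Case 1: image name held as text, separators held as bytes.
osep, csep = b'"', b'"'
image_name = 'image00003.jpeg'
replacement = join_bytes(osep, b'../Images/', image_name, csep)
print(replacement)

# Case 2: read three 16-bit offsets that start 16 bytes into a buffer
# (the buffer here is made-up test data, not a real CDIC record).
cdic = b'\x00' * 16 + struct.pack('>3H', 10, 20, 30)
offsets = struct.unpack_from(cdic_format(3), cdic, 16)
print(offsets)  # (10, 20, 30)
```

Whether image names should stay as bytes throughout or be decoded once at the boundary is a design choice for the real code; the sketch only shows that the two expressions in the tracebacks can be written so they behave the same on Python 2.7 and 3.3.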