MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

KevinH · 10-09-2014, 09:33 AM

Hi tkeo,
Please note that everything in mobi_k8proc.py except file names and paths must be bytes (bytestrings), so that nothing in that routine is forced to upconvert to unicode. Only file names and path segments should have utf8_str applied to them, and all literals should start with a b.
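
A minimal sketch of that rule, assuming Python 3. The helper to_utf8_bytes and the function make_link are hypothetical stand-ins invented for illustration; they are not the actual utf8_str or any code from mobi_k8proc.py.

```python
def to_utf8_bytes(value):
    """Hypothetical stand-in: pass bytes through, encode text as UTF-8."""
    if isinstance(value, bytes):
        return value
    return value.encode('utf-8')


def make_link(dirname, filename):
    # Path segments may arrive as text, so they are converted explicitly.
    # Every literal is a bytes literal (b'...'), so the result stays bytes.
    return (b'<link href="' + to_utf8_bytes(dirname) + b'/'
            + to_utf8_bytes(filename) + b'" />')


link = make_link('Styles', 'style0001.css')
print(link)
```

Forgetting the b on any one literal here raises a TypeError in Python 3, which is at least a loud failure rather than a silent one.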

Because the type field was mixed (sometimes 'inline' vs b'inline', or 'file' vs b'file'), the code did not properly detect that we had inlined the CDATA, and this is what caused the None in the directory.
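
To illustrate why this kind of mix-up fails silently, here is a small sketch assuming Python 3, where a text string never compares equal to a bytes string. The function classify and its return values are hypothetical and are not taken from KindleUnpack.

```python
def classify(part_type):
    # Hypothetical check that only compares against bytes literals.
    if part_type == b'inline':
        return 'inlined'
    if part_type == b'file':
        return 'external'
    # A text value such as 'inline' matches neither branch above.
    return None


result_bytes = classify(b'inline')  # bytes type field: detected
result_text = classify('inline')    # text type field: falls through
print(result_bytes, result_text)
```

No exception is raised for the text value; the comparison is simply False, so the mistake only shows up later as an unexpected None.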

In mobi_cover.py, everything is converted to unicode as soon as possible, and only the cover page xhtml should be encoded to utf-8, just before file creation.
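
The "unicode throughout, encode once at the end" pattern could look roughly like the sketch below, assuming Python 3. The function names, the file name cover_page.xhtml and the page template are all made up for illustration and do not come from mobi_cover.py.

```python
import os
import tempfile


def build_cover_page(title, image_name):
    # Everything here is text (str); no bytes are involved yet.
    # Sketch only: the title is not XML-escaped.
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        '<head><title>' + title + '</title></head>\n'
        '<body><img src="' + image_name + '" alt="cover"/></body>\n'
        '</html>\n'
    )


def write_cover_page(path, page_text):
    # The single encode step happens right before the file is created.
    with open(path, 'wb') as f:
        f.write(page_text.encode('utf-8'))


out_dir = tempfile.mkdtemp()
cover_path = os.path.join(out_dir, 'cover_page.xhtml')
write_cover_page(cover_path, build_cover_page('Caf\u00e9', 'cover.jpeg'))

with open(cover_path, 'rb') as f:
    written = f.read()
```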

I will check to make sure that the code follows these rules.

Thanks for testing. I will get a new version out after incorporating some of your changes, while trying to keep to the rules I stated above.

Thanks,

KevinH

ps. I have attached a new version of nlib.zip below.

If this is no more stable than its predecessor, I am going to split out all binary routines into their own modules and convert to unicode at the boundary instead of trying to keep the code mostly the same as before.

It is just too hard to know the data type of each variable, so it gets confusing which values should be converted to binary strings and which should be kept as unicode.

Right now most parts of mobi_header, mobi_index, mobi_dict, parts of mobi_ncx, parts of mobi_k8resc, mobi_k8proc, and most parts of mobi_html require some form of binary bytes code. Then kindleunpack has to deal with both types and convert properly when mixing them.

If I have to, I will completely refactor the code to move all binary bytes processing into dedicated modules, so that data types are never mixed within the same module.
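
One possible shape for such a split, as a sketch assuming Python 3: a binary-only function that takes and returns bytes, plus a single boundary function where bytes become text. The record layout (a two-byte big-endian length followed by the payload) and both function names are invented for this example and are not the Mobi format or KindleUnpack code.

```python
import struct


def parse_record(raw):
    """Binary side: accepts bytes and returns bytes, never text."""
    (length,) = struct.unpack_from('>H', raw, 0)
    return raw[2:2 + length]


def read_title(raw, codec='utf-8'):
    """Boundary: the one place where bytes are decoded to text."""
    return parse_record(raw).decode(codec)


# Invented sample record: length 5, payload b'Title', then padding.
sample = struct.pack('>H', 5) + b'Title' + b'\x00\x00'
title = read_title(sample)
print(title)
```

With this layout, any function on the binary side can assume bytes and any caller of the boundary function can assume text, so the type of each variable follows from where it lives.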

I probably should have done that from the beginning, but I never dreamed it would be this hard to track down and fix all of the inconsistencies.

Thanks again for all of your hard work testing it and sending patches.

KevinH