MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

KevinH · 03-17-2013, 11:55 AM

Hi Nick,

It seems the Description metadata item in your np.mobi testcase is properly utf-8 encoded, and it really does contain non-ascii characters: notice the smart quotes and accented chars in the snippet below.

I looked at the Description in a hex editor and all of the smart quotes appear to be utf-8 encoded and not cp1252.

---
Key: "Description"
Value: "Daily news from the National Post

Articles in this issue:
Is the war on cancer an ‘utter failure’?: A sobering look at how billions in research money is spent

Jean Chrétien: A capable caretaker, but no statesman
---
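For anyone who wants to repeat the hex-editor check without a hex editor, here is a minimal sketch in Python 3. The helper name looks_like_utf8 and the sample strings are made up for illustration and are not part of KindleUnpack. It relies on the fact that a left smart quote is three bytes in utf-8 but a single byte (0x91) in cp1252, and that a lone 0x91 is not valid utf-8.

```python
# Illustrative only: looks_like_utf8 is a hypothetical helper, not KindleUnpack code.
LEFT_QUOTE = "\u2018"  # the left smart quote seen in the Description

# The same character has very different byte patterns in the two encodings.
utf8_bytes = LEFT_QUOTE.encode("utf-8")      # three bytes: e2 80 98
cp1252_bytes = LEFT_QUOTE.encode("cp1252")   # one byte: 91


def looks_like_utf8(raw):
    """Return True if the bytestring decodes cleanly as utf-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


utf8_sample = "an \u2018utter failure\u2019? Jean Chr\u00e9tien".encode("utf-8")
cp1252_sample = "an \u2018utter failure\u2019? Jean Chr\u00e9tien".encode("cp1252")

print(utf8_bytes.hex(), cp1252_bytes.hex())
print(looks_like_utf8(utf8_sample), looks_like_utf8(cp1252_sample))
```

A clean decode does not prove the data is utf-8 (plain ascii also decodes cleanly), but a failed decode does rule it out, which is enough to tell these two cases apart.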

The error you reported seems to happen because the utf-8 bytes in the Description metadata element are not being handled properly by either the unescape or the xmlescape python library routine.

In other words, the bug fix we made to properly escape html in the metadata text fields (you can't have html inside the opf xml metadata, e.g. dc:description) now breaks when the text is utf-8, somewhere inside those libraries.

To confirm this, I made the following change to mobi_opf.py to disable the html escaping.

Code:

--- mobi_opf.py~	2013-01-12 23:40:42.000000000 -0500
+++ mobi_opf.py	2013-03-17 11:38:06.000000000 -0400
@@ -47,7 +47,8 @@
                 for value in metadata[key]:
                     # Strip all tag attributes for the closing tag.
                     closingTag = tag.split(" ")[0]
-                    data.append('<%s>%s</%s>\n' % (tag, xmlescape(self.h.unescape(value)), closingTag))
+                    # data.append('<%s>%s</%s>\n' % (tag, xmlescape(self.h.unescape(value)), closingTag))
+                    data.append('<%s>%s</%s>\n' % (tag, value, closingTag))
                 del metadata[key]
 
         def handleMetaPairs(data, metadata, key, name):

And now it unpacks just fine.

I am not sure how these library routines work, but somewhere inside they seem to assume the string is ascii, or convert it through ascii, and this causes the error when the bytestring is in fact utf-8.

So I will have to dig around in those libraries to see how to fix their handling of properly encoded bytestrings. The fix may take a while, but in the meantime you can simply disable the unescaping via the patch above.
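One direction a fix might take, purely as a sketch and not the actual KindleUnpack change: decode the bytestring to text first, run the unescape and escape steps on text, then encode back. The example below is Python 3, where html.unescape stands in for the self.h.unescape call in the patch; the function name escape_metadata_value, its codec parameter and the sample value are assumptions made up for illustration.

```python
from html import unescape
from xml.sax.saxutils import escape as xmlescape


def escape_metadata_value(value, codec="utf-8"):
    """Hypothetical helper: unescape html entities, then xml-escape the result.

    The bytestring is decoded before any string processing so that the
    library routines only ever see text, never raw utf-8 bytes.
    """
    text = value.decode(codec) if isinstance(value, bytes) else value
    cleaned = xmlescape(unescape(text))
    return cleaned.encode(codec)


# Made-up sample value with smart quotes, an accented char and some html.
raw = "an \u2018utter failure\u2019? Jean Chr\u00e9tien &amp; <b>more</b>".encode("utf-8")
result = escape_metadata_value(raw)
print(result.decode("utf-8"))
```

The non-ascii characters pass through untouched, while the html markup ends up escaped so it is safe inside the opf xml. Whether the real routines can be wrapped this simply depends on what mobi_opf.py expects to receive and return, which this sketch does not try to answer.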

Quote:

Originally Posted by nickredding

Kevin - attached is a file that generates this fault.

Thanks for the testcase.

Take care,

KevinH