MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

lglgaigogo · 08-02-2014, 02:06 AM

Quote:

Originally Posted by KevinH

I'm out of town and out of touch for the next 10 days or so. When I get back, I will download the dictionary and try to reproduce the issue. ORDT sections like that represent a byte mapping of one character encoding into another, typically multi-byte. I have seen this issue in some sample ebooks. It is caused by the generating machine using a strange charset like 65002 versus the more typical 65001 (utf-8).

If you want to play around, look in the mobi_index.py file for the strings horde and ORDT. As the code comments say, there are two ORDT tables provided, but we could only figure out what the second one was for. We may now be able to figure it out from your testcase.

KevinH

Thank you for paying attention to my issue. I am now trying to understand the encoding pattern for non-western characters.

So far I have figured out:

1. Every character has a 2-byte index.
2. For western letters the index has the form 00 XX. For example, 'a' is 00 03 and 'b' is 00 64, and the character is found by looking up the ORDT table:
ORDT[3*2+1] is 'a'
ORDT[64*2+1] is 'b'
3. For non-western letters the index has the form XX XX. For example, '潘' is 6F 58, and in Python:

Code:

 print u"\u6F58"  # prints exactly the character '潘'
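
To make the pattern above concrete, here is a minimal Python 3 sketch of how one 2-byte index could be decoded under these observations. It is only my reading of the three points, not KindleUnpack's actual code: the function name decode_index and the toy_ordt table are made up for illustration, and the table holds placeholder data rather than bytes from a real dictionary.

```python
def decode_index(pair: bytes, ordt: bytes) -> str:
    """Decode one 2-byte index following the pattern observed in this post."""
    if len(pair) != 2:
        raise ValueError("expected exactly 2 bytes")
    hi, lo = pair
    if hi == 0x00:
        # Western letter (00 XX): the low byte selects an ORDT entry.
        # Assumption: entries are 2 bytes wide and the character sits in
        # the second byte, which is what ORDT[3*2+1] == 'a' suggests.
        return chr(ordt[lo * 2 + 1])
    # Non-western letter (XX XX): read the two bytes as a big-endian
    # Unicode code point, e.g. 6F 58 -> U+6F58.
    return chr((hi << 8) | lo)


# Toy table (NOT real data): 256 entries of 2 bytes, entry 3 holds 'a'.
_table = bytearray(2 * 256)
_table[3 * 2 + 1] = ord("a")
toy_ordt = bytes(_table)

a = decode_index(bytes([0x00, 0x03]), toy_ordt)
pan = decode_index(bytes([0x6F, 0x58]), toy_ordt)
print(a, "U+%04X" % ord(pan))  # prints: a U+6F58
```

The code point is printed in U+XXXX form so the example does not depend on the console being able to display '潘'. Whether the real first ORDT table follows the same layout is still an open question.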