Quote:
Originally Posted by KevinH
I'm out of town and out of touch for the next 10 days or so. When I get back, I will download the dictionary and try to reproduce the issue. ORDT sections like that represent a byte mapping of one character encoding into another, typically multi-byte. I have seen this issue in some sample ebooks. It is caused by the generating machine using a strange charset like 65002 versus the more typical 65001 (utf-8).
If you want to play around, look in the mobi_index.py file for the strings horde and ORDT. As the code comments say, there are two ORDT provided but we could only figure out what the second one was for. We may now figure it out from your testcase.
KevinH
Thank you for looking into my issue. I am now trying to understand the encoding pattern for non-Western characters.
Thank you.
So far, I have figured out:
1. Every character has a 2-byte index.
2. For Western letters it should look like 00 XX. For example, 'a' is 00 03 and 'b' is 00 64, which are looked up in the ORDT table:
ORDT[3*2+1] is 'a'
ORDT[64*2+1] is 'b'
3. For non-Western letters it should look like XX XX. For example, '潘' is 6F 58, and in Python:
Code:
print(u"\u6F58")  # prints exactly the character '潘'
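Putting the three points together, here is a rough sketch of how I think the lookup might work. This is only my guess at the pattern, not the real code in mobi_index.py. It is written for Python 3, and the function name decode_key and the toy ORDT table are made up for illustration; a real ORDT section would come from the dictionary file.

Code:
```python
import struct

def decode_key(raw, ordt):
    """Decode 2-byte big-endian index values using the guessed pattern."""
    chars = []
    for (value,) in struct.iter_unpack(">H", raw):
        if value < 0x100:
            # High byte is 00: use the low byte as an index into ORDT
            # (assumed layout: 2 bytes per entry, character in the second byte).
            chars.append(chr(ordt[value * 2 + 1]))
        else:
            # Otherwise treat the 2-byte value as a Unicode code point.
            chars.append(chr(value))
    return "".join(chars)

# Toy ORDT table (made up): entry 3 maps to 'a', as in the example above.
toy_ordt = bytearray(8)
toy_ordt[3 * 2 + 1] = ord("a")

# 00 03 should give 'a' through ORDT, and 6F 58 should give '潘' directly.
result = decode_key(bytes.fromhex("00036F58"), toy_ordt)
print(result)
```

If the guess is right, the same function should decode a whole index entry once the real ORDT bytes are passed in instead of the toy table.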