Old 08-02-2014, 02:06 AM   #944
lglgaigogo
Device: kindle paper white
Quote:
Originally Posted by KevinH
I'm out of town and out of touch for the next 10 days or so. When I get back, I will download the dictionary and try to reproduce the issue. ORDT sections like that represent a byte mapping of one character encoding into another, typically multi-byte. I have seen this issue in some sample ebooks. It is caused by the generating machine using a strange charset like 65002 versus the more typical 65001 (utf-8).

If you want to play around, look in the mobi_index.py file for the strings horde and ORDT. As the code comments say, there are two ORDT sections provided, but we could only figure out what the second one was for. We may now figure it out from your testcase.

KevinH
Thank you for paying attention to my issue. I am now trying to understand the encoding pattern for non-Western characters.

So far, I have figured out:

1. Every character has a 2-byte index.
2. For Western letters the index looks like 00 XX. For example, 'a' is 00 03 and 'b' is 00 64; look these up in the ORDT table:
ORDT[3*2+1] is 'a'
ORDT[64*2+1] is 'b'
3. For non-Western characters the index looks like XX XX. For example, '潘' is 6F 58, and in Python:
Code:
 print(u"\u6F58")  # prints exactly the character '潘'
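
The three observations above can be put together in a rough Python 3 sketch. This is only my reading of the pattern, not the code from mobi_index.py: the function name decode_indexed_text and the small table built below are made up for illustration, and I am assuming that the index bytes are big-endian and that 03 and 64 are the raw byte values as they appear in the data.

```python
import struct


def decode_indexed_text(raw, ordt):
    """Decode 2-byte big-endian indices using the pattern described in the post.

    raw  -- bytes whose length is a multiple of 2 (one index per character)
    ordt -- the ORDT table as a bytes-like object (assumed 2 bytes per entry)
    """
    chars = []
    for (value,) in struct.iter_unpack(">H", raw):
        high, low = value >> 8, value & 0xFF
        if high == 0:
            # Western letter (00 XX): look up byte XX*2+1 of the ORDT table.
            chars.append(chr(ordt[low * 2 + 1]))
        else:
            # Non-Western character (XX XX): the value is the code point itself.
            chars.append(chr(value))
    return "".join(chars)


# Hypothetical table, filled in only to mirror the two examples from the post.
ordt = bytearray(512)
ordt[0x03 * 2 + 1] = ord("a")
ordt[0x64 * 2 + 1] = ord("b")

sample = bytes([0x00, 0x03, 0x00, 0x64, 0x6F, 0x58])
decoded = decode_indexed_text(sample, ordt)
print(ascii(decoded))  # ascii() avoids console encoding problems with '潘'
```

With a real dictionary the table would of course come from the ORDT section of the file instead of being filled in by hand, and the rule for the 00 XX case may turn out to be more complicated than this.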

Last edited by lglgaigogo; 08-04-2014 at 10:38 AM.