MobileRead Forums - View Single Post

pholy · 05-17-2012, 03:01 PM

The web site appears to be using UTF-8 encoding, so there are rules about which values can occur in each position of a multi-byte sequence. The rules exist so that you can always find the start of a multi-byte sequence, even if you get plopped down in the middle of a file.
According to Table 3-6, in Section 3.9 of the Unicode book (The Unicode Standard, available from www.unicode.org as a set of PDF files), the first (and only) byte of a single-byte code must start with a zero bit, i.e. 0xxxxxxx. For a two-byte code, the first byte is 110yyyyy and the second is 10xxxxxx. For a three-byte code, the first byte is 1110zzzz, the second is 10yyyyyy, and the third is 10xxxxxx. The 16-bit code value is formed by concatenating the zzzz as the high-order bits, then the y's and finally the x's, with leading zeroes of course. The chart in the book shows it better.
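To make the bit patterns concrete, here is a rough Python sketch of the rules described above. The function names (classify, find_start, decode3) are made up for illustration, and it only covers the one-, two- and three-byte forms mentioned in this post, so treat it as a sketch and not a full UTF-8 decoder.

```python
def classify(byte):
    """Return the role of one byte, judged only by its leading bits."""
    if (byte & 0b10000000) == 0b00000000:
        return "single"        # 0xxxxxxx
    if (byte & 0b11000000) == 0b10000000:
        return "continuation"  # 10xxxxxx
    if (byte & 0b11100000) == 0b11000000:
        return "lead2"         # 110yyyyy
    if (byte & 0b11110000) == 0b11100000:
        return "lead3"         # 1110zzzz
    return "other"             # longer forms and invalid bytes, not covered here


def find_start(data, pos):
    """Back up from pos over continuation bytes to the start of the sequence."""
    while pos > 0 and classify(data[pos]) == "continuation":
        pos -= 1
    return pos


def decode3(b0, b1, b2):
    """Concatenate zzzz, yyyyyy and xxxxxx into one 16-bit code value."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


# The euro sign, U+20AC, is the three bytes E2 82 AC in UTF-8.
euro = "\u20ac".encode("utf-8")
print([classify(b) for b in euro])   # one lead3 byte, then two continuation bytes
print(hex(decode3(*euro)))           # 0x20ac
print(find_start(b"a" + euro, 3))    # landing on the last byte backs up to index 1
```

Because a continuation byte always starts with 10 and no lead byte does, find_start never has to look at more than a few bytes to resynchronize.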

You might need a hex editor to find that byte pair, because most text editors deal in lines; but a good Unicode editor should point out the problem when it opens the file. Then you can fix it however you choose.
I'm not sure why it is using a 'big5' codec; that's usually used for Traditional Chinese texts.
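If a hex editor isn't handy, a few lines of Python can report where the bad bytes sit. This is only a sketch: find_bad_utf8 is a made-up helper name, and it relies on the start and end attributes that Python's UnicodeDecodeError carries when a strict UTF-8 decode fails.

```python
def find_bad_utf8(data):
    """Return a list of (offset, offending bytes) where data is not valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")   # strict decoding raises on the first bad byte
            break                        # the rest of the data is clean
        except UnicodeDecodeError as err:
            start = pos + err.start      # err.start/err.end are relative to the slice
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end                    # resume just past the bad bytes
    return problems


# 0xE9 is a three-byte lead, but 'd' is not a continuation byte.
sample = b"abc\xe9def"
for offset, bad in find_bad_utf8(sample):
    print(offset, bad.hex())
```

For a real file you would read the bytes in binary mode (open the file with mode "rb") and pass them in; the offsets it prints are the positions to jump to in a hex editor.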

edit:
Dang! Beaten to the punch. So much for the long-winded (and still incomplete) explanation.

Last edited by pholy; 05-17-2012 at 03:03 PM.