MobileRead Forums - View Single Post - Character encoding, hex, emdash, and the meaning of life.

Starson17 · 08-15-2011, 10:56 AM

Quote:

Originally Posted by kovidgoyal

That looks like your HTML file has a mix of encodings. Some program in the past converted a single byte encoding to UTF-16 by blindly copying the single byte into the lower UTF-16 byte.

I'd guess the encoding for your file is cp1252 once you strip out all 0 bytes.

I played around with stripping null bytes, but ultimately, what worked best was to hex edit the file and replace 0x97 0x00 with 0x14 0x20 (once I got the byte ordering correct). 0x2014 is the UTF-16 emdash code, stored little endian as 0x14 0x20, and that worked fine while keeping the null bytes, even though everything in the text itself is marked as UTF-8.
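For anyone following along, here is a minimal Python sketch of both approaches: the hex-edit replacement and the strip-the-nulls suggestion from the quote. The sample bytes and the function names `patch_emdash` and `strip_nulls_cp1252` are made up for illustration; they are not taken from the actual file or from Calibre.

```python
# Hypothetical sample, not from the real book: the cp1252 bytes for "a",
# 0x97 (the cp1252 emdash) and "b", each widened with a zero byte, behind
# the 0xFF 0xFE marker found at the start of the file.
raw = b"\xff\xfe" + b"a\x00" + b"\x97\x00" + b"b\x00"


def patch_emdash(data: bytes) -> bytes:
    """Replace the aligned pair 0x97 0x00 with 0x14 0x20 (U+2014, little endian)."""
    out = bytearray()
    # Step through two bytes at a time so a match never straddles two characters.
    for i in range(0, len(data), 2):
        pair = data[i:i + 2]
        out += b"\x14\x20" if pair == b"\x97\x00" else pair
    return bytes(out)


def strip_nulls_cp1252(data: bytes) -> str:
    """Drop the leading 0xFF 0xFE marker and all zero bytes, then decode as cp1252."""
    body = data[2:] if data.startswith(b"\xff\xfe") else data
    return body.replace(b"\x00", b"").decode("cp1252")


# The "utf-16" codec reads the 0xFF 0xFE marker, picks little endian, and drops it.
patched_text = patch_emdash(raw).decode("utf-16")
stripped_text = strip_nulls_cp1252(raw)
```

On this sample both routes give the same three characters. Note that stripping nulls only makes sense on the unpatched file: once a pair like 0x14 0x20 is in there, removing zero bytes no longer lines up with one byte per character.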

I also tried inserting 0xE2 0x80 0x94, which is the UTF-8 emdash code, but I couldn't figure out how to change the other bytes to get them correct.
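A rough sketch of why the "other bytes" matter: in UTF-8 the plain ASCII characters take one byte each, so every null byte would have to go as well. Rather than hand-editing, the whole file can be decoded and re-encoded. This assumes the file is already consistent UTF-16 (that is, the 0x97 0x00 pairs have been fixed), and `utf16_to_utf8` is just an illustrative name.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Decode BOM-prefixed UTF-16 and re-encode the same text as UTF-8."""
    text = data.decode("utf-16")   # consumes the 0xFF 0xFE marker
    return text.encode("utf-8")    # no marker is written


# Hypothetical sample: "a", emdash, "b" as little-endian UTF-16 with a marker.
sample = b"\xff\xfe" + b"a\x00" + b"\x14\x20" + b"b\x00"
converted = utf16_to_utf8(sample)
```

Here the emdash becomes the three bytes 0xE2 0x80 0x94 and the two ASCII letters shrink to one byte each.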

I'm still not totally clear how this file was treated by Calibre: as UTF-8 or UTF-16? The file has two bytes per character, stored little endian. The first two bytes in the file are 0xFF 0xFE, which IIRC is the byte order mark for little-endian UTF-16. I think my changes made it conform to UTF-16, while the declarations inside the files (the HTML header and the CSS file) state that it is encoded as UTF-8.
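As an illustration of what those first two bytes can tell a program, here is a small byte-order-mark check using the constants in Python's standard `codecs` module. This is only a sketch of the general idea; I am not claiming it is how Calibre decides, and `sniff_bom` is a made-up name.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name suggested by a leading byte order mark, if any."""
    # UTF-32 is checked first because its little-endian mark (FF FE 00 00)
    # begins with the same two bytes as the UTF-16 little-endian mark.
    candidates = (
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None


guess = sniff_bom(b"\xff\xfe" + b"h\x00" + b"i\x00")
```

For a file starting with 0xFF 0xFE followed by ordinary two-byte characters, the check points at little-endian UTF-16, whatever the header inside the file says.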

Did Calibre recognize the file as UTF-16 and handle it as such? Or are UTF-8 and UTF-16 somehow the same for these characters? IOW, are there duplicate character encodings in UTF-8, such that 0x2014 is an emdash in both UTF-8 and UTF-16, and the two-byte pairs (null plus ASCII) are also a valid encoding in UTF-8?

I was unable to find any characters other than the emdash that needed special handling.

(Based on the comments on stripping out the null bytes, I assume that UTF-8 will usually not use two bytes for normal ASCII characters, but I've also read that to enhance adoption of UTF-8 there are lots of duplicate encodings.)
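These questions can be poked at directly with Python's built-in codecs. The sketch below only shows what the standard decoders do with the byte sequences mentioned in this post; the helper name `is_valid_utf8` is made up, and none of this says anything about Calibre's internals.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


emdash = "\u2014"
utf8_emdash = emdash.encode("utf-8")        # three bytes
utf16_emdash = emdash.encode("utf-16-le")   # two bytes, low byte first

# The pair 0x14 0x20 is accepted by the UTF-8 decoder, but as two separate
# characters (a control character and a space), not as an emdash.
pair_as_utf8 = b"\x14\x20".decode("utf-8")

# An ASCII letter followed by a zero byte is also accepted by the UTF-8
# decoder, again as two characters: the letter plus a NUL character.
ascii_null_as_utf8 = b"A\x00".decode("utf-8")
ascii_null_as_utf16 = b"A\x00".decode("utf-16-le")

# A lone 0x97 is rejected by the UTF-8 decoder, and so is the two-byte
# "long form" of the letter A (0xC1 0x81): ASCII has one encoding only.
lone_97_ok = is_valid_utf8(b"\x97\x00")
overlong_a_ok = is_valid_utf8(b"\xc1\x81")
```

So with these decoders the emdash has different bytes in the two encodings, and the null-padded ASCII pairs decode under UTF-8 without error but carry a NUL character after every letter.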

Does anyone want to help clarify this?