MobileRead Forums - View Single Post - Character encoding, hex, emdash, and the meaning of life.

Starson17 · 08-13-2011, 02:01 PM

I've got an EPUB that is allegedly UTF-8 encoded. I see that in the html. When viewed in a hex editor, the characters in each html file are separated by null (0x00). An ASCII "3" in the file appears as 00 33 (hex).

The file displays OK. Even the smart quotes appear fine. It's just the emdash (0x97), which appears in the file as 00 97, that doesn't display. (There may be others, but I can't find them.)
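For anyone who wants to poke at those two bytes, here is a minimal Python sketch. It assumes the 00 xx pairs are big-endian 16-bit code units, which is only a guess from the hex dump, and the names `raw`, `as_utf16` and `as_cp1252` are just illustrative. It compares what 00 97 decodes to under that guess with what a lone 0x97 byte means in Windows-1252.

```python
import unicodedata

# The two bytes seen in the hex editor where the emdash should be.
raw = bytes([0x00, 0x97])

# Guess 1: treat the pair as one big-endian 16-bit code unit.
as_utf16 = raw.decode("utf-16-be")
print(f"U+{ord(as_utf16):04X}", unicodedata.category(as_utf16))

# Guess 2: treat 0x97 on its own as a Windows-1252 (cp1252) byte.
as_cp1252 = bytes([0x97]).decode("cp1252")
print(f"U+{ord(as_cp1252):04X}", unicodedata.name(as_cp1252))
```

Under the first guess the result is U+0097, which Unicode classes as a control character (category Cc) and has no visible glyph. Under the second, 0x97 maps to U+2014, the em dash. That would fit a file where a cp1252 byte value was widened to 16 bits without being translated, but that is a hypothesis, not something the hex dump alone proves.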

When the epub is opened, the Calibre viewer ignores the emdash. Nothing is displayed at that point, so wordsareconcatenatedlikethis. When I view it in my Android readers, one program places the unknown character symbol there, another ignores it like the Calibre viewer.

So I'd like some expert comments. First, from the hex, the labeled character encoding of UTF-8 looks wrong to me. I thought UTF-8 was variable length and used single bytes for ASCII characters, not double bytes with every ASCII byte preceded by a null? Comments?
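To make the byte-level difference concrete, here is a small Python sketch (the helper `hex_bytes` is a made-up name for illustration) that prints how the same characters are stored under UTF-8 and under big-endian UTF-16.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

# ASCII "3": one byte in UTF-8, two bytes (leading null) in UTF-16-BE.
print(hex_bytes("3", "utf-8"))           # 33
print(hex_bytes("3", "utf-16-be"))       # 00 33

# Em dash (U+2014): three bytes in UTF-8, two bytes in UTF-16-BE.
print(hex_bytes("\u2014", "utf-8"))      # E2 80 94
print(hex_bytes("\u2014", "utf-16-be"))  # 20 14
```

The 00 33 pattern matches the 16-bit form rather than UTF-8, and a correctly encoded em dash in that form would appear as 20 14, not 00 97.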

I know there are some older fixed-length character encodings, such as UCS-2. Can anyone comment on the right encoding I should try? I've tried UTF-8, UTF-16, and various CP (code page) encodings, but I can't seem to get the emdash to display. So what encoding is this?

Also, is there a list somewhere of encodings I can specify that Calibre recognizes?
