Character encoding, hex, emdash, and the meaning of life.

Starson17 · 08-13-2011, 02:01 PM

I've got an EPUB that is allegedly UTF-8 encoded. I see that in the html. When viewed in a hex editor, the characters in each html file are separated by null (0x00). An ASCII "3" in the file appears as 00 33 (hex).

The file displays OK. Even the smart quotes appear fine. It's just the emdash 0x97, which in the file appears as 00 97, that doesn't display. (There may be others, but I can't find them.)

When the epub is opened, the Calibre viewer ignores the emdash. Nothing is displayed at that point, so wordsareconcatenatedlikethis. When I view it in my Android readers, one program places the unknown character symbol there, another ignores it like the Calibre viewer.

So I'd like some expert comments. First, from the hex, the labeled character encoding of UTF-8 looks wrong to me. I thought UTF-8 was variable length and used single bytes for ASCII characters, not double bytes with every ASCII byte preceded by a null? Comments?
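For reference, the difference described here can be checked directly. The following is a minimal sketch using only Python's built-in codecs; the variable names are illustrative and not taken from the thread. It shows that UTF-8 stores an ASCII "3" as one byte, while UTF-16 (little endian) stores it as two bytes with a null.

```python
# Compare how the same character is laid out in UTF-8 and UTF-16 (little endian).
text = "3"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

print(utf8_bytes.hex(" "))   # 33
print(utf16_bytes.hex(" "))  # 33 00

# Decoding the two-byte form as UTF-8 does not raise an error, but it yields
# two characters: "3" followed by a NUL character.
as_utf8 = utf16_bytes.decode("utf-8")
print(len(as_utf8))          # 2
```

So a null after every ASCII byte points to a 16-bit encoding rather than UTF-8, whatever the declaration in the html says.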

I know there are some old character encodings that are fixed length - UCS-2. Can anyone comment on the right encoding I should try? I've tried UTF-8, 16, and various CP encodings, but I can't seem to get the emdash to display. So what encoding is this?

Also, is there a list somewhere of encodings I can specify that Calibre recognizes?

kovidgoyal · 08-13-2011, 02:06 PM

http://docs.python.org/library/codec...dard-encodings

That looks like your HTML file has a mix of encodings. Some program in the past converted a single byte encoding to UTF-16 by blindly copying the single byte into the lower UTF-16 byte.

I'd guess the encoding for your file is cp1252 once you strip out all 0 bytes.
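The suggestion above can be sketched roughly as follows. This is only an illustration under the assumption that every character in the file has a zero high byte; `strip_nulls_and_decode` is a hypothetical helper name, not a calibre function.

```python
import codecs


def strip_nulls_and_decode(raw: bytes) -> str:
    """Hypothetical helper: drop a UTF-16 LE BOM and all 0x00 bytes, then decode as cp1252."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        raw = raw[len(codecs.BOM_UTF16_LE):]
    # Assumption: every 16-bit unit has a zero high byte, so removing the
    # nulls leaves exactly one byte per character.
    single_byte = raw.replace(b"\x00", b"")
    return single_byte.decode("cp1252")


# BOM, then "a", the stray 0x97, and "b", each padded with a null byte.
sample = b"\xff\xfe" + b"a\x00" + b"\x97\x00" + b"b\x00"
result = strip_nulls_and_decode(sample)
print(result == "a\u2014b")  # True: cp1252 byte 0x97 is the em dash U+2014
```

If the file also contained genuine UTF-16 characters above U+00FF, stripping nulls would corrupt them, so this only fits files damaged in the way described.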

Starson17 · 08-15-2011, 10:56 AM

Quote:

Originally Posted by kovidgoyal

That looks like your HTML file has a mix of encodings. Some program in the past converted a single byte encoding to UTF-16 by blindly copying the single byte into the lower UTF-16 byte.

I'd guess the encoding for your file is cp1252 once you strip out all 0 bytes.

I played around with stripping null bytes, but ultimately, what worked best was to hex edit replace 0x97 0x00 with 0x14 0x20 (when I got the byte ordering correct). 0x20 0x14 is the UTF-16 emdash code, and that worked fine, while keeping the null bytes, even though all the code was marked in the text as UTF-8.

I also tried inserting 0xE2 0x80 0x94, which is the UTF-8 emdash code, but I couldn't figure out how to change the other bytes to get them correct.
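The byte sequences mentioned in this post can be reproduced with Python's standard codecs. This is a small sketch for checking them, with `EM_DASH` as an illustrative name.

```python
# U+2014 is the em dash; print its byte sequence in each encoding discussed.
EM_DASH = "\u2014"

print(EM_DASH.encode("utf-16-le").hex(" "))  # 14 20
print(EM_DASH.encode("utf-16-be").hex(" "))  # 20 14
print(EM_DASH.encode("utf-8").hex(" "))      # e2 80 94
print(EM_DASH.encode("cp1252").hex(" "))     # 97
```

This also suggests why pasting E2 80 94 into the file did not work: a UTF-16 reader consumes bytes two at a time, so three UTF-8 bytes shift every following character out of alignment.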

I'm still not totally clear how this file was treated by Calibre - UTF-8 or 16? The file has two bytes per character, stored little endian. The first two bytes in the file are 0xFF 0xFE, which, IIRC, marks it as Unicode. I think my changes made it conform to UTF-16, while all the declarations in the files, such as the html header and the css file, state it is encoded as UTF-8.

Did Calibre recognize the file as UTF-16 and handle it as such? Or are UTF-8 and 16 somehow the same for these characters? IOW, are there duplicate character encodings in UTF-8 such that 2014 is also an emdash in both UTF-8 and UTF-16 and the two bytes (null plus ASCII) are also a valid encoding in UTF-8?

I was unable to find any characters other than the emdash that needed special handling.

(Based on the comments on stripping out the null bytes, I assume that UTF-8 will usually not use two bytes for normal ASCII bytes, but I've also read that to enhance adoption of UTF-8 there are lots of duplicate encodings)

Does anyone want to help clarify this?

kovidgoyal · 08-15-2011, 11:33 AM

UTF-8 does not use null bytes. That file is almost certainly being processed by calibre as UTF-16.

Starson17 · 08-15-2011, 01:53 PM

Quote:

Originally Posted by kovidgoyal

UTF-8 does not use null bytes. That file is almost certainly being processed by calibre as utf-16

Thanks. That makes sense, then. AFAICT, it is perfect UTF-16 now, but I just wasn't sure what happened when Calibre saw all the UTF-8 encoding declarations in a UTF-16 encoded file. It now displays correctly in Calibre's viewer, and in all my various Android ebook readers.

I'm just guessing, but I suspect that they see the leading 0xFF 0xFE bytes in the file, then the null bytes and say: "Aha! UTF-16", despite the declarations of UTF-8.

kovidgoyal · 08-15-2011, 01:59 PM

When detecting encodings in html files, calibre respects the BOM (byte order mark) over declared encodings. So if your html files start with a UTF-16 BOM, the encoding used will be UTF-16.
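The "BOM wins over the declaration" rule can be illustrated with a short sketch. This is not calibre's actual code; `sniff_encoding` and `_BOMS` are hypothetical names, and only the BOM constants from Python's `codecs` module are real.

```python
import codecs
from typing import Optional

# BOMs checked by this sketch. UTF-32 BOMs are left out for brevity; a fuller
# check would have to test them first, since FF FE 00 00 starts with FF FE.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(raw: bytes, declared: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: return the BOM's encoding, else the declared one."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return declared


# The file from this thread: UTF-16 LE BOM, but declared as UTF-8.
print(sniff_encoding(b"\xff\xfe3\x00", declared="utf-8"))  # utf-16-le
print(sniff_encoding(b"<html>", declared="utf-8"))         # utf-8
```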

Starson17 · 08-15-2011, 02:40 PM

Quote:

Originally Posted by kovidgoyal

when detecting encodings in html files calibre respects the BOM (byte order mark) over declared encodings. So if your html files start with a UTF-16 BOM, the encoding used will be utf-16

Yes, the file starts with a BOM, and looking it up, the BOM 0xFF 0xFE means UTF-16 little endian, which is exactly how it is encoded. I can also see what I did wrong when I tried to switch to UTF-8 - I had the wrong BOM in the data stream. I should have changed 0xFF 0xFE to 0xEF 0xBB 0xBF.
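Rather than patching the BOM by hand, the whole file could be re-encoded. A minimal sketch, assuming the input is already valid UTF-16 with a BOM; `utf16_to_utf8_with_bom` is a hypothetical helper name.

```python
def utf16_to_utf8_with_bom(raw: bytes) -> bytes:
    """Hypothetical helper: re-encode BOM-prefixed UTF-16 bytes as UTF-8 with a BOM."""
    # The "utf-16" codec reads the BOM to choose the byte order and drops it.
    text = raw.decode("utf-16")
    # The "utf-8-sig" codec writes the EF BB BF signature before the text.
    return text.encode("utf-8-sig")


utf16_sample = b"\xff\xfe" + "a\u2014b".encode("utf-16-le")
converted = utf16_to_utf8_with_bom(utf16_sample)
print(converted.hex(" "))  # ef bb bf 61 e2 80 94 62
```

The nulls disappear and the em dash becomes three bytes, which is why changing only the BOM or only the em dash bytes in a hex editor is not enough to switch encodings.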

Again, thanks.

pietvo · 08-18-2011, 04:14 PM

Yes, the file apparently is encoded in UTF-16. However, 00 97 is not a valid code. You will have to replace it with the proper code (20 14). It seems somebody used the cp1252 code for emdash and thought it would function as a UTF-16 code. Such things happen a lot on MS systems (especially in email, BTW).
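The repair described here can also be done after decoding instead of in a hex editor. In UTF-16, 00 97 decodes to U+0097, which falls in the C1 control range (U+0080 to U+009F) and has no visible glyph. The sketch below reinterprets characters in that range as cp1252 bytes; `fix_c1_as_cp1252` is a hypothetical helper, and the approach assumes the file was damaged in exactly the way suggested in this thread.

```python
def fix_c1_as_cp1252(text: str) -> str:
    """Hypothetical helper: map C1 control characters to their cp1252 meaning."""
    fixed = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # byte is undefined in cp1252 (e.g. 0x81): leave it unchanged
        fixed.append(ch)
    return "".join(fixed)


# UTF-16 LE with BOM: "a", the stray 97 00 unit, "b".
broken = b"\xff\xfea\x00\x97\x00b\x00".decode("utf-16")
repaired = fix_c1_as_cp1252(broken)
print(repaired.encode("utf-16-le").hex(" "))  # 61 00 14 20 62 00
```

Unlike stripping every null byte, this leaves any genuine UTF-16 characters in the file untouched.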

Starson17 · 08-18-2011, 04:25 PM

Quote:

Originally Posted by pietvo

Yes, the file apparently is encoded in UTF-16. However, 00 97 is not a valid code. You will have to replace it with the proper code (20 14). It seems somebody used the cp1252 code for emdash and thought it would function as a UTF-16 code. Such things happen a lot on MS systems (especially in email, BTW).

You are a bit late, but thank you anyway. Yes, replacing 97 00 with 14 20 in the hex editor turned it into valid UTF-16. (The BOM indicated little endian byte order).



