08-13-2011, 02:01 PM | #1 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Character encoding, hex, emdash, and the meaning of life.
I've got an EPUB that is allegedly UTF-8 encoded; that's what the HTML declares. Viewed in a hex editor, though, the characters in each HTML file are separated by null bytes (0x00). An ASCII "3" in the file appears as 00 33 (hex).
The file displays OK. Even the smart quotes appear fine. It's just the em dash, 0x97, which appears in the file as 00 97, that doesn't display. (There may be others, but I can't find them.) When the EPUB is opened, the Calibre viewer ignores the em dash: nothing is displayed at that point, so wordsareconcatenatedlikethis. In my Android readers, one program shows the unknown-character symbol there; another ignores it like the Calibre viewer.
So I'd like some expert comments. First, from the hex, the declared encoding of UTF-8 looks wrong to me. I thought UTF-8 was variable length and used single bytes for ASCII characters, not double bytes with every ASCII byte preceded by a null. I know there are some old fixed-length character encodings, such as UCS-2. Can anyone comment on the right encoding I should try? I've tried UTF-8, UTF-16, and various CP encodings, but I can't get the em dash to display. So what encoding is this?
Also, is there a list somewhere of the encodings I can specify that Calibre recognizes? |
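To make the byte patterns in question concrete, here is a small Python sketch (not taken from the EPUB itself) that prints how an ASCII "3" and an em dash are stored under a few encodings. The `samples` dictionary and the `hex_bytes` helper are names made up for this illustration.

```python
# Illustration only: compare the stored bytes of two characters
# under several encodings. Names here are hypothetical helpers.
samples = {"digit 3": "3", "em dash": "\u2014"}


def hex_bytes(data: bytes) -> str:
    """Format bytes the way a hex editor would show them."""
    return " ".join(f"{b:02X}" for b in data)


for label, ch in samples.items():
    for codec in ("utf-8", "utf-16-le", "utf-16-be", "cp1252"):
        print(f"{label:8} {codec:10} {hex_bytes(ch.encode(codec))}")
```

A null byte next to every ASCII character only shows up in the two UTF-16 rows, and a lone 97 only shows up in the cp1252 row, which is consistent with a file that mixes the two.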
08-13-2011, 02:06 PM | #2 |
creator of calibre
Posts: 43,830
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
http://docs.python.org/library/codec...dard-encodings
That looks like your HTML file has a mix of encodings. Some program in the past converted a single-byte encoding to UTF-16 by blindly copying the single byte into the lower UTF-16 byte. I'd guess the encoding for your file is cp1252 once you strip out all the 0x00 bytes. |
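A minimal sketch of the "strip the zero bytes, then treat it as cp1252" idea, assuming every character in the file really is one byte plus a null. The function name `recover_cp1252` and the sample bytes are invented for the example. If the file also contains genuine UTF-16 characters above U+00FF (properly encoded smart quotes, say), this approach would mangle them.

```python
import codecs


def recover_cp1252(raw: bytes) -> str:
    """Drop a UTF-16 BOM and all null bytes, then decode as cp1252.

    Assumes each character was stored as a single cp1252 byte padded
    with 0x00; real UTF-16 characters above U+00FF would be damaged.
    """
    for bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        if raw.startswith(bom):
            raw = raw[len(bom):]
            break
    single_byte = raw.replace(b"\x00", b"")
    # Raises UnicodeDecodeError on the few bytes cp1252 leaves undefined.
    return single_byte.decode("cp1252")


# Made-up sample: BOM, then "a", a cp1252 em dash (0x97), "b",
# each padded with a null byte in little-endian order.
damaged = b"\xff\xfe" + b"a\x00\x97\x00b\x00"
text = recover_cp1252(damaged)
utf8 = text.encode("utf-8")  # bytes to write back out as real UTF-8
print(repr(text), utf8)
```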
08-15-2011, 10:56 AM | #3 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
I also tried inserting 0xE2 0x80 0x94, which is the UTF-8 em dash code, but I couldn't figure out how to change the other bytes to get them correct.
I'm still not totally clear how this file was treated by Calibre: as UTF-8 or UTF-16? The file has two bytes per character, stored little endian. The first two bytes in the file are 0xFF 0xFE, which IIRC marks it as Unicode. I think my changes made it conform to UTF-16, while the declarations in the files (the HTML header and the CSS file) state that it is encoded as UTF-8. Did Calibre recognize the file as UTF-16 and handle it as such? Or are UTF-8 and UTF-16 somehow the same for these characters? IOW, are there duplicate character encodings in UTF-8, such that 2014 is an em dash in both UTF-8 and UTF-16, and the two bytes (null plus ASCII) are also a valid encoding in UTF-8?
I was unable to find any characters other than the em dash that needed special handling. (Based on the comments on stripping out the null bytes, I assume that UTF-8 will usually not use two bytes for normal ASCII characters, but I've also read that, to enhance adoption of UTF-8, there are lots of duplicate encodings.) Does anyone want to help clarify this? |
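The "duplicate encodings" question can be checked directly in Python. The sketch below uses made-up variable names and shows that UTF-8 reads the bytes 00 33 as two characters (NUL and "3"), that UTF-16 reads them as one, and that a padded ("overlong") two-byte form of "3" is rejected by the UTF-8 decoder. It also shows that the em dash has different bytes in the two encodings.

```python
# Illustration only: how the same bytes are read under UTF-8 and UTF-16.
pair = b"\x00\x33"

as_utf8 = pair.decode("utf-8")       # two characters: NUL, then "3"
as_utf16 = pair.decode("utf-16-be")  # one character: "3"
print(len(as_utf8), len(as_utf16))

# An overlong two-byte form of "3" (C0 B3) is not valid UTF-8.
try:
    b"\xc0\xb3".decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False
print("overlong accepted:", overlong_accepted)

# U+2014 has different byte sequences in each encoding.
dash_utf8 = "\u2014".encode("utf-8")        # three bytes: E2 80 94
dash_utf16le = "\u2014".encode("utf-16-le")  # two bytes: 14 20
print(dash_utf8.hex(), dash_utf16le.hex())
```

Since the UTF-8 form of the em dash is three bytes long, pasting it into a two-bytes-per-character UTF-16 file shifts everything after it by one byte, which may be why that attempt did not work.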
|
08-15-2011, 11:33 AM | #4 |
creator of calibre
Posts: 43,830
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
UTF-8 does not use null bytes (other than for the NUL character itself). That file is almost certainly being processed by calibre as UTF-16.
|
08-15-2011, 01:53 PM | #5 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
I'm just guessing, but I suspect that they see the leading 0xFF 0xFE bytes in the file, then the null bytes and say: "Aha! UTF-16", despite the declarations of UTF-8. |
|
08-15-2011, 01:59 PM | #6 |
creator of calibre
Posts: 43,830
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
When detecting encodings in HTML files, calibre respects the BOM (byte order mark) over declared encodings. So if your HTML files start with a UTF-16 BOM, the encoding used will be UTF-16.
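As a rough illustration of "BOM wins over the declared encoding", here is a sketch in Python. This is not calibre's actual code; `pick_encoding`, `decode_html`, and the `_BOMS` table are hypothetical names, and the `declared` argument stands in for whatever the HTML header claims.

```python
import codecs

# Longest BOMs first, so UTF-32 LE (FF FE 00 00) is not mistaken
# for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def pick_encoding(raw: bytes, declared: str = "utf-8"):
    """Return (encoding, body); a BOM overrides the declared encoding."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, raw[len(bom):]
    return declared, raw


def decode_html(raw: bytes, declared: str = "utf-8") -> str:
    name, body = pick_encoding(raw, declared)
    return body.decode(name)


# A file that declares UTF-8 but starts with FF FE is read as UTF-16 LE.
raw = b"\xff\xfe" + "<p>3</p>".encode("utf-16-le")
print(pick_encoding(raw, declared="utf-8")[0])
print(decode_html(raw, declared="utf-8"))
```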
|
08-15-2011, 02:40 PM | #7 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Again, thanks. |
|
08-18-2011, 04:14 PM | #8 |
Reader
Posts: 519
Karma: 24612
Join Date: Aug 2009
Location: Utrecht, NL
Device: Kobo Aura 2, iPhone, iPad
|
Yes, the file apparently is encoded in UTF-16. However, 00 97 is not an em dash: U+0097 is a control character, not a printable one. You will have to replace it with the proper code (20 14, i.e. U+2014). It seems somebody used the cp1252 code for em dash and thought it would function as a UTF-16 code. Such things happen a lot on MS systems (especially in email, BTW).
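One way to do that replacement in bulk, sketched in Python under the assumption that the file is otherwise valid UTF-16: decode it, then remap any character in the U+0080 to U+009F range through cp1252. The function name `fix_c1_controls` and the sample text are made up for the example.

```python
def fix_c1_controls(text: str) -> str:
    """Remap U+0080..U+009F through cp1252 (e.g. U+0097 -> U+2014).

    Assumes these control characters are really cp1252 bytes that were
    copied into UTF-16 unchanged. Characters outside the range are kept.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            try:
                ch = bytes([cp]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # a few bytes are undefined in cp1252; leave as-is
        out.append(ch)
    return "".join(out)


# Made-up sample: valid smart quotes plus a stray 0x97 "em dash".
raw = b"\xff\xfe" + "words\u0097like\u201cthis\u201d".encode("utf-16-le")
text = raw.decode("utf-16")  # the "utf-16" codec consumes the BOM
fixed = fix_c1_controls(text)
print(repr(fixed))
```

Unlike stripping the null bytes, this leaves characters that were already correct UTF-16 (such as the smart quotes) untouched.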
|
08-18-2011, 04:25 PM | #9 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|