| 
			
			 | 
		#1 | 
| 
			
			
			
			 Wizard 
			
			Posts: 4,004 
				Karma: 177841 
				Join Date: Dec 2009 
				
				
				
				Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T 
				
				
				 | 
	
	
	
		
		
			
			 
				
				Character encoding, hex, emdash, and the meaning of life.
			 
			
			
			I've got an EPUB that is allegedly UTF-8 encoded; that is what the HTML declares. When viewed in a hex editor, the characters in each HTML file are separated by null (0x00). An ASCII "3" in the file appears as 00 33 (hex). 
		
	
		
		
		
		
		
		
		
		
		
		
	
	The file displays OK. Even the smart quotes appear fine. It's just the emdash 0x97, which in the file appears as 00 97, that doesn't display. (There may be others, but I can't find them.) When the EPUB is opened, the Calibre viewer ignores the emdash. Nothing is displayed at that point, so wordsareconcatenatedlikethis. When I view it in my Android readers, one program places the unknown character symbol there, another ignores it like the Calibre viewer.

	So I'd like some expert comments. First, from the hex, the labeled character encoding of UTF-8 looks wrong to me. I thought UTF-8 was variable length and used single bytes for ASCII characters, not double bytes with every ASCII byte preceded by a null? Comments? I know there are some old character encodings that are fixed length, such as UCS-2.

	Can anyone comment on the right encoding I should try? I've tried UTF-8, UTF-16, and various CP encodings, but I can't seem to get the emdash to display. So what encoding is this?

	Also, is there a list somewhere of encodings I can specify that Calibre recognizes?
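To make the byte patterns described above concrete, here is a small Python sketch that prints how an ASCII "3" and an em dash (U+2014) are stored under a few candidate encodings. The encoding names are Python codec names. The choice of these four encodings is an assumption for illustration, not a diagnosis of the file.

```python
# How an ASCII "3" and an em dash (U+2014) look as bytes in several encodings.
samples = {"digit three": "3", "em dash": "\u2014"}
encodings = ["utf-8", "utf-16-le", "utf-16-be", "cp1252"]


def hex_bytes(text, encoding):
    """Return the encoded bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


# Map (sample name, encoding) -> hex string.
table = {
    (name, enc): hex_bytes(ch, enc)
    for name, ch in samples.items()
    for enc in encodings
}

for (name, enc), hexed in table.items():
    print(f"{name:12} {enc:10} {hexed}")
```

A two-byte "3" with a zero byte next to it only appears in the UTF-16 rows, and a lone 97 for the em dash only appears in the cp1252 row, which is consistent with a file that mixes the two.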
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#2 | 
| 
			
			
			
			 creator of calibre 
			
			Posts: 45,609 
				Karma: 28549044 
				Join Date: Oct 2006 
				Location: Mumbai, India 
				
				
				Device: Various 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			http://docs.python.org/library/codec...dard-encodings 
		
	
		
		
		
		
		
		
		
		
		
		
	
	That looks like your HTML file has a mix of encodings. Some program in the past converted a single byte encoding to UTF-16 by blindly copying the single byte into the lower UTF-16 byte. I'd guess the encoding for your file is cp1252 once you strip out all 0 bytes.  | 
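A minimal Python sketch of the "strip the 0 bytes, then treat as cp1252" idea follows. The function name `repair_mixed_utf16` and the sample bytes are hypothetical, and the approach assumes every character in the file really is a single cp1252 byte padded with a zero byte.

```python
import codecs


def repair_mixed_utf16(raw: bytes) -> str:
    """Recover text from data where cp1252 bytes were padded with NULs.

    Assumption: every character was a single cp1252 byte widened to two
    bytes with a zero byte, so dropping the zeros restores cp1252.
    """
    # Drop a UTF-16 byte order mark if one is present.
    for bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        if raw.startswith(bom):
            raw = raw[len(bom):]
            break
    stripped = raw.replace(b"\x00", b"")
    return stripped.decode("cp1252")


# Hypothetical sample: "a", cp1252 em dash (0x97), "b", each padded with 00.
sample = codecs.BOM_UTF16_LE + b"a\x00\x97\x00b\x00"
fixed = repair_mixed_utf16(sample)
print(fixed.encode("utf-8"))  # re-save as real UTF-8
```

If the file also contains genuine UTF-16 characters above U+00FF (properly encoded smart quotes, for example), blanket stripping would mangle them, so it is worth trying this on a copy first.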
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#3 | |
| 
			
			
			
			 Wizard 
			
			Posts: 4,004 
				Karma: 177841 
				Join Date: Dec 2009 
				
				
				
				Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
   0x20 0x14 is the UTF-16 emdash code, and that worked fine, while keeping the null bytes, even though all the code was marked in the text as UTF-8. I also tried inserting 0xE2 0x80 0x94, which is the UTF-8 emdash code, but I couldn't figure out how to change the other bytes to get them correct.

   I'm still not totally clear how this file was treated by Calibre: UTF-8 or UTF-16? The file has two bytes per character, stored little endian. The first two bytes in the file are 0xFF 0xFE, which IIRC marks it as Unicode. I think my changes made it conform to UTF-16, while all the text in the files, as in the HTML header and CSS file, states it is encoded as UTF-8.

   Did Calibre recognize the file as UTF-16 and handle it as such? Or are UTF-8 and UTF-16 somehow the same for these characters? IOW, are there duplicate character encodings in UTF-8 such that 2014 is an emdash in both UTF-8 and UTF-16, and the two bytes (null plus ASCII) are also a valid encoding in UTF-8?

   I was unable to find any characters other than the emdash that needed special handling. (Based on the comments on stripping out the null bytes, I assume that UTF-8 will usually not use two bytes for normal ASCII bytes, but I've also read that to enhance adoption of UTF-8 there are lots of duplicate encodings.) Does anyone want to help clarify this?
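On the "duplicate encodings" question, a short Python check may help. It is only a sketch using Python's built-in strict decoders; the variable names are made up for illustration.

```python
# 00 33 read as UTF-8 is two characters: NUL followed by "3".
as_utf8 = b"\x00\x33".decode("utf-8")

# The same two bytes read as big-endian UTF-16 are the single character "3".
as_utf16 = b"\x00\x33".decode("utf-16-be")

# An "overlong" two-byte form of "3" (C0 B3) is rejected by a strict decoder,
# so there is no second valid UTF-8 spelling of an ASCII character.
try:
    b"\xc0\xb3".decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False

# The em dash has one UTF-8 form, three bytes long.
em_dash_utf8 = "\u2014".encode("utf-8")

print(len(as_utf8), repr(as_utf16), overlong_accepted, em_dash_utf8.hex(" "))
```

So a null-plus-ASCII pair is legal UTF-8 only in the sense that it decodes to two separate characters, one of them NUL. It is not an alternative way to write the ASCII character.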
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#4 | 
| 
			
			
			
			 creator of calibre 
			
			Posts: 45,609 
				Karma: 28549044 
				Join Date: Oct 2006 
				Location: Mumbai, India 
				
				
				Device: Various 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			UTF-8 does not use null bytes (other than to encode the NUL character itself). That file is almost certainly being processed by calibre as UTF-16.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#5 | |
| 
			
			
			
			 Wizard 
			
			Posts: 4,004 
				Karma: 177841 
				Join Date: Dec 2009 
				
				
				
				Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 I'm just guessing, but I suspect that they see the leading 0xFF 0xFE bytes in the file, then the null bytes and say: "Aha! UTF-16", despite the declarations of UTF-8.  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#6 | 
| 
			
			
			
			 creator of calibre 
			
			Posts: 45,609 
				Karma: 28549044 
				Join Date: Oct 2006 
				Location: Mumbai, India 
				
				
				Device: Various 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			When detecting encodings in HTML files, calibre respects the BOM (byte order mark) over declared encodings. So if your HTML files start with a UTF-16 BOM, the encoding used will be UTF-16.
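The "BOM wins over the declaration" rule can be sketched in a few lines of Python. This is not calibre's actual code: `pick_encoding`, `decode_html` and the `_BOMS` table are hypothetical names, and UTF-32 BOMs are deliberately ignored to keep the example short.

```python
import codecs

# BOMs to look for, with the Python codec each implies.
# A hypothetical table for illustration; UTF-32 is not handled here.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def pick_encoding(raw: bytes, declared: str) -> str:
    """Return the BOM-implied encoding if a BOM is present, else the declared one."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return declared


def decode_html(raw: bytes, declared: str = "utf-8") -> str:
    """Decode raw HTML bytes, letting a BOM override the declared encoding."""
    text = raw.decode(pick_encoding(raw, declared))
    # utf-16-le / utf-16-be leave the BOM in place as U+FEFF; drop it.
    return text.lstrip("\ufeff")


# A file that starts with FF FE but claims to be UTF-8.
raw = b"\xff\xfe" + "<p>3</p>".encode("utf-16-le")
print(pick_encoding(raw, "utf-8"), decode_html(raw))
```

With a file like the one in this thread, the declared UTF-8 is never consulted because the FF FE prefix is seen first.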
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#7 | |
| 
			
			
			
			 Wizard 
			
			Posts: 4,004 
				Karma: 177841 
				Join Date: Dec 2009 
				
				
				
				Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 Again, thanks.  
		 | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#8 | 
| 
			
			
			
			 Reader 
			
			Posts: 520 
				Karma: 24612 
				Join Date: Aug 2009 
				Location: Utrecht, NL 
				
				
				Device: Kobo Aura 2, iPhone, iPad 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Yes, the file apparently is encoded in UTF-16. However, 00 97 is not the code for an emdash: in UTF-16 it is U+0097, a control character with no visible glyph. You will have to replace it with the proper code (20 14). It seems somebody used the cp1252 code for emdash and thought it would function as a UTF-16 code. Such things happen a lot on MS systems (especially in email, BTW).
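Rather than patching each 00 97 by hand in a hex editor, the replacement could be scripted. The following Python sketch decodes the file as UTF-16 and reinterprets any character in the U+0080 to U+009F range as the cp1252 byte of the same value. The helper name `fix_c1_controls` and the sample text are hypothetical.

```python
def fix_c1_controls(text: str) -> str:
    """Reinterpret C1 control characters (U+0080..U+009F) as cp1252 bytes."""
    out = []
    for ch in text:
        if "\u0080" <= ch <= "\u009f":
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in cp1252;
                # leave those characters untouched.
                pass
        out.append(ch)
    return "".join(out)


# Hypothetical sample: UTF-16 text where the em dash was stored as U+0097.
raw = "words\u0097like\u0097this".encode("utf-16")
text = raw.decode("utf-16")
fixed = fix_c1_controls(text)
print(fixed.encode("utf-8"))
```

This also catches any other cp1252 leftovers in that range, such as 0x85 (ellipsis) or 0x96 (en dash), while leaving characters that are already correct UTF-16 alone.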
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#9 | |
| 
			
			
			
			 Wizard 
			
			Posts: 4,004 
				Karma: 177841 
				Join Date: Dec 2009 
				
				
				
				Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 