08-15-2011, 10:56 AM   #3
Starson17
Quote:
Originally Posted by kovidgoyal
That looks like your HTML file has a mix of encodings. Some program in the past converted a single byte encoding to UTF-16 by blindly copying the single byte into the lower UTF-16 byte.

I'd guess the encoding for your file is cp1252 once you strip out all 0 bytes.

I played around with stripping null bytes, but what ultimately worked best was a hex-edit replacing 0x97 0x00 with 0x14 0x20 (once I got the byte order right). The em dash is 0x2014 in UTF-16, which is stored little-endian as 0x14 0x20. That worked fine with the null bytes left in place, even though the HTML and CSS both declare the encoding as UTF-8.
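For anyone who would rather script this than use a hex editor, here is a minimal Python sketch of the same replacement. It assumes the file really is two bytes per character, little-endian, with an optional 0xFF 0xFE prefix; the function name `fix_emdash_utf16le` is made up for illustration. It walks the data in aligned two-byte units so that a stray 0x97 followed by a 0x00 belonging to the next character is not replaced by accident.

```python
def fix_emdash_utf16le(data: bytes) -> bytes:
    """Replace the cp1252 em dash unit (0x97 0x00) with UTF-16-LE U+2014."""
    bom = b"\xff\xfe"
    # Skip the byte order mark, if present, so units stay aligned.
    start = len(bom) if data.startswith(bom) else 0
    out = bytearray(data[:start])
    for i in range(start, len(data) - 1, 2):
        unit = data[i:i + 2]
        out += b"\x14\x20" if unit == b"\x97\x00" else unit
    # Keep a trailing odd byte unchanged rather than silently dropping it.
    if (len(data) - start) % 2:
        out += data[-1:]
    return bytes(out)


if __name__ == "__main__":
    sample = b"\xff\xfe" + "a".encode("utf-16-le") + b"\x97\x00" + "b".encode("utf-16-le")
    print(repr(fix_emdash_utf16le(sample).decode("utf-16")))
```

Work on a copy of the file, since the replacement is done on raw bytes and cannot tell an intended 0x0097 control character from a leftover cp1252 em dash.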

I also tried inserting 0xE2 0x80 0x94, which is the UTF-8 encoding of the em dash, but I couldn't figure out how the surrounding bytes would have to change to make that work.
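A small Python sketch suggests why the three-byte UTF-8 sequence does not fit into a two-bytes-per-character file: it shifts everything after it by one byte, so the later byte pairs no longer line up. The variable names are only illustrative, and the sample text is invented.

```python
EM_DASH = "\u2014"

# The same character, encoded two different ways.
utf8_bytes = EM_DASH.encode("utf-8")        # three bytes
utf16le_bytes = EM_DASH.encode("utf-16-le")  # two bytes, low byte first

# Splice the UTF-8 bytes into UTF-16-LE text, as in the hex-edit attempt.
mixed = "a".encode("utf-16-le") + utf8_bytes + "b".encode("utf-16-le")

# Decoding as UTF-16-LE now pairs up the wrong bytes after the "a".
decoded = mixed.decode("utf-16-le", errors="replace")

print(utf8_bytes.hex(" "), "|", utf16le_bytes.hex(" "))
print(len(mixed), repr(decoded))
```

The odd total length is the giveaway: the UTF-8 form only makes sense if the whole file is converted to UTF-8, not just the one character.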

I'm still not totally clear on how Calibre treated this file: as UTF-8 or UTF-16? The file has two bytes per character, stored little-endian. The first two bytes in the file are 0xFF 0xFE, which IIRC is the byte order mark for little-endian UTF-16. I think my changes made the file conform to UTF-16, while the declarations inside it (the HTML header and the CSS file) state that it is encoded as UTF-8.
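The leading bytes can be checked with the byte order mark constants in Python's standard `codecs` module. This is only a sketch of BOM sniffing, not a description of what Calibre does internally; the helper name `sniff_bom` is hypothetical.

```python
import codecs

# UTF-32 is checked first because its little-endian BOM (FF FE 00 00)
# begins with the UTF-16 little-endian BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    print(sniff_bom(b"\xff\xfe<\x00h\x00t\x00m\x00l\x00>\x00"))
```

A BOM only says how the bytes are laid out; it can disagree with a `charset` declaration in the HTML header, which appears to be the situation with this file.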

Did Calibre recognize the file as UTF-16 and handle it as such? Or are UTF-8 and UTF-16 somehow the same for these characters? IOW, are there duplicate character encodings in UTF-8, such that 0x2014 is an em dash in both UTF-8 and UTF-16, and the two-byte pairs (ASCII plus null) are also valid UTF-8?

I was unable to find any characters other than the emdash that needed special handling.
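A scan like the following could double-check that. It is a sketch that assumes the same layout as above (two bytes per character, little-endian, optional 0xFF 0xFE prefix) and counts units whose high byte is zero and whose low byte falls in 0x80 to 0x9F, the range where cp1252 puts its curly quotes, dashes and similar punctuation. The name `find_cp1252_leftovers` is invented for this example.

```python
from collections import Counter


def find_cp1252_leftovers(data: bytes) -> Counter:
    """Count two-byte units that look like cp1252 bytes padded with a null."""
    start = 2 if data.startswith(b"\xff\xfe") else 0
    found = Counter()
    for i in range(start, len(data) - 1, 2):
        low, high = data[i], data[i + 1]
        if high == 0 and 0x80 <= low <= 0x9F:
            found[low] += 1
    return found


if __name__ == "__main__":
    sample = b"\xff\xfe" + b"\x93\x00" + "hi".encode("utf-16-le") + b"\x97\x00\x97\x00"
    for byte, count in sorted(find_cp1252_leftovers(sample).items()):
        # A few bytes in this range are undefined in cp1252, hence errors="replace".
        char = bytes([byte]).decode("cp1252", errors="replace")
        print(hex(byte), repr(char), count)
```

An empty result would support the observation that the em dash was the only character needing special handling.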

(Based on the comments about stripping out the null bytes, I assume UTF-8 normally uses a single byte for plain ASCII characters, but I've also read that, to encourage adoption of UTF-8, there are lots of duplicate encodings.)
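These questions can be tested directly in Python, whose UTF-8 decoder is strict. The sketch below uses a made-up helper, `try_utf8`, that returns the decoded text or `None` when the bytes are not valid UTF-8. As far as the standard goes, each ASCII character has exactly one UTF-8 encoding (a single byte), and longer "overlong" forms are rejected.

```python
def try_utf8(data: bytes):
    """Decode as strict UTF-8, returning None if the bytes are invalid."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    # ASCII plus a null byte is valid UTF-8, but it is TWO characters:
    # the letter followed by U+0000.
    print(repr(try_utf8(b"A\x00")))
    # 0x14 0x20 read as UTF-8 is a control character and a space,
    # not an em dash.
    print(repr(try_utf8(b"\x14\x20")))
    # A two-byte "overlong" form of the letter A is rejected.
    print(repr(try_utf8(b"\xc1\x81")))
    # The real UTF-8 em dash takes three bytes.
    print(repr(try_utf8(b"\xe2\x80\x94")))
```

So a file of ASCII-plus-null pairs can pass as UTF-8 without errors, but only as text with a null character after every letter, which is presumably why stripping the nulls was suggested.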

Does anyone want to help clarify this?