08-13-2011, 02:01 PM   #1
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Character encoding, hex, emdash, and the meaning of life.

I've got an EPUB that is allegedly UTF-8 encoded. I see that in the html. When viewed in a hex editor, the characters in each html file are separated by null (0x00). An ASCII "3" in the file appears as 00 33 (hex).

The file displays OK. Even the smart quotes appear fine. It's just the emdash 0x97, which in the file appears as 00 97, that doesn't display. (There may be others, but I can't find them.)

When the epub is opened, the Calibre viewer ignores the emdash. Nothing is displayed at that point, so wordsareconcatenatedlikethis. When I view it in my Android readers, one program places the unknown character symbol there, another ignores it like the Calibre viewer.

So I'd like some expert comments. First, from the hex, the labeled character encoding of UTF-8 looks wrong to me. I thought UTF-8 was variable length and used single bytes for ASCII characters, not double bytes with every ASCII byte preceded by a null? Comments?

I know there are some older fixed-length character encodings, such as UCS-2. Can anyone comment on the right encoding I should try? I've tried UTF-8, UTF-16, and various CP encodings, but I can't seem to get the emdash to display. So what encoding is this?

Also, is there a list somewhere of encodings I can specify that Calibre recognizes?
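For anyone following along, the byte patterns in question can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `encoded_hex` is made up for the example, and `bytes.hex(" ")` needs Python 3.8 or later.

```python
def encoded_hex(ch: str, encoding: str) -> str:
    """Return the bytes of `ch` in `encoding` as space-separated hex."""
    return ch.encode(encoding).hex(" ")

# Print how an ASCII digit and an em dash (U+2014) are stored in each encoding.
for label, ch in (("digit 3", "3"), ("em dash", "\u2014")):
    for encoding in ("utf-8", "utf-16-le", "utf-16-be", "cp1252"):
        print(f"{label:8} {encoding:10} {encoded_hex(ch, encoding)}")
```

UTF-8 stores the digit as a single byte, while both UTF-16 variants pad it with a null byte; which byte of the pair comes first depends on the byte order.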
08-13-2011, 02:06 PM   #2
kovidgoyal
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
http://docs.python.org/library/codec...dard-encodings

That looks like your HTML file has a mix of encodings. Some program in the past converted a single byte encoding to UTF-16 by blindly copying the single byte into the lower UTF-16 byte.

I'd guess the encoding for your file is cp1252 once you strip out all 0 bytes.
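A rough Python sketch of that guess, for illustration only: `blindly_widen` imitates the suspected conversion bug and `repair` undoes it. Both function names are hypothetical, and nothing here is taken from calibre or from the program that produced the file.

```python
import codecs

def blindly_widen(single_byte_data: bytes) -> bytes:
    """Mimic the suspected bug: pad every byte with 0x00 and label it UTF-16LE."""
    widened = b"".join(bytes([b, 0]) for b in single_byte_data)
    return codecs.BOM_UTF16_LE + widened

def repair(damaged: bytes) -> str:
    """Recover the original single bytes, then decode them as cp1252."""
    text = damaged.decode("utf-16")          # the BOM selects the byte order
    original_bytes = text.encode("latin-1")  # fails if any code point is above U+00FF
    return original_bytes.decode("cp1252")   # fails on bytes cp1252 leaves undefined

damaged = blindly_widen("words\u2014like this".encode("cp1252"))
print(damaged.hex(" "))
print(repair(damaged))
```

The `latin-1` step works because code points 0 to 255 map one-to-one onto byte values, which is the same thing as stripping the null bytes. It raises an error if the file also contains genuine UTF-16 characters, so it only suits a file that was damaged throughout.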
08-15-2011, 10:56 AM   #3
Starson17
Wizard
Quote:
Originally Posted by kovidgoyal View Post
That looks like your HTML file has a mix of encodings. Some program in the past converted a single byte encoding to UTF-16 by blindly copying the single byte into the lower UTF-16 byte.

I'd guess the encoding for your file is cp1252 once you strip out all 0 bytes.
I played around with stripping null bytes, but ultimately what worked best was to use the hex editor to replace 0x97 0x00 with 0x14 0x20 (once I got the byte ordering correct). U+2014 is the emdash code point, which little-endian UTF-16 stores as 0x14 0x20, and that worked fine while keeping the null bytes, even though everything in the text was marked as UTF-8.
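The same hex edit can be scripted. The sketch below assumes little-endian UTF-16 data, as in this file; `replace_code_unit` is a hypothetical helper name. It walks the data two bytes at a time so a match can never straddle two neighbouring characters, which a plain byte-level search and replace could do.

```python
def replace_code_unit(data: bytes, old: int, new: int) -> bytes:
    """Replace one 16-bit code unit with another in little-endian UTF-16 data."""
    out = bytearray(data)
    # Step through aligned byte pairs; a 2-byte BOM keeps the alignment intact.
    for i in range(0, len(out) - 1, 2):
        if int.from_bytes(out[i:i + 2], "little") == old:
            out[i:i + 2] = new.to_bytes(2, "little")
    return bytes(out)

sample = b"\xff\xfe" + "a\x97b".encode("utf-16-le")
fixed = replace_code_unit(sample, 0x0097, 0x2014)
print(fixed.hex(" "))
```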

I also tried inserting 0xE2 0x80 0x94, which is the UTF-8 emdash code, but I couldn't figure out how to change the other bytes to get them correct.
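For the UTF-8 route, patching bytes by hand is the hard way; decoding and re-encoding changes every byte at once. A minimal sketch, assuming the data starts with a UTF-16 BOM (the function name `utf16_to_utf8` is made up for this example):

```python
def utf16_to_utf8(data: bytes, with_bom: bool = False) -> bytes:
    """Decode BOM-prefixed UTF-16, fix the stray emdash, re-encode as UTF-8."""
    text = data.decode("utf-16")           # consumes the BOM and honours its byte order
    text = text.replace("\x97", "\u2014")  # U+0097 was meant to be an emdash
    # "utf-8-sig" prepends the UTF-8 BOM (EF BB BF); plain "utf-8" writes none.
    return text.encode("utf-8-sig" if with_bom else "utf-8")

source = b"\xff\xfe" + "3\x97".encode("utf-16-le")
print(utf16_to_utf8(source).hex(" "))
print(utf16_to_utf8(source, with_bom=True).hex(" "))
```

After this step the null bytes are gone and the UTF-8 declarations inside the files match the actual bytes.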

I'm still not totally clear how this file was treated by Calibre - UTF-8 or 16? The file has two bytes per character, stored little endian. The first two bytes in the file are 0xFF 0xFE, which IIRC, marks it as Unicode. I think my changes made it conform to UTF-16, while all the text in the files, as in the html header and css file states it is encoded UTF-8.

Did Calibre recognize the file as UTF-16 and handle it as such? Or are UTF-8 and 16 somehow the same for these characters? IOW, are there duplicate character encodings in UTF-8 such that 2014 is also an emdash in both UTF-8 and UTF-16 and the two bytes (null plus ASCII) are also a valid encoding in UTF-8?

I was unable to find any characters other than the emdash that needed special handling.

(Based on the comments on stripping out the null bytes, I assume that UTF-8 will usually not use two bytes for normal ASCII bytes, but I've also read that to enhance adoption of UTF-8 there are lots of duplicate encodings)

Does anyone want to help clarify this?
08-15-2011, 11:33 AM   #4
kovidgoyal
creator of calibre
UTF-8 does not use null bytes for ordinary text (only an actual NUL character encodes to 0x00). That file is almost certainly being processed by calibre as UTF-16.
08-15-2011, 01:53 PM   #5
Starson17
Wizard
Quote:
Originally Posted by kovidgoyal View Post
UTF-8 does not use null bytes for ordinary text (only an actual NUL character encodes to 0x00). That file is almost certainly being processed by calibre as UTF-16.
Thanks. That makes sense, then. AFAICT, it is perfect UTF-16 now, but I just wasn't sure what happened when Calibre saw all the UTF-8 encoding declarations in a UTF-16 encoded file. It now displays correctly in Calibre's viewer, and in all my various Android ebook readers.

I'm just guessing, but I suspect that they see the leading 0xFF 0xFE bytes in the file, then the null bytes and say: "Aha! UTF-16", despite the declarations of UTF-8.
08-15-2011, 01:59 PM   #6
kovidgoyal
creator of calibre
When detecting encodings in HTML files, calibre respects the BOM (byte order mark) over declared encodings. So if your HTML files start with a UTF-16 BOM, the encoding used will be UTF-16.
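The general idea of BOM-first detection can be sketched like this. To be clear, this is not calibre's actual code; `pick_encoding` and its `declared` parameter are invented for illustration, and only the BOM constants come from Python's standard `codecs` module.

```python
import codecs
from typing import Tuple

# Longer BOMs first: the UTF-32LE BOM begins with the same bytes as the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def pick_encoding(raw: bytes, declared: str = "utf-8") -> Tuple[str, bytes]:
    """Return (encoding, payload); a BOM, if present, beats the declared encoding."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, raw[len(bom):]
    return declared, raw

encoding, payload = pick_encoding(b"\xff\xfe3\x00", declared="utf-8")
print(encoding, repr(payload.decode(encoding)))
```

With a file like the one in this thread, the leading FF FE wins and the UTF-8 declaration inside the HTML is never consulted.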
08-15-2011, 02:40 PM   #7
Starson17
Wizard
Quote:
Originally Posted by kovidgoyal View Post
When detecting encodings in HTML files, calibre respects the BOM (byte order mark) over declared encodings. So if your HTML files start with a UTF-16 BOM, the encoding used will be UTF-16.
Yes, the file starts with a BOM, and looking it up, the BOM 0xFF 0xFE means UTF-16 little endian, which is exactly how it is encoded. I can also see what I did wrong when I tried to switch to UTF-8 - I had the wrong BOM in the data stream. I should have changed 0xFF 0xFE to 0xEF 0xBB 0xBF.

Again, thanks.
08-18-2011, 04:14 PM   #8
pietvo
Reader
Posts: 519
Karma: 24612
Join Date: Aug 2009
Location: Utrecht, NL
Device: Kobo Aura 2, iPhone, iPad
Yes, the file apparently is encoded in UTF-16. However, 00 97 is not a valid code for an emdash: U+0097 is a C1 control character with no visible glyph. You will have to replace it with the proper code (20 14). It seems somebody used the cp1252 code for emdash and thought it would function as a UTF-16 code. Such things happen a lot on MS systems (especially in email, BTW).
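If other characters of this kind turn up, a small Python pass over the decoded text can map them all in one go. This is only a sketch under the assumption that every character in the range U+0080 to U+009F was really a cp1252 byte; the helper name `fix_c1_controls` is hypothetical.

```python
import unicodedata

def fix_c1_controls(text: str) -> str:
    """Map C1 control characters to the cp1252 character with the same byte value."""
    fixed = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            # errors="replace" yields U+FFFD for the few bytes cp1252 leaves undefined.
            ch = bytes([ord(ch)]).decode("cp1252", errors="replace")
        fixed.append(ch)
    return "".join(fixed)

print(unicodedata.category("\x97"))  # general category of U+0097
repaired = fix_c1_controls("words\x97like this")
print(unicodedata.name(repaired[5]))
```

Properly encoded characters, such as smart quotes that already display fine, lie outside that range and pass through unchanged.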
08-18-2011, 04:25 PM   #9
Starson17
Wizard
Quote:
Originally Posted by pietvo View Post
Yes, the file apparently is encoded in UTF-16. However, 00 97 is not a valid code for an emdash: U+0097 is a C1 control character with no visible glyph. You will have to replace it with the proper code (20 14). It seems somebody used the cp1252 code for emdash and thought it would function as a UTF-16 code. Such things happen a lot on MS systems (especially in email, BTW).
You are a bit late, but thank you anyway. Yes, replacing 97 00 with 14 20 in the hex editor turned it into valid UTF-16. (The BOM indicated little endian byte order).