MobileRead Forums - View Single Post

sourcejedi · 07-11-2011, 04:42 AM

Nope, all those characters are pretty safe. What you're seeing is mojibake. You're using UTF-8, but the browser is decoding it as Latin-1 (ish). This is entirely plausible with epub. Your... content.opf file declares the HTML files as

application/xhtml+xml; charset=utf-8

but obviously you're not asking your browser to read the OPF file, only the HTML file.

It's possible your browser is defaulting to Latin-1 (ish). In which case, get a better browser to test with. Firefox will auto-detect compliant UTF-8.
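
If you want to see the effect for yourself, here is a minimal Python sketch of the wrong decoding. It assumes the "Latin-1 (ish)" encoding is Windows-1252 (Python's `cp1252`), which is what most browsers really mean by Latin-1. The `misdecode` helper is made up for illustration.

```python
def misdecode(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong encoding.

    Assumption: the browser falls back to Windows-1252. Bytes that have no
    meaning in that encoding are shown as U+FFFD instead of raising.
    """
    return text.encode("utf-8").decode(wrong_encoding, errors="replace")


# A few typical "fancy" characters, written as escapes so this source file
# itself stays pure ASCII.
samples = {"em dash": "\u2014", "ellipsis": "\u2026", "e-acute": "\u00e9"}

for name, ch in samples.items():
    print(f"{name}: {ch!r} -> {misdecode(ch)!r}")
```

Each single character comes out as two or three junk characters (an em dash turns into a-circumflex, a euro sign and a curly quote), which is the usual fingerprint of UTF-8 being read as Latin-1.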

The other obvious possibility is that your HTML files are lying. They may contain a <meta> tag which declares the encoding as Latin-1 or similar (look for "ISO-" followed by a numeric code, such as ISO-8859-1). Anything that expects XML will ignore that, but browsers which expect HTML will obey it.
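
A rough way to check for a lying file, sketched in Python. The function names are hypothetical, and the regex is only a quick scan for a `charset=` inside a <meta> tag, not a real HTML parser, so treat a hit as a hint to go and look at the file.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def declared_meta_charset(raw: bytes):
    """Return the charset named in the first matching <meta> tag, or None."""
    match = META_CHARSET.search(raw)
    return match.group(1).decode("ascii").lower() if match else None


def is_valid_utf8(raw: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_lying(raw: bytes) -> bool:
    """True if the file contains non-ASCII UTF-8 but its <meta> says otherwise."""
    declared = declared_meta_charset(raw)
    if declared is None or declared in ("utf-8", "utf8"):
        return False
    # Pure ASCII reads the same in UTF-8 and Latin-1, so nothing is wrong yet.
    return is_valid_utf8(raw) and not raw.isascii()
```

To use it, read each HTML file from the unpacked EPUB in binary mode (`open(path, "rb").read()`) and pass the bytes to `is_lying`.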

Finally, a technical note.

XHTML and HTML are actually different syntaxes. In HTML4 and below, they're technically incompatible, but browser-HTML is compatible. In HTML5, compatibility is possible. In both cases, complying with both HTML and XHTML imposes some extra restrictions. (See "polyglot markup" for the current draft recommendations).

E.g. you're supposed to stick to UTF-8, because that's the default for XML, and the declaration to specify a different encoding is not HTML-compatible. So no going insane and switching to obsolete encodings like UTF-16 :-).

If you want to make life easier for yourself, you'd be better off at least using the EPUBReader extension for Firefox. Then you can open the EPUB, Firefox will read your OPF file, and it should just work without having to change anything.

Second note: all the characters you mentioned will _display_ correctly, but there's a caveat with em dashes. Most dedicated e-readers are too dumb to break lines at em dashes - so you get very long words, which interfere with justification (assuming you use justification). Some people prefer to avoid them, and use en dashes with spaces instead.
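
If you do decide to swap them, something like this Python sketch would do it on plain text. The function name is made up, and it assumes you want exactly one space on each side of the en dash. Run it on the text content only, not blindly over markup.

```python
import re

EM_DASH = "\u2014"
EN_DASH = "\u2013"

# Also swallow any spaces or tabs already around the em dash, so the
# replacement never produces doubled spaces.
_EM_DASH_RUN = re.compile(r"[ \t]*" + EM_DASH + r"[ \t]*")


def em_to_spaced_en(text: str) -> str:
    """Replace each em dash with an en dash surrounded by single spaces."""
    return _EM_DASH_RUN.sub(f" {EN_DASH} ", text)


print(em_to_spaced_en("It was late\u2014very late\u2014when she left."))
```

An em dash at the very start or end of a line will pick up a stray space this way, so check those cases by hand.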

Third note: Apparently IDC5.5 is much better than previous editions, but people still end up having to look carefully at & tweak the generated XML. So you may well end up having to fix their code (although I would be surprised if they've managed to screw up basic character encoding for no good reason).
