View Full Version : Best practices: Special characters

07-11-2011, 05:04 AM
Is it true that special characters (single and double quotes, en dash, em dash, ellipsis, etc) will display differently on different e-readers?

I ask this because when previewing my epub's html files in Firefox, these characters display as bad code (e.g. em dash = â€“, single quote open = â€˜, single quote close = â€™), however they display fine in ADE and on the iPad using iBooks.

For the sake of compatibility across the board, would it be best to convert all characters to their html code equivalents? Is this a necessary process? What would be the best way to do it? (I'm thinking a find-all/replace-all for each character would be one approach, but perhaps there is a better way.)

NB: I am using IDC5.5 to export to epub, but I am guessing this question would apply to others using a different program for epub creation - correct me if I am wrong and I will add an ID to the thread title.
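For reference, the find-and-replace conversion the question describes could be sketched like this (Python used purely for illustration; the values are the standard numeric character references for each character):

```python
# Sketch: replace common typographic characters with numeric character
# references. The thread's conclusion is that this is unnecessary when
# files are valid UTF-8 -- this just shows the mechanical approach.
REPLACEMENTS = {
    "\u2018": "&#8216;",  # left single quote
    "\u2019": "&#8217;",  # right single quote
    "\u201c": "&#8220;",  # left double quote
    "\u201d": "&#8221;",  # right double quote
    "\u2013": "&#8211;",  # en dash
    "\u2014": "&#8212;",  # em dash
    "\u2026": "&#8230;",  # ellipsis
}

def entity_encode(text: str) -> str:
    for char, entity in REPLACEMENTS.items():
        text = text.replace(char, entity)
    return text
```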

07-11-2011, 05:39 AM
It should not be necessary. If you declare everything as UTF-8 it usually works fine. Of course the character has to be in the font of the reader app...

07-11-2011, 05:42 AM
Nope, all those characters are pretty safe. What you're seeing is mojibake. You're using UTF-8, but the browser is decoding it as Latin-1 (ish). This is entirely plausible with epub. Your... content.opf file is serving the HTML files as

application/xhtml+xml; charset=utf-8

but obviously you're not asking your browser to read the OPF file, only the HTML file.

It's possible your browser is defaulting to Latin-1 (ish). In which case, get a better browser to test with. Firefox will auto-detect compliant UTF-8.
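The mismatch is easy to reproduce directly. A quick sketch in Python, with Windows-1252 standing in for the browser's "Latin-1 (ish)" fallback (it's the codec browsers actually use when they say Latin-1):

```python
# UTF-8 encodes these typographic characters as three bytes each;
# mis-decoding those bytes as Windows-1252 yields three junk
# characters starting with 'â' -- exactly the garbage from the
# original post.
for char, name in [("\u2014", "em dash"),
                   ("\u2018", "open single quote"),
                   ("\u2019", "close single quote")]:
    mojibake = char.encode("utf-8").decode("cp1252")
    print(f"{name}: {char!r} -> {mojibake!r}")
```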

The other obvious possibility is that your HTML files are lying. They may contain a <meta> tag which declares it as Latin-1 or similar. (ISO- and a numeric code). Anything that expects XML will ignore that, but browsers which expect HTML will obey it.
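If you want to check for a lying <meta> tag across your files, a hypothetical helper like this would do (the regex is a rough sketch, not a real HTML parser, and `declared_charset` is just a name I've made up):

```python
import re

def declared_charset(html_text: str):
    """Return the charset declared in a <meta> tag, or None.

    Rough sketch: matches both <meta charset="..."> and the older
    http-equiv Content-Type form by looking for any charset= token.
    """
    m = re.search(r'charset=["\']?([A-Za-z0-9_-]+)', html_text,
                  re.IGNORECASE)
    return m.group(1).lower() if m else None
```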

Finally, a technical note.

XHTML and HTML are actually different syntaxes. In HTML4 and below, they're technically incompatible, but browser-HTML is compatible. In HTML5, compatibility is possible. In both cases, complying with both HTML and XHTML imposes some extra restrictions. (See "polyglot markup" for the current draft recommendations).

E.g. you're supposed to stick to UTF-8, because that's the default for XML, and the declaration to specify a different encoding is not HTML-compatible. So no going insane and switching to obsolete encodings like UTF-16 :-).

If you want to make life easier for yourself, you'd be better off at least using the EPUBReader extension for firefox. Then you can open the EPUB, firefox will read your OPF file, and it should just work without having to change anything.

Second note: all the characters you mentioned will _display_ correctly, but there's a caveat with em dashes. Most dedicated e-readers are too dumb to break lines at em dashes - so you get very long words, which interfere with justification (assuming you use justification). Some people prefer to avoid them, and use en dashes with spaces instead.
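That en-dash substitution is easy to automate - a minimal sketch, assuming Python and assuming it's acceptable to collapse any spaces already sitting around the em dash:

```python
import re

def em_to_spaced_en(text: str) -> str:
    """Replace each em dash (plus any surrounding whitespace) with
    space + en dash + space, giving dumb renderers a place to break."""
    return re.sub(r"\s*\u2014\s*", " \u2013 ", text)
```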

Third note: Apparently IDC5.5 is much better than previous editions, but people still end up having to look carefully at & tweak the generated XML. So you may well end up having to fix their code (although I would be surprised if they've managed to screw up basic character encoding for no good reason).
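One sanity check worth running on the generated files: verify that every exported HTML file actually decodes as UTF-8. A sketch in Python (assuming the exported files sit in one directory with a .html extension; `find_bad_utf8` is an illustrative name, not a real tool):

```python
from pathlib import Path

def find_bad_utf8(directory: str):
    """Return (filename, error) pairs for HTML files in the given
    directory that are not valid UTF-8."""
    bad = []
    for path in sorted(Path(directory).glob("*.html")):
        try:
            path.read_text(encoding="utf-8")
        except UnicodeDecodeError as exc:
            bad.append((path.name, str(exc)))
    return bad
```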

07-11-2011, 05:51 AM
Check that the character encoding in Firefox is set to Auto-Detect - Universal, or alternatively UTF-8. If it's set to anything else then it won't render the characters properly. ePub readers should all handle UTF-8 so I wouldn't worry about it.

07-11-2011, 11:17 PM
That's a relief, thanks all for your help!

Sourcejedi, I've actually been wondering why my epubs are made up of html files instead of xhtml files. I am exporting from ID, unzipping using Stuffit Expander, and each file has the html extension by default.

If I look at the source, each file starts with:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

Does that ensure I'm working in xhtml, or should my files also be using the xhtml extension?

07-12-2011, 02:31 AM
I had this hassle exporting from IDCS4. All manner of helpful hints from the forum wouldn't solve it. I used Notepad++ to search ALL files in the directory and find-and-replace. Since using CS5 I haven't had the problem. If you do track down the cause, please post it here.

07-12-2011, 03:20 AM
So all files should have the xhtml extension?

07-12-2011, 03:58 AM
No, the extension doesn't matter. The type of file is set by the doctype declaration, which correctly specifies XHTML 1.1, and the XML declaration specifies UTF-8 encoding. Everything's fine.

07-12-2011, 04:29 AM
Actually, what makes them XHTML is the OPF file. Open it up and search for xhtml :-), you'll see what I mean. But yes, there's nothing to worry about. And if you _are_ getting that wrong, it should show up when you run epubcheck, because epub doesn't allow normal html.

I was just trying to figure out why Firefox didn't get the right character encoding. As charleski pointed out, it might have been a configuration issue. But it was worth pointing out that your test with Firefox was out-of-spec.

Unless you're specifically trying to produce "polyglot" markup that works as both syntaxes, for some peculiar reason. Right now, you're using markup which only works in XHTML (again, this is fine for epub) -

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

(but for this specific issue, firefox _should_ autodetect the character encoding anyway, unless there's another problem, or firefox is misconfigured).