accents and entities in an epub


Steubie
12-13-2012, 11:56 AM
Currently I am finishing an index and ebook of a scholarly work -- seven languages, 467 footnotes, etc. The only remaining item prior to customer acceptance is to get the accented characters in French and German showing correctly.

My standard opening for XML files is:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">

This standard opening leaves all of the &Eacute; and &ocirc;, etc. showing in the text.

I have tried following up the DOCTYPE lines above with
<!ENTITY HTMLlat1 PUBLIC
"-//W3C//ENTITIES Latin 1 for XHTML//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">

Flight Crew and EPubCheck are very unhappy with this.

Any suggestions?

Toxaris
12-13-2012, 01:54 PM
If your file is actually encoded in UTF-8 (not just declared as such), it should not be a problem. Are you saying that the HTML entities remain in the rendered text or in the code? The second is no issue; the first is quite peculiar.
Also don't touch the DOCTYPE.

Steubie
12-13-2012, 03:36 PM
The file is entirely printable ASCII characters 0x20 through 0x7e along with 0x0a. This lets me create and manipulate data in a word processor.

The rendered text looks like this example: "Hippolyte Hemmer, Cl&eacute;ment de Rome: &Eacute;p&icirc;tre aux Corinthiens..." etc.

DiapDealer
12-13-2012, 04:21 PM
Check to be certain those entities aren't being xml escaped. If you entered (pasted, typed, whatever...) all the data from your Word Processor document into a WYSIWYG editor such as Sigil's Book View, that's likely to happen. Entities need to be pasted/typed into Code View (speaking strictly about Sigil here)... because they're, well... code. ;) Otherwise &Eacute; becomes &amp;Eacute;. Just like <p> becomes &lt;p&gt;. What, if anything, are you using to build/create the ePub from your word processor document?
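
A minimal Python sketch of the escaping problem described above. It assumes the input is plain text content (not markup) that already contains entities; the helper name `escape_preserving_entities` and the sample string are hypothetical, and this is not the code of any particular ePub builder.

```python
import re
from xml.sax.saxutils import escape

# An ampersand that does NOT already start a named or numeric reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")

def escape_preserving_entities(text: str) -> str:
    """Escape bare &, <, > but leave existing entity references alone."""
    text = BARE_AMP.sub("&amp;", text)
    return text.replace("<", "&lt;").replace(">", "&gt;")

source = "Cl&eacute;ment de Rome: &Eacute;p&icirc;tre, Smith & Jones"

# Naive escaping turns every ampersand into &amp;, so the entity shows as text.
naive = escape(source)
# Entity-aware escaping only touches the bare ampersand in "Smith & Jones".
careful = escape_preserving_entities(source)

print(naive)
print(careful)
```

The first line printed shows the symptom (entities visible in the rendered text); the second keeps the entities intact.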

Steubie
12-14-2012, 07:37 AM
I posted a response yesterday, but do not see it here.

DiapDealer -- You were correct. My MakeEpub program was changing ampersands to %amp;. I made changes in the source code. The ebook now shows accents correctly. Many thanks.

Steubie
12-14-2012, 07:38 AM
Correction to typo: make that &amp;

DiapDealer
12-14-2012, 08:13 AM
All's well that ends well. :)

dgatwood
12-19-2012, 05:36 PM
Warning: You should *not* use HTML entities like &eacute;. EPUB is based on XHTML, not HTML, and XHTML does not define any entities other than &amp;, &lt;, &gt;, &apos;, and &quot; (that is, &, <, >, ', and ", respectively).

That means that other HTML entities are not technically legal in an EPUB file, and a reader would be within its rights to barf if it encounters them. You should always replace those entities with numeric character references, e.g. & #233; or & #xe9; instead of &eacute;. (Omit the space after the & in both cases; I can't type them that way because this forum keeps translating them into the character itself.)

I originally tried to provide an incomplete list of some common substitutions in the form of Perl regular expressions, but the forum ate those, too. Here's the same list as text.


prime -> #8242
Prime -> #8243
ldquo -> #8220
rdquo -> #8221
lsquo -> #8216
rsquo -> #8217
mdash -> #8212. Suggest following this by character #8203 (zero-width space as a wrap hint).
ndash -> #8211. Again, suggest adding a zero-width space afterwards.
copy -> #169
trade -> #8482
deg -> #176
aacute -> #225
eacute -> #233
oacute -> #243
ntilde -> #241
iuml -> #239
ecirc -> #234
nbsp -> #160


For a full list, see http://www.fileformat.info/format/w3c/htmlentity.htm.
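
For anyone who would rather generate such a table than type it, here is a small Python sketch. It assumes the standard-library `html.entities.name2codepoint` mapping (which covers the HTML 4 entity names); the `NAMES` list and the `numeric_reference` helper are just illustrative.

```python
from html.entities import name2codepoint

# Entity names to look up; extend as needed.
NAMES = [
    "prime", "Prime", "ldquo", "rdquo", "lsquo", "rsquo", "mdash", "ndash",
    "copy", "trade", "deg", "aacute", "eacute", "oacute", "ntilde", "iuml",
    "ecirc", "nbsp",
]

def numeric_reference(name: str) -> str:
    """Return the decimal character reference for a named HTML entity."""
    return f"&#{name2codepoint[name]};"

for name in NAMES:
    print(f"{name} -> {numeric_reference(name)}")
```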

Jellby
12-20-2012, 01:56 PM
Warning: You should *not* use HTML entities like &eacute;. EPUB is based on XHTML, not HTML, and XHTML does not define any entities other than &amp;, &lt;, &gt;, &apos;, and &quot; (that is, &, <, >, ', and ", respectively).

Are you sure? I think that applies to XML, but XHTML adds some things on top of XML (or enforces XML syntax on HTML), among them, I believe, the definition of a good deal of entities.

See also http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Entities_representing_special_characters_in_XHTML

mrmikel
12-20-2012, 03:47 PM
At w3.org, there is this list of entities:
http://www.w3.org/2000/07/8378/xhtml/entities/entities.xml

DiapDealer
12-20-2012, 04:43 PM
It all depends on the parser and the XHTML DTD. I've never run into an ePub parser that couldn't handle them (assuming proper declarations), but I suppose it's possible. Perhaps someone is confusing xhtml1.1 and ePub2 with xhtml5 and ePub3? Named entities are no longer technically valid in that situation (http://idpf.org/accessibility/guidelines/content/xhtml/entities.php).
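
To illustrate the "depends on the parser" point, here is a rough Python sketch using the standard-library `xml.etree.ElementTree`, which parses as plain XML and loads no XHTML DTD. It is only one example of a parser that knows nothing about HTML entity names, not a model of any reading system; the `parses` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

def parses(fragment: str) -> bool:
    """Return True if the fragment is accepted by a DTD-less XML parser."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# Named HTML entity with no DTD to define it: rejected as an undefined entity.
print(parses("<p>Cl&eacute;ment</p>"))
# Numeric character reference: always valid XML.
print(parses("<p>Cl&#233;ment</p>"))
# One of the five entities predefined by XML itself: accepted.
print(parses("<p>Smith &amp; Jones</p>"))
```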

dgatwood
12-20-2012, 10:36 PM
Are you sure? I think that applies to XML, but XHTML adds some things on top of XML (or enforces XML syntax on HTML), among them, I believe, the definition of a good deal of entities.

See also http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Entities_representing_special_characters_in_XHTML

Apparently I misremembered. Never mind. :)

SusanM
04-15-2013, 10:31 PM
Kobo requests that you use decimal entities and not character entities. I assume that it would be the same for other retailers.

List of entities
http://www.derby.co.nz/web-development/entities.html

davidfor
04-16-2013, 01:32 AM
Kobo requests that you use decimal entities and not character entities. I assume that it would be the same for other retailers.

Is that for books to be converted to their kepub format?

Toxaris
04-16-2013, 02:19 AM
I haven't seen that request from Kobo and it sounds a bit silly though. It is much easier to type (and remember...) the named HTML entities than their number equivalent.

AlPe
04-18-2013, 05:06 PM
I too confirm that not using named entities is a wise decision, as they do not work properly on a couple of Reading Systems.

(Also, named entities can be converted into numeric entities with 5 lines of Python, for example.)
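
A sketch of what such a conversion might look like, assuming the standard-library `html.entities.name2codepoint` table; the function name `to_numeric` is made up for illustration. It leaves the five XML-predefined entities and any unrecognised names untouched.

```python
import re
from html.entities import name2codepoint

XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}
NAMED = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def to_numeric(text: str) -> str:
    """Replace named HTML entities with decimal character references."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        # Keep XML's own entities and anything we cannot look up.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return f"&#{name2codepoint[name]};"
    return NAMED.sub(replace, text)

print(to_numeric("Cl&eacute;ment de Rome: &Eacute;p&icirc;tre &amp; more"))
```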

Toxaris
04-19-2013, 02:54 AM
Can you give examples of those Reading Systems?

AlPe
04-19-2013, 04:33 AM
I remember that I personally saw some problems with Onyx devices due to an earlier version of crengine not supporting them; they later fixed the issue. Plus, on a non-RS side, I ran into a similar problem with a tool for R/W EPUB in a plugin for a CMS.

I admit: it is not a common problem, but I would rather be on the conservative side, as long as it can be done "cheaply".

SusanM
04-19-2013, 01:12 PM
I haven't seen that request from Kobo and it sounds a bit silly though. It is much easier to type (and remember...) the named HTML entities than their number equivalent.

It is buried in their sketchy vendor guide (attached). Page 5 in one of the lists. I think the reason is once everyone moves to EPUB3 and HTML5, decimal is the standard, not character entities.

SusanM
04-19-2013, 01:14 PM
It is buried in their sketchy vendor guide (attached). Page 5 in one of the lists. I think the reason is once everyone moves to EPUB3 and HTML5, decimal is the standard, not character entities.
I am changing this to post the link rather than the file:
https://duckduckgo.com/?q=vendor%27s_gudie_to_Kobo.pdf

Thanks for pointing this out!

SusanM
04-19-2013, 01:26 PM
I remember that I personally saw some problems with Onyx devices due to an earlier version of crengine not supporting them; they later fixed the issue. Plus, on a non-RS side, I ran into a similar problem with a tool for R/W EPUB in a plugin for a CMS.

I admit: it is not a common problem, but I would rather be on the conservative side, as long as it can be done "cheaply".

Definitely a good practice. I never use character entities.

AlPe
04-19-2013, 01:47 PM
Pedantic comment: in that document, they use the term "decimal entities", but I guess they meant "numeric entities" (i.e., decimal or hexadecimal), as I remember some of their examples containing hexadecimal entities.
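
To make the decimal/hexadecimal distinction concrete, a tiny Python sketch: both forms are numeric references to the same code point. It uses the standard-library `html.unescape`; the `decimal_to_hex` helper is hypothetical.

```python
import html
import re

DECIMAL_REF = re.compile(r"&#([0-9]+);")

def decimal_to_hex(text: str) -> str:
    """Rewrite decimal character references as hexadecimal ones."""
    return DECIMAL_REF.sub(lambda m: f"&#x{int(m.group(1)):x};", text)

# Decimal 233 and hexadecimal e9 name the same character.
same = html.unescape("&#233;") == html.unescape("&#xe9;")
print(same)
print(decimal_to_hex("caf&#233; &#8217;"))
```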

PeterT
04-19-2013, 02:20 PM
Sorry, didn't attach!

I love the fact that it contains the following:

This document contains confidential and proprietary information of Kobo Inc. Its receipt or possession does not convey any ownership rights therein, or any rights to reproduce or disclose its contents or to manufacture, use, or sell it or anything it may describe. Reproduction, disclosure, or use without specific written authorization of Kobo Inc. is strictly forbidden. Kobo Inc. reserves the right to update this document without notice to its client vendors unless otherwise agreed to.


You sure you have the rights to post this here?

JSWolf
04-19-2013, 02:34 PM
OK, be honest here, who knows the numeric entities for the following...

&amp;
&nbsp;
&mdash;
&lsquo;
&rsquo;
&ldquo;
&rdquo;

Jellby
04-19-2013, 03:34 PM
&rsquo;

#8217

I know that one because I often use it for apostrophes (when I want to make them different from a real closing single quote).

SusanM
04-26-2013, 11:58 AM
Good point. Specifications tend to use their own terms. I guess decimal and numeric mean the same thing!

SusanM
04-26-2013, 12:01 PM
I love the fact that it contains the following:


You sure you have the rights to post this here?

Changed to a link. Thanks for pointing that out, Peter.

SusanM
04-26-2013, 12:04 PM
Am I sounding pedantic? Sorry if I am....just wanted to share the pitfalls that I have encountered...

DomesticExtremis
04-26-2013, 01:34 PM
OK, be honest here, who knows the numeric entities for the following...

&amp;
&nbsp;
&mdash;
&lsquo;
&rsquo;
&ldquo;
&rdquo;

http://www.w3schools.com/tags/ref_entities.asp

Google - makes goldfish of us all :)

DaleDe
04-26-2013, 04:22 PM
http://www.w3schools.com/tags/ref_entities.asp

Google - makes goldfish of us all :)

I am sure JSWolf knows how to look them up but his point was that you could code the names by memory but no one is likely to memorize the numbers as they are arbitrary without any intrinsic intelligence.

Dale

JSWolf
04-27-2013, 11:05 PM
I am sure JSWolf knows how to look them up but his point was that you could code the names by memory but no one is likely to memorize the numbers as they are arbitrary without any intrinsic intelligence.

Dale

Exactly. All of those names came from memory. But for the numeric values, I would have to look them up.