Epub format, B & N PubIt!, and HTML character entities


jlandahl
04-05-2011, 05:47 PM
I've been working on learning epub format for some time and thought I'd mastered the basics when I finally got to the point where my ebooks could pass epubcheck 1.1. However, when I uploaded a book to Barnes & Noble's PubIt! recently, I noticed that certain 3-digit ISO 8859/1 character entities that went smoothly through epubcheck and rendered as intended in Calibre and on the Kindle were displayed as question marks in the Nook preview! The workaround was to replace them with the corresponding 4-digit Unicode character entities, but now I wonder which type to use for other devices like the iPad and the Sony Reader.

Five special characters are involved that I know of. These are the ones for smart or slanted single and double quotes and the one for the en dash. To my surprise, the one for the copyright symbol does render as intended in the Nook preview. In general, 3-digit numeric character entities can be relied on to be supported by most ebook readers, but this particular set was also an exception to this rule in Microsoft Reader.

Here's the list of special characters, with the 3-digit character entity and the 4-digit equivalent for each:
left single quote &#145; &#8216;
right single quote &#146; &#8217;
left double quote &#147; &#8220;
right double quote &#148; &#8221;
en dash &#150; &#8211;
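
The workaround described above (swapping each 3-digit reference for its 4-digit Unicode equivalent) can be scripted. The following is only a minimal sketch, assuming the 3-digit codes are the Windows-1252 positions 145-148 and 150 and that the markup uses decimal references; the function name fix_numeric_refs is made up for illustration.

```python
import re

# Windows-1252 slot -> Unicode code point for the five characters in the list.
CP1252_TO_UNICODE = {145: 8216, 146: 8217, 147: 8220, 148: 8221, 150: 8211}

def fix_numeric_refs(markup: str) -> str:
    """Rewrite decimal references such as &#146; to their Unicode equivalents.

    References not in the table (for example &#169;) keep their value.
    """
    def swap(match):
        code = int(match.group(1))
        return "&#%d;" % CP1252_TO_UNICODE.get(code, code)

    return re.sub(r"&#(\d+);", swap, markup)

print(fix_numeric_refs("&#147;Stop &#150; prisoner &#145;94621&#146;,&#148; &#169;"))
```

Hexadecimal references (&#x92; and so on) are not handled here; a real conversion script would need to cover those too.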

Here's an example of the HTML and how it renders.

&#147;Stop in the name of the law &#150; I recognize you, prisoner &#145; 94621&#146;,&#148; he cried out!

Intended rendering:
“Stop in the name of the law – I recognize you, prisoner ‘94621’,” he cried out!

Nook preview:
?Stop in the name of the law ? I recognize you, prisoner ? 94621?,? he cried out!

The 3-digit character entity that does render properly in the Nook preview is &#169; (©), the copyright symbol.

By the way, I can't completely rule out that this rendering issue comes from a recent update to the Nook software. The two ebooks I released earlier for the Nook turn out, on examination, to have the ?'s too, and I'm surprised I overlooked them during my initial Nook previewing, which I thought was pretty thorough.

Comments, anyone? Have I missed some subtlety of epub format?

Jellby
04-06-2011, 04:46 AM
> I noticed that certain 3-digit ISO 8859/1 character entities that went smoothly through epubcheck and rendered as intended in Calibre and on the Kindle were displayed as question marks in the Nook preview! The workaround was to replace them with the corresponding 4-digit Unicode character entities, but now I wonder which type to use for other devices like the iPad and the Sony Reader.

> The 3-digit character entity that does render properly in the Nook preview is &#169; (©), the copyright symbol.

Your 3-digit codes probably refer to a Windows codepage encoding, while ePUB requires everything to be in Unicode. The quote marks sit in different slots in the two encodings; the copyright symbol happens to be in the same slot (A9 = 169) in both.

Use Unicode references everywhere (or input the characters directly in UTF-8) and it should be fine; otherwise you are asking for problems, even if it sometimes works (mainly because you are lucky). Or use real named entities: &ldquo; &rdquo; &lsquo; &rsquo; &ndash; &copy;
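
To see the slot difference concretely, the byte values can be decoded as Windows-1252 and compared with the resulting Unicode code points. This is a small sketch, assuming the codepage in question is Windows-1252 (Python's "cp1252" codec); the helper name cp1252_to_unicode is made up for illustration.

```python
def cp1252_to_unicode(code: int) -> int:
    """Return the Unicode code point for a single Windows-1252 byte value."""
    return ord(bytes([code]).decode("cp1252"))

# The five problem characters plus the copyright symbol.
for code in (145, 146, 147, 148, 150, 169):
    print(f"&#{code}; -> &#{cp1252_to_unicode(code)};")
```

The quotes and the en dash move to 4-digit code points, while 169 maps to itself, which fits the observation that only the copyright symbol survived in the Nook preview.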

jlandahl
04-06-2011, 10:18 PM
Thank you for the helpful advice. I got away from using named character entities like &ldquo; some time ago because they weren't always properly rendered, whereas the numeric ones were. For example, in my original post on this board, &amp; didn't render properly and I had to use &#38;.

It's interesting that epubcheck 1.1 doesn't catch non-Unicode character entities, and that, given that the 3-digit codes are Windows ones, they were never rendered properly by Microsoft Reader!

Jellby
04-07-2011, 04:38 AM
> Thank you for the helpful advice. I got away from using named character entities like &ldquo; some time ago because they weren't always properly rendered, whereas the numeric ones were. For example, in my original post on this board, &amp; didn't render properly and I had to use &#38;.

That's strange, I've never had any problem with named entities.

> It's interesting that epubcheck 1.1 doesn't catch non-Unicode character entities, and that, given that the 3-digit codes are Windows ones, they were never rendered properly by Microsoft Reader!

Well, they are not exactly non-Unicode; it's just that in Unicode that particular slot is not assigned to the character you want. For example, ’ is #8217 in Unicode and #146 in Windows-1258, but #146 in Unicode is just a control character (Private Use 2). If you use #146 in your code, epubcheck has no way of knowing whether you wanted the right single quote or the control character.
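
A validator can't guess intent, but a pre-upload script can at least flag references that land in the Unicode control range 128-159, since those are rarely what an author means. This is a rough sketch, not something epubcheck does; the function name suspicious_refs is made up, and only decimal references are checked.

```python
import re
import unicodedata

def suspicious_refs(markup: str) -> list:
    """Return decimal character references that fall in the C1 control range."""
    codes = [int(m) for m in re.findall(r"&#(\d+);", markup)]
    return [c for c in codes if 0x80 <= c <= 0x9F]

# U+0092 is a control character (category "Cc"); U+2019 is the real quote.
print(unicodedata.category(chr(146)))
print(unicodedata.name(chr(8217)))
print(suspicious_refs("It&#146;s &#8217; &#169;"))
```

Anything the function reports is a candidate for replacement with the proper Unicode reference or the literal UTF-8 character.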