Possible bug with E-book Viewer

tornado5528 · 02-22-2014, 09:32 AM

I ran into an issue viewing a book in the E-book Viewer on Windows 7 with the latest calibre 1.25 (64-bit).
Some punctuation, specifically apostrophes and quotation marks, was not displayed correctly. Somebody a lot more knowledgeable on the subject than I am suggested the following:

I re-read the relevant parts of the OPS specification, and this is definitely a reader bug. The book is valid as it stands: the XML prolog for the component OPS documents is not required, and encoding-related <meta> tags in a document's <head> element must not cause the reader to use the wrong encoding (only UTF-8 and UTF-16 are allowed for EPUB).
Section 1.3.1: Relationship to XML states:

Quote:

Reading Systems must be XML processors as defined in XML 1.0. All OPS Content Documents must be valid XML documents according to their respective schemas.

The XML 1.0 specification Section 2.8 Prolog and Document Type Declaration states that the XML declaration (what I'm loosely calling the prolog) is optional:

Quote:

[Definition: XML documents SHOULD begin with an XML declaration which specifies the version of XML being used.]

Note - should, not must. Even if the prolog weren't optional, its encoding attribute is optional as well; only the version attribute is mandatory if the content producer chooses to include the prolog.
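
To illustrate the point, here is a minimal sketch using Python's standard xml.etree.ElementTree parser (a stand-in chosen for illustration; it is not what calibre or any reading system is known to use). It parses a UTF-8 fragment with no prolog at all, and the same fragment with a prolog that has a version but no encoding attribute. The sample text is made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample: a paragraph containing a curly apostrophe (U+2019), encoded as UTF-8.
body = "<p>It\u2019s fine</p>".encode("utf-8")

# No XML declaration at all: the parser falls back to UTF-8.
no_prolog = ET.fromstring(body)

# Declaration with version only, no encoding attribute: same result.
version_only = ET.fromstring(b'<?xml version="1.0"?>' + body)

print(no_prolog.text == version_only.text == "It\u2019s fine")
```

In both cases a conforming XML processor reads the apostrophe correctly without being told the encoding.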
Back to the OPS spec, section 1.3.6: Relationship to Unicode states:

Quote:

Publications may use the entire Unicode character set, using UTF-8 or UTF-16 encodings, as defined by Unicode (see http://www.unicode.org/unicode/standard/versions). The use of Unicode facilitates internationalization and multilingual documents. However, Reading Systems are not required to provide glyphs for all Unicode characters.

In other words, it's mandatory to use either UTF-8 or UTF-16. Distinguishing between UTF-8 and UTF-16 is trivial (doubly so since all UTF-16 content I've ever seen includes a byte order mark) and autodetection should never, ever fail.
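
A rough sketch of what such autodetection could look like, assuming the document is known to be either UTF-8 or UTF-16 as the spec requires. The function name detect_ops_encoding is hypothetical, and the no-BOM check assumes the document starts directly with '<' (no leading whitespace).

```python
import codecs


def detect_ops_encoding(data: bytes) -> str:
    """Guess whether raw OPS document bytes are UTF-8 or UTF-16 (hypothetical helper)."""
    # A byte order mark settles the question immediately.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: an XML document starts with '<', so UTF-16 shows a NUL byte
    # next to it in the first two bytes.
    if data[:2] == b"<\x00":
        return "utf-16-le"
    if data[:2] == b"\x00<":
        return "utf-16-be"
    # Anything else is treated as UTF-8, the only remaining legal option.
    return "utf-8"


sample = "<p>It\u2019s</p>"
for codec in ("utf-8", "utf-16", "utf-16-le", "utf-16-be"):
    raw = sample.encode(codec)
    guessed = detect_ops_encoding(raw)
    print(codec, "->", guessed, raw.decode(guessed) == sample)
```

The returned names are all standard Python codec names, so the result can be passed straight to bytes.decode().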
The spec continues with:

Quote:

Reading Systems must parse all UTF-8 and UTF-16 characters properly (as required by XML). Reading Systems may decline to display some characters, but must be capable of signaling in some fashion that undisplayable characters are present. Reading Systems must not display Unicode characters merely as if they were 8-bit characters. For example, the biohazard symbol (0x2623) need not be supported by including the correct glyph, but must not be parsed or displayed as if its component bytes were the two characters "&#" (0x0026 0x0023).

yet this is exactly what calibre 1.25 on Windows appears to do.
The problem is that the OPS documents in the book specify the wrong content encoding via

Code:

<meta content="text/html; charset=iso-8859-1" http-equiv="content-type"/>

i.e. they're UTF-8 encoded but the <meta> element declares ISO-8859-1. However, this is still a reader bug; the renderer isn't supposed to override the encodings mandated in the spec just because the publisher included a non-normative <meta> element. The specification defines only the form of <meta>, not its semantics, so those semantics aren't normative and can't override the normative parts of the spec.
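
The symptom is easy to reproduce outside any reader. The sketch below is only an illustration: it assumes the renderer treats the iso-8859-1 label as Windows-1252, as web browsers commonly do, and the sample text is made up.

```python
# Made-up sample text with a curly apostrophe (U+2019).
text = "It\u2019s"

# What is actually stored in the file: UTF-8 bytes (E2 80 99 for the apostrophe).
raw = text.encode("utf-8")

# Decoding per the spec gives the original text back.
right = raw.decode("utf-8")

# Decoding per the misleading <meta> turns one character into three.
wrong = raw.decode("cp1252")

print(repr(right))
print(repr(wrong))
```

Each of the three UTF-8 bytes of the apostrophe is shown as a separate 8-bit character, which is the behavior the spec passage quoted above forbids.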
Furthermore, the book passes epubcheck 3.0.1 (minus the bogus warnings about the XPGT template, which is an acknowledged epubcheck bug) and FlightCrew.
Clearly a calibre bug, specifically on Windows, sorry.

I don't know what to suggest, besides opening a ticket on the calibre bug tracker.

Perhaps downgrading to a version which didn't have the bug might be an option. I don't know which one that would be; I have no Windows machine to test on.

Edit: I'm not defending the publisher here - there's a big difference between "technically correct" and "reasonable"; they could have easily not included that <meta> element there and then the problem wouldn't have existed in the first place. That <meta> is literally lying - it makes a factually wrong statement about the content. Still, calibre's viewer should have handled this according to spec, that's why I claim this situation to be a reader bug.

Edit 2: Hmm, calibre uses QWebPage via PyQt to render the content. This might be a PyQt bug on Windows; calibre definitely does the right thing when loading the content:

Quote:

src/calibre/gui2/viewer/documentview.py wrote:

Code:

load_html(path, self, codec=getattr(path, 'encoding', 'utf-8'), mime_type=getattr(path,
            'mime_type', 'text/html'), pre_load_callback=callback)

Edit 3: Reconsidered the above; this would still be a calibre bug. calibre specifies the default encoding as UTF-8 but doesn't force that encoding, so QWebPage performs standard autodetection and ends up on ISO-8859-1 due to that stupid <meta> element. QWebPage has no way of magically knowing it's only supposed to be either UTF-8 or UTF-16, because it's a generic renderer and not a specialized EPUB renderer, so it silently does the wrong thing. calibre should perform character encoding detection by itself and force that encoding when rendering. Not sure why this happens only on Windows. I build calibre on my own and it's linked against Qt 4.8.4 via PyQt 4.10. No idea which versions the Windows port uses.
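
One way "detect and force" could work, sketched here purely as an illustration and not as calibre's actual code: before handing the bytes to a generic renderer, check whether they are valid UTF-8 and, if so, rewrite any conflicting charset in the <meta> element. The function name force_utf8_declaration and the regular expression are hypothetical. Files that are not valid UTF-8 are left untouched, so a correctly declared legacy encoding is still honored.

```python
import re

# Hypothetical pattern: captures everything in a <meta ...> tag up to the
# charset value, then the value itself.
_META_CHARSET = re.compile(
    rb'(<meta\b[^>]*?charset\s*=\s*["\']?)([A-Za-z0-9_\-:.]+)',
    re.IGNORECASE,
)


def force_utf8_declaration(data: bytes) -> bytes:
    """If data is valid UTF-8, make any <meta> charset say so (sketch only)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: trust whatever the document declares.
        return data
    # Replace only the charset value, keeping the rest of the tag intact.
    return _META_CHARSET.sub(rb"\1utf-8", data)


page = (
    '<head><meta content="text/html; charset=iso-8859-1" '
    'http-equiv="content-type"/></head><p>It\u2019s</p>'
).encode("utf-8")
fixed = force_utf8_declaration(page)
print(fixed.decode("utf-8"))
```

A plain regular expression is a shortcut here; a real implementation would more likely work on the parsed document.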

Would somebody mind looking into this, please?

kovidgoyal · 02-22-2014, 11:47 AM

This is a wontfix. The viewer works with lots of formats other than epub. And regardless of what the spec says, actual epubs in the wild occur in lots of encodings other than UTF-8/16. By forcing UTF-8 encoding I would be penalising people that declare their encoding properly, but do not follow the spec, in favor of people that follow the spec but do not declare the encoding correctly.

The former set of people are far dearer to my heart.

tornado5528 · 02-22-2014, 01:05 PM

That's fair enough. Thank you for taking the time to look and respond.

