EPUB2 and the DOCTYPEgate - Page 2

DiapDealer · 02-24-2014, 09:27 AM

Quote:

Originally Posted by roger64

It's not XML but pure XHTML though...

XHTML still validates as XML, does it not? It's in the XML family (at least insofar as it conforms to XML rules), thus meeting the "must be valid xml with xhtml" requirement. Whether a specific version of XHTML is required by the EPUB 2.0.1 specs is where things get fuzzy (or go away entirely).

DaleDe · 02-24-2014, 09:47 AM

Quote:

Originally Posted by Doitsu

It does indeed more or less disappear when the files are zipped, but since the reading software has to unzip the .html files in order to render the text and some older readers have problems with .html files significantly larger than 280 KB, IMHO, there's no point in using utf-16 in the first place.

AFAIK, utf-16 only needs to be used with certain APIs that expect utf-16 formatted strings and for the proper handling of some CJK characters.

Actually UTF-16 is preferred for some languages. UTF-8, as you point out, is perfect for English and Latin-based languages, but it is a variable-length encoding and will actually produce a larger file than UTF-16 for some languages, which is why UTF-16 is in the standard. Check our wiki on Unicode.

Dale

Doitsu · 02-24-2014, 11:55 AM

Quote:

Originally Posted by DaleDe

Actually UTF-16 is preferred for some languages. UTF-8, as you point out, is perfect for English and Latin based languages but it is a variable length and will actually produce a larger file than UTF-16 for some languages which is why it is in the standard.

Chinese utf-16 files are on average only 30% smaller than the corresponding utf-8 files. However, English utf-16 files are on average twice the size of the corresponding utf-8 files.

I.e., the size advantage isn't that great, even for Chinese texts, which benefited the most from the introduction of the utf-16 standard.
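The size trade-off discussed above is easy to check for any given text. Here is a minimal Python sketch that compares encoded byte lengths; the two sample strings are made up for illustration and are not taken from a real book.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, both without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical sample strings, chosen only for illustration.
english = "The quick brown fox jumps over the lazy dog."
chinese = "敏捷的棕色狐狸跳过了懒狗。"

for label, sample in (("English", english), ("Chinese", chinese)):
    utf8, utf16 = encoded_sizes(sample)
    print(f"{label}: utf-8 = {utf8} bytes, utf-16 = {utf16} bytes")
```

Pure ASCII text doubles in size in UTF-16, while pure CJK text from the Basic Multilingual Plane shrinks by a third. A real XHTML file mixes CJK text with ASCII markup, so the saving for a whole file will be smaller than for the bare text.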

Arios · 02-24-2014, 04:23 PM

If we return to Roger's initial question ("Kovid Goyal writes that the DOCTYPE is required only when there are named entities (like nbsp)"), I think the warning from Kovid's Edit Book module (when doing a Run check) provides a good explanation:

Quote:

Warning [...]
OEBPS/Text/auteur.xhtml
Named entities are often only incompletely supported by various book reading software. Therefore, it is best to not use them, replacing them with the actual characters they represent. This can be done automatically.

(The italics are mine.)

However, it remains to be seen whether the named entities really create problems, at least with digital eInk readers.
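For reference, the automatic replacement the warning mentions could look roughly like the Python sketch below. It is not Calibre's implementation; the function name `replace_named_entities` is made up, and the sketch uses the HTML 4 entity table from Python's standard library. It deliberately leaves the five entities predefined by XML untouched, and it does not special-case comments or CDATA sections.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself must stay escaped.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_named_entities(markup):
    """Replace known named entities with the characters they represent."""

    def substitute(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown entities alone
        return chr(name2codepoint[name])

    return ENTITY_RE.sub(substitute, markup)


print(replace_named_entities("<p>Caf&eacute;&nbsp;&amp; th&eacute;</p>"))
```

The result has to be saved in an encoding that can represent the characters (UTF-8 in practice), and, as noted in this thread, whether a device then renders them still depends on its fonts.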

DiapDealer · 02-24-2014, 04:54 PM

Characters or entities; the devices can't display them if the glyphs they represent aren't part of their system fonts (assuming no fonts are embedded for this purpose).

Face it. Once you get beyond the extended Latin subset, there's a fairly substantial risk that neither the entity NOR the character will render properly without fonts being embedded (on most of the popular epub readers out there).

Toxaris · 02-25-2014, 02:24 AM

I am still wondering which devices/reading applications do not support named entities when DOCTYPE is used...

skreutzer · 07-18-2014, 07:49 PM

I've worked on this topic quite extensively lately (in terms of implementation), so here's a short overview of the technical aspects: a doctype declaration isn't an ordinary XML tag or even a processing instruction; it's a separate notation inherited from SGML. Its purpose is to tell an XML processor which XML format is in use (RSS, HTML, SVG, whatever) and which version of it. If the doctype declaration is missing, an XML processor has no way to tell the actual format except by reading, interpreting and then guessing. Validators are XML processors too, and they usually rely on an XML file describing itself through its doctype, be it for XHTML or EPUB. As a fallback, the user can manually pick the doctype to validate against, but the wrong choice can produce a misleading valid/invalid result, because the file gets validated against the wrong Schema/DTD instead of the one it is supposed to conform to.

With EPUB2, as XHTML 1.1 is required, an EPUB validator can validate against XHTML 1.1 regardless of the doctype: the EPUB version is specified in the OPF file, so validation of any application/xhtml+xml item referenced in the OPF manifest will simply fail if the item doesn't comply with XHTML 1.1, whatever its doctype declaration says.
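To illustrate the "self-description" point, here is a small Python sketch that reads the doctype declaration of a document with the standard library's expat bindings. The helper name `read_doctype` and the two sample documents are made up for the example; expat does not fetch the external DTD by default, so nothing is downloaded.

```python
import xml.parsers.expat


def read_doctype(document):
    """Return (root_name, public_id, system_id), or None if there is no doctype."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.append((name, public_id, system_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)
    return found[0] if found else None


# Hypothetical sample documents.
WITH_DOCTYPE = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    "<body><p>text</p></body></html>"
)
WITHOUT_DOCTYPE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>text</p></body></html>'
)

print(read_doctype(WITH_DOCTYPE))
print(read_doctype(WITHOUT_DOCTYPE))
```

With the declaration present, a generic tool can look up the format by its public identifier; without it, the tool only has the root element name and namespace to go on.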

But that's only one side of the story, and the other side is equally important: XML is a universal encoding standard for text-based formats, so there are lots of powerful tools and programming libraries out there that provide interoperability (conversion) between XML formats. Those tools don't have to know the details of the very specific EPUB or XHTML context; they're built to operate on lots of different XML-based formats. As the XHTML files of an EPUB container can easily be extracted and processed, it's very likely that such tools will encounter them at one place or another, especially in modern processing workflows and publishing systems (or the web technology stack). So if such software doesn't have special handling for XHTML by default (and why should it?), it may run into the following problem: when it encounters an entity, it won't know what it means or what to do with it. Entities are a DTD mechanism for text replacement. They can be used to encode (mask, escape) special characters as in HTML, but also to define text portions centrally, so they can be used in several places without being redefined or placed literally into every single file. If the XML processor is instructed to replace entities with their actual meaning, it obviously has to get that meaning from somewhere. The doctype declaration provides a unique identifier for the corresponding DTD with the entity definitions in it (both "official" names and URIs are used for this, the latter because they follow the principle of worldwide file path referencing, which is unambiguous), and then it's up to the XML processor to determine whether the DTD is available. Note that none of this is about DTD validation; it's the bare minimum needed to read an XML file properly with a universal XML processor.

To provide the corresponding DTD, a user usually has to configure it in the processing software by hand, because the URLs in the doctype declaration are only identifiers and, contrary to what one might assume, don't specify a download location (which would be useless in an offline environment anyway). In the case of HTML/XHTML, however, people just abused the DTD URLs for download attempts, and as the W3C initially didn't provide any files at those locations, XML processing of HTML/XHTML failed. After lots of complaints, the W3C put the DTDs up at the URLs used as identifiers, but soon saw enormous traffic from a wide range of XML tools and libraries that all simply tried to download the DTDs automatically from W3C servers. To this day they artificially delay responses to such download attempts and block IP ranges and user agents, hoping that those tools and libraries will either ship a predefined catalog of locally stored DTDs (although the licensing of the W3C documents is a huge obstacle to that) or at least cache DTDs once downloaded.

At the same time, XHTML 1.1 is based on the concept of modularization (I guess the idea is to reuse unchanged modules in future HTML versions, such as XHTML5), so the previously monolithic DTD is split into a set of individual parts. The full XHTML 1.1 DTD consists of around 38 files, all of which may be needed to process an XHTML 1.1 file with a universal XML processor, as any of the DTD modules could contain an entity definition and therefore has to be available.
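The entity problem is easy to reproduce with a general-purpose parser. The sketch below uses Python's `xml.etree.ElementTree` as a stand-in for a "universal XML processor"; the helper `try_parse` and the sample fragments are made up, and the entity declared in the internal DTD subset merely stands in for a DTD that the processor has available locally.

```python
import xml.etree.ElementTree as ET


def try_parse(document):
    """Return the root element's text, or an error marker if parsing fails."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError as error:
        return f"parse error: {error}"


# No DTD anywhere: the parser cannot know what &nbsp; stands for.
undefined = "<p>a&nbsp;b</p>"
# Entity declared in the internal DTD subset: the parser can expand it.
declared = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>a&nbsp;b</p>'
# Numeric character reference: needs no DTD at all.
numeric = "<p>a&#160;b</p>"

for sample in (undefined, declared, numeric):
    print(repr(try_parse(sample)))
```

The first fragment is rejected outright, while the other two yield the same text with a no-break space. No validation is involved in any of the three cases, which matches the point above that entity resolution is a reading problem and not a validation problem.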

Now I can imagine that you guys will call for a pragmatic solution. Well, that's not so easy. As XHTML uses the entity mechanism only for special character replacements, it's already recommended to use the UTF-8 equivalents instead of the entities. The entities won't even be present any more in future DTDs (XHTML5 probably doesn't have them already, but I'm not sure about that), so they will soon be both invalid and unreadable for XML processors. The only exceptions are the ones built into basic XML: the predefined entities for greater-than, less-than, the ampersand and the two quote characters, plus numeric character references for Unicode characters. Still, the entity mechanism is quite useful for purposes other than character encoding (where it has obviously been superseded by the Unicode standard). Modularization of XHTML is a smart way to reduce the total amount of DTD space required, as parts can be reused instead of being copied for each new version or for use in other XML-based formats, so hopefully that won't go away. The doctype declaration will remain as "metadata" (think of it as the mimetype of the file) that specifies the actual format of an XML-based file, which is very important for XML processors and validators. Omitting the doctype declaration in an XHTML 1.1 file of an EPUB will most likely work in a pure, isolated EPUB context as long as UTF-8 characters are used instead of special character entities, but it will cause problems if such an EPUB is used in other contexts. And be sure that with modern publishing and the future web such other uses and contexts will develop, so it's up to you whether you want to participate now or revisit your files at a later date.

In the case of Calibre, I assume that Kovid Goyal doesn't want to make the effort to comply with the standard (he doesn't appreciate standards in general; at least that's what he stated in a much easier case than XHTML entities) and gets away with it with the help of a custom XML parser, which uses special tricks to avoid those issues and to find "solutions" that work in a pure EPUB context. But unlike Kovid Goyal, other people like me don't want to reinvent the wheel all the time by implementing special parsers for every custom XML format that processing tools may ever encounter, so EPUBs produced by Calibre will run into problems here and there. On the other hand, if Calibre doesn't use the DTD entity mechanism internally at all (no general-purpose XML processing library, but a custom parser instead), it should be an absolutely easy fix to just write the correct doctype declaration into the XHTML files of the EPUB2.
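As a rough idea of how small such a fix could be, here is a Python sketch that adds the XHTML 1.1 doctype declaration to a file's text when none is present. It is not taken from Calibre or Sigil; the function name `ensure_doctype` is made up, and the plain text search would be fooled by a "<!DOCTYPE" that appears inside a comment.

```python
import re

XHTML11_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"\n'
    '  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
)
# An XML declaration, if present, must be the very first thing in the file
# (optionally preceded by a byte order mark).
XML_DECL_RE = re.compile(r"^\ufeff?<\?xml\s[^>]*\?>")


def ensure_doctype(markup):
    """Insert the XHTML 1.1 doctype unless the markup already declares one."""
    if "<!DOCTYPE" in markup:
        return markup  # already declared; leave the file as it is
    match = XML_DECL_RE.match(markup)
    if match:
        end = match.end()
        return markup[:end] + "\n" + XHTML11_DOCTYPE + markup[end:]
    return XHTML11_DOCTYPE + "\n" + markup
```

The doctype has to come after the XML declaration and before the root element, which is why the sketch looks for the declaration first. Running it twice on the same text changes nothing the second time.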

From the user's perspective, the decision (between Calibre and, for instance, Sigil or whatever else writes a doctype declaration) is short-term vs. long-term. Calibre is capable of producing EPUBs that "work", but you might have to revisit them some time in the future.

Last edited by skreutzer; 07-18-2014 at 08:01 PM.
