MobileRead Forums - View Single Post

skreutzer · 04-14-2014, 01:41 PM

Well, since you've provided an EPUB3 example in your last post, I'm concluding that you're aiming to construct an EPUB3 file instead of EPUB2 (which wasn't explicitly stated in your initial post). Depending on the version of the EPUB standard, the package and the packaged files have to look different. Validating your example file with the IDPF EPUB validator results in

Code:

Type: ERROR, File: Text/index.html, Line: 8, Position: 786, Message: value of attribute "href" is invalid; must be a UR

With this information, the source of the problem can be located. After unpacking your example file and looking at the mentioned index.html file in the "Text" subdirectory, it already looks suspicious, because EPUB3 uses HTML5, but index.html looks like an incomplete XHTML file. When index.html is validated with the W3C HTML validator, it at first seems to be OK, but as the validator points out in the warning message

Code:

No DOCTYPE found! Checking with default XHTML 1.0 Transitional Document Type.

that's only because the validator wasn't instructed to do HTML5 validation, and so it picked XHTML 1.0 Transitional, which results in validation success. However, in order to trigger HTML5 validation in the W3C validator and satisfy the requirement of HTML5 for EPUB3, you need to place the HTML5 doctype

Code:

<!doctype html>

between the XML processing instruction and the root element:

Code:

<?xml version='1.0' encoding='utf-8'?>
<!doctype html>
<html xmlns="http://www.w3.org/1999/xhtml">

Now validating with the W3C validator leads to a rather different result: 3 errors, 2 warnings. The first complaint is

Code:

The character encoding was not declared. Proceeding using windows-1252.

and the second is

Code:

Saw <?. Probable cause: Attempt to use an XML processing instruction in HTML. (XML processing instructions are not supported in HTML.) <?xml version='1.0' encoding='utf-8'?>

which are both somewhat connected. Unfortunately, for whatever reason, the HTML syntax of HTML5 isn't XML-compatible in this respect, and so there's no way to specify the character encoding in the XML processing instruction as one would usually expect. Instead, the encoding needs to be specified as a meta element in the head area:

Code:

<!doctype html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>URI test</title>
    <meta charset="utf-8"/>
    [...]
  </head>
  [...]
</html>

while the XML processing instruction needs to be removed. After that, 1 error and 1 warning remain:

Code:

Bad value http://www.hebrewbooks.org/pdfpager.aspx?sits=1&req=36784&st=%u05DE%u05D9%20%u05E9%u05D9%u05E9%20%u05DC%u05D5%20%u05E7%u05E6%u05EA%20%u05E9%u05E8%u05E8%u05D4 for attribute href on element a: Percentage ("%") is not followed by two hexadecimal digits.
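As a rough illustration of what the validator is complaining about, here is a minimal Python sketch that lists the positions of every percent character not followed by two hexadecimal digits. The function name and the shortened sample URL are made up for this example; this is not the validator's actual implementation.

```python
import re
from typing import List

# A "%" is a valid escape only when two hexadecimal digits follow it.
INVALID_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def find_invalid_percents(url: str) -> List[int]:
    """Return the offset of every "%" that does not start a valid escape."""
    return [match.start() for match in INVALID_PERCENT.finditer(url)]


# Shortened, hypothetical URL in the same style as the one in the error message.
sample = "http://example.org/page?st=%u05DE%20%u05D9"
print(find_invalid_percents(sample))
```

The "%20" in the sample is not reported, because it is a regular escape; only the two "%u..." sequences are flagged.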

Evidently, in contrast to your initial post, the URL is much longer than the portion you posted. The error message already points out the problem: URL escaping is done with a percent character followed by two hexadecimal digits. In your URL, the percent character is followed by the character 'u', and it very much looks like this was meant as Unicode character escaping, because 'u' probably stands for Unicode, and hexadecimal 05DE is the code point of HEBREW LETTER MEM (מ). The main question that needs to be answered is whether the referenced website expects the data in this invalid format, and indeed, browsing the invalid URL leads to a different result than a URL without the st parameter in the HTTP GET request. To preserve the percent character while still complying with URL escaping rules, the percent character itself needs to be URL-escaped, because otherwise it would be interpreted as the start marker of a regular escape sequence: just replace every percent character that introduces a %u sequence with its URL-encoded representation %25, so %u05DE becomes %25u05DE (the %20 sequences are already valid and stay as they are). See

http://en.wikipedia.org/wiki/Percent-encoding

for more details on URL escaping. After that, the URL still won't pass HTML5 validation, because you're still missing the XML entity encoding of the ampersand as &amp;, which was present in your initially posted link. After this last adjustment, index.html is valid HTML5, and therefore shouldn't cause any further problems when packaged into an EPUB3. I haven't tried to package the files together and validate the resulting EPUB3, because I guess there will be invalid HTML5 in the other files as well, which you need to fix. If you need a packaging tool for EPUB3, just let me know; I could develop a simple one that would be entirely sufficient for these kinds of files. In any case, I don't know how you got those HTML files, whether you wrote them yourself or obtained them from an application, but whoever or whatever is responsible for producing these files seems not to be overly concerned about web standards, but really should be.
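The two adjustments to the link (escaping the stray percent characters, then encoding the ampersands) can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name href_attribute_value is made up, the sample URL is a shortened form of the one from the validator message, and whether the website accepts the re-escaped form would still have to be tested.

```python
import html
import re

# A "%" that is not followed by two hexadecimal digits is not a valid escape.
STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def href_attribute_value(raw_url: str) -> str:
    """Hypothetical helper: make a raw URL safe to place inside href="..."."""
    # Step 1: escape each stray "%" so that "%u05DE" becomes "%25u05DE";
    # valid escapes such as "%20" are left untouched.
    escaped = STRAY_PERCENT.sub("%25", raw_url)
    # Step 2: encode markup characters for the attribute; "&" becomes "&amp;".
    return html.escape(escaped, quote=True)


# Shortened form of the URL from the validator message.
raw = "http://www.hebrewbooks.org/pdfpager.aspx?sits=1&req=36784&st=%u05DE%u05D9%20%u05E9"
print(href_attribute_value(raw))
```

The order matters only for readability here: the percent step never introduces an ampersand, and the entity step never introduces a percent character, so neither step disturbs the other.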
