Old 03-28-2011, 09:59 PM   #14
oneillpt
Quote:
Originally Posted by oneillpt
...

although the Page Info for this page shown by FireFox using the right-click context menu is UTF-8. When I follow links I do find that the Page Info indicates ISO-8859-1, although the actual source contains no encoding declaration.

I'm more inclined to trust the explicit declaration as ISO-8859-15 in the initial page, and to assume that subsequent pages will have been produced using the same encoding. The encoding reported in the headers will depend on the server configuration, and may or may not be reliable.
A few further comments on the choice of encoding. Although I said above that I'm more inclined to trust the explicit encoding declaration than the HTTP headers, it is not unusual to find that the explicit declaration is also wrong, as in the examples further below. For www.inter.it I would suspect that ISO-8859-15 is more likely than ISO-8859-1, but Windows-1252 might in fact be the true encoding, even though it is neither explicitly declared nor returned by the HTTP headers, and it can be neither confirmed nor ruled out on the basis of the web content I have seen so far.

First, why do I suggest ISO-8859-15 is more likely than ISO-8859-1? Quite simply, for countries within the Euro zone, such as Italy, the Euro symbol is likely to occur in news text. The Euro sign and a few characters used in Finnish and French are missing from ISO-8859-1. ISO-8859-15 updates ISO-8859-1 by introducing these characters, replacing a few infrequently used characters in ISO-8859-1. So any Finnish or French text which may require the characters missing from ISO-8859-1, or text using the Euro symbol, is likely to be ISO-8859-15, Windows-1252 or UTF-8, not ISO-8859-1, even if it explicitly claims to be ISO-8859-1 or arrives with HTTP headers claiming ISO-8859-1. In fact, if it contains the Euro symbol as a raw byte rather than as an HTML entity, it cannot be ISO-8859-1.
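The difference between the two ISO encodings can be listed directly with Python's built-in codecs. This is only a sketch to illustrate the point above; the function names `differing_bytes` and `can_encode` are my own, not part of Calibre or any library.

```python
def differing_bytes():
    """Return (byte, ISO-8859-1 char, ISO-8859-15 char) for every byte
    value that the two encodings decode to different characters."""
    diffs = []
    for b in range(256):
        raw = bytes([b])
        old = raw.decode("iso-8859-1")
        new = raw.decode("iso-8859-15")
        if old != new:
            diffs.append((b, old, new))
    return diffs


def can_encode(text, encoding):
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for b, old, new in differing_bytes():
    print(f"0x{b:02X}: ISO-8859-1 {old!r} becomes ISO-8859-15 {new!r}")

print("Euro in ISO-8859-1? ", can_encode("\u20ac", "iso-8859-1"))
print("Euro in ISO-8859-15?", can_encode("\u20ac", "iso-8859-15"))
```

Only eight byte values differ, which is why text in one encoding mostly looks right when read as the other, and why the error is easy to miss until a Euro sign or an "œ" turns up.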

Then why Windows-1252? Text encoded as Windows-1252 is often wrongly described as ISO-8859-1. Windows-1252 is a superset of ISO-8859-1 which includes the additional characters introduced in ISO-8859-15, but mapped differently: they are assigned to byte codes in the range 0x80 to 0x9F, which hold no printable characters in ISO-8859-1, rather than replacing infrequently used characters in ISO-8859-1. For example, the Euro symbol is mapped to 0x80 in Windows-1252, and to 0xA4 in ISO-8859-15, where it causes an infrequently used character (the generic currency sign) to be dropped. A hex editor can be used to check whether the encoding is Windows-1252 or ISO-8859-15 if any of these additional characters is present. For example, if the Euro symbol is found to be represented by 0x80, then the encoding is Windows-1252; if it is represented by 0xA4, it is ISO-8859-15; and if it is not represented by either of these codes, it is likely to be UTF-8, where it is represented by three bytes, 0xE2 0x82 0xAC. The four encodings discussed here are the most likely candidates for "latin" text, although for example Hungarian, Irish and Welsh may require UTF-8 or a different encoding for a full character set.
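The hex editor check can also be scripted. The following is a rough sketch, assuming you already know the page contains a Euro sign somewhere; `euro_encoding_hint` is a hypothetical helper name of mine, and the result is only a hint, since 0xA4 is also a legal (if rare) character in ISO-8859-1.

```python
def euro_encoding_hint(data: bytes) -> str:
    """Guess the encoding of data from the way the Euro sign is stored.

    Assumes the text is known to contain a Euro sign. Returns a hint,
    not a proof.
    """
    try:
        text = data.decode("utf-8")  # strict: fails on stray high bytes
    except UnicodeDecodeError:
        text = None

    if text is not None:
        # Valid UTF-8: the Euro sign would be the bytes E2 82 AC.
        return "utf-8" if "\u20ac" in text else "undetermined"
    if b"\x80" in data:
        return "windows-1252"  # Euro at 0x80
    if b"\xa4" in data:
        return "iso-8859-15"   # Euro at 0xA4
    return "undetermined"


for enc in ("utf-8", "windows-1252", "iso-8859-15"):
    sample = "10 \u20ac".encode(enc)
    print(enc, sample.hex(" "), "->", euro_encoding_hint(sample))
```

To try it on a real page, save the raw bytes (not a copy and paste from the browser, which will already have been decoded) and pass them to the function.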

Finally, two recent examples where I have had to use an encoding in the Calibre recipe which is different from the explicit encoding in the HTML input (and in the first case from the encoding returned in the HTTP headers):

(1) http://www.ladepeche.fr

http://www.ladepeche.fr/article/2011...o-du-c-ur.html

The HTML source explicitly claims ISO-8859-1, and the HTTP headers also claim ISO-8859-1. The text however contains "œ" [o and e run together as one character, in case this character does not display correctly in your browser]. This character is not included in ISO-8859-1. The actual encoding is Windows-1252, where the character is encoded as 0x9C. The Calibre recipe needs " encoding = 'Windows-1252' " in order to get the correct character displayed in the e-book.
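The effect of the wrong choice can be reproduced with a few lines of plain Python. This is just an illustration: the sample bytes are made up by me, not copied from the article.

```python
import unicodedata

# Hypothetical sample: "c", the byte 0x9C, then "ur".
raw = b"c\x9cur"

as_cp1252 = raw.decode("windows-1252")  # 0x9C is the oe ligature here
as_latin1 = raw.decode("iso-8859-1")    # 0x9C is an invisible control code here

print(repr(as_cp1252))
print(repr(as_latin1))
print(unicodedata.category(as_latin1[1]))  # "Cc" means control character
```

Decoding as ISO-8859-1 does not raise an error, it just silently produces a control character, so the only visible symptom is a missing or garbled letter in the e-book.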

(2) http://www.independent.ie

http://www.independent.ie/national-n...e-2595418.html

The HTML source explicitly claims "charset=utf-8". The HTTP headers shown in Firefox indicate ISO-8859-1. The Calibre built-in recipe does not specify an encoding, so UTF-8 will be used. In this case, although the Euro symbol appears in the text, it causes no encoding problem because it appears as an HTML entity, not as a single byte code. The pound (£) sign however also appears in the text, and as the single byte ISO-8859-1 code (0xA3). That single byte is not the UTF-8 encoding of the pound sign, and on its own it is an invalid UTF-8 byte sequence, so Calibre follows the rules for a UTF-8 decoder and replaces the pound sign by the "replacement character" � [white question mark on black diamond background if your browser does not display it correctly] to indicate the invalid UTF-8 byte encountered. " encoding = 'ISO-8859-1' " is needed in the recipe to obtain the correct display in the generated e-book. (In fact, since the Euro symbol appears only as an HTML entity, it is possible that the encoding should be either Windows-1252 or ISO-8859-15. If at some point the Euro symbol appears as a single byte code rather than an HTML entity, it may become necessary to specify one of these two encodings instead of ISO-8859-1, depending on which single byte encoding is used.)
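The pound sign problem can be reproduced with Python's own decoder. This is a sketch of the behaviour described above, not of Calibre's internals, which I have not checked; the sample text is invented.

```python
# Invented sample text, stored the way the page stores it: as ISO-8859-1.
raw = "Price: \u00a35".encode("iso-8859-1")  # pound sign becomes the byte 0xA3

# Read as UTF-8, the lone 0xA3 byte is invalid and gets replaced by U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Read as ISO-8859-1, the pound sign comes back intact.
as_latin1 = raw.decode("iso-8859-1")

print(raw)
print(repr(as_utf8))
print(repr(as_latin1))

# A strict UTF-8 decode refuses the byte outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print("strict UTF-8 decode succeeded:", strict_ok)
```

An HTML entity such as the one used for the Euro sign on that page is plain ASCII, so it survives whichever of these decoders is used, which is why only the pound sign is damaged.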

Both the example URLs just given above are still live. When they disappear it will become necessary to browse for other pages at the two newspapers to find similar examples.