Quote:
Originally Posted by Toxaris
It has nothing to do with IBM, but with internationalization. More and more websites and applications (don't even mention databases) need to be able to handle multiple languages and characters. The old ISO pages cannot handle that due to restrictions. They invented all kind of tricks in the beginning, creating nightmares (I have seen enough).
This. ISO-8859-1 is one of the old ISO-8859 standards for character encoding. In those standards, each character was exactly one byte long, which meant that an encoding could represent only 256 possible character values. Of those, the ASCII range (English letters, digits, punctuation, and control characters) took up the first half, leaving 128 slots. The ISO-8859 family reserves 0x80 through 0x9F for additional control codes, so only 96 of those slots were actually available for printable, language-specific characters.
Because of that limitation, different language families required different encodings. You couldn't, for example, have French text and Greek text in the same document. Also, you couldn't support languages like Chinese at all, because there are many thousands of Chinese characters in common use.
Worse, you couldn't determine just by looking at a file whether it was ISO-8859-1, ISO-8859-2, ISO-8859-16, etc., so you had to specify the character set externally in some way, or else French could turn into a gibberish Greek-English hybrid. In the case of HTML, pages declared the character set in a meta tag, which meant that the browser would read the file up to that point, say "Oh, crap, I'm using the wrong encoding", then reread the entire file using the right encoding. It was an absolute mess, and I'm being generous here.
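To make the "gibberish hybrid" problem concrete, here's a minimal Python sketch. It assumes nothing beyond the standard library codecs; the sample word and variable names are just mine for illustration. The same four bytes are decoded under three different ISO-8859 parts, and nothing in the bytes themselves says which one is right.

```python
# One French word, stored as ISO-8859-1 (Latin-1) bytes.
raw = "caf\u00e9".encode("iso-8859-1")   # b'caf\xe9'

# Decode the exact same bytes under three different ISO-8859 parts.
as_latin1 = raw.decode("iso-8859-1")     # Western European: 0xE9 is e-acute
as_greek = raw.decode("iso-8859-7")      # Greek: 0xE9 is small iota
as_cyrillic = raw.decode("iso-8859-5")   # Cyrillic: 0xE9 is small shcha

for label, text in [("latin1", as_latin1), ("greek", as_greek), ("cyrillic", as_cyrillic)]:
    print(label, text)

# Going the other way: Greek letters simply have no slot in ISO-8859-1.
try:
    "\u03b1\u03b2\u03b3".encode("iso-8859-1")
    greek_fits_in_latin1 = True
except UnicodeEncodeError:
    greek_fits_in_latin1 = False
```

All three decodes succeed without any error, which is exactly the problem: the decoder has no way to know it picked the wrong table.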
The Unicode standard fixed all of this by supporting over a million possible code points (1,114,112, to be exact). As a result, you can use a single character set to represent text in virtually every written language on Earth. With Unicode, you can have Chinese text, English text, French text, and Russian text on a single web page.
UTF-8 is the most common way to encode Unicode text, because English characters, numbers, and punctuation are encoded using a single byte, making English UTF-8 content fully backwards-compatible with software that supports traditional 7-bit ASCII (and ISO-8859-*).
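A quick Python sketch of that property, using only the built-in codecs (the sample characters and variable names are my own picks for illustration): ASCII characters take one byte in UTF-8, everything else takes two to four, and pure-ASCII text comes out byte-for-byte identical whether you encode it as ASCII, ISO-8859-1, or UTF-8.

```python
# Bytes needed per character in UTF-8.
samples = ["A", "\u00e9", "\u03a9", "\u4e2d", "\U0001F600"]  # A, e-acute, Omega, CJK "middle", emoji
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Pure-ASCII text is identical under all three encodings.
ascii_text = "Hello, world! 123"
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
same_as_latin1 = ascii_text.encode("utf-8") == ascii_text.encode("iso-8859-1")

# One string can mix scripts freely and still round-trip through UTF-8.
mixed = "English, fran\u00e7ais, \u0440\u0443\u0441\u0441\u043a\u0438\u0439, \u4e2d\u6587"
round_trip_ok = mixed.encode("utf-8").decode("utf-8") == mixed

print(utf8_lengths, same_as_ascii, same_as_latin1, round_trip_ok)
```

Note the compatibility only goes one way for non-ASCII content: as soon as the text contains something like an accented letter, the UTF-8 bytes differ from the ISO-8859-1 bytes.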