MobileRead Forums - View Single Post - Hacks Foreign language support

Thomas Ryan · 03-01-2009, 04:22 AM

Quote:

Originally Posted by igorsk

Kindle does support UTF-8 in its Mobi books and probably in the browser. However, the built-in fonts have glyphs only for Latin characters, so any non-Western languages won't work.

To add some color - UTF-8 can be used to represent any Unicode character (just about any character/glyph you can think of displaying), but as noted, Kindle's fonts generally don't have the glyph data to display anything outside the standard Latin characters. There is no built-in support for Chinese, Korean, or Japanese characters.
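To make the encoding half of that concrete, here is a minimal Python sketch (my own illustration, nothing Kindle-specific; the sample characters and the helper name utf8_hex are just picked for the example). It shows that UTF-8 will happily encode characters from any script. Whether a device can then draw them is a separate question that depends on its fonts.

```python
# Sample characters: Latin, accented Latin, Cyrillic, Chinese, Korean.
samples = ["A", "é", "Ж", "中", "한"]

def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of one character as a hex string."""
    return ch.encode("utf-8").hex()

for ch in samples:
    # Print the Unicode code point next to its UTF-8 byte sequence.
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")
```

Every character encodes without error, using between one and three bytes here. None of that says anything about whether a font has a glyph for it.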

From wikipedia:
Initially Kindle 1 only supported the ISO 8859-1 (Latin 1) character set for its content and Unicode characters and non-western characters were not supported. The firmware update of February 2009 supports additional character sets including ISO 8859-16.

See http://en.wikipedia.org/wiki/UTF-8
and
http://en.wikipedia.org/wiki/ISO_8859-16
to gain a better understanding of terminology and what the characters are.
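For a rough feel of what the two character sets in the Wikipedia quote cover, here is a small Python sketch using the standard codecs. It only tests which characters each character set can encode; it is an assumption on my part that this mirrors what a given Kindle firmware can actually display. The helper name encodable is made up for the example.

```python
def encodable(ch: str, charset: str) -> bool:
    """True if the character exists in the given character set."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# Accented Latin, Romanian s-comma, euro sign, and a Chinese character.
for ch in ["é", "ș", "€", "中"]:
    print(ch, encodable(ch, "latin-1"), encodable(ch, "iso8859_16"))
```

ISO 8859-16 picks up a few characters that Latin 1 lacks (such as ș and €), but both are single-byte Latin sets, so neither has room for Chinese, Korean, or Japanese.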

The other route to display characters is image based. Tear apart The Da Vinci Code, for example. Those funny characters appear in the ebook, but I am not aware of any Unicode encoding for them. They are image based. Thus, in an oblique (brute-force) way, the "font" is embedded in the content.

This technique could be extended to display just about anything on a Kindle screen. One degenerate case would be one image per page: a snapshot of the whole page, in whatever writing system the content uses. This is inefficient and unsearchable, and it allows no annotations within a page, no font size changes, etc., but it is "do-able".
The other would be one image for every character. Same problems.
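The one-image-per-character idea could be sketched roughly like this in Python. This is only an illustration under my own assumptions: the glyphs directory, the u4e2d.png naming scheme, and the function name to_image_markup are all hypothetical, the glyph images would have to be rendered separately, and I have not tested how a Kindle handles markup like this.

```python
from html import escape

def to_image_markup(text: str, glyph_dir: str = "glyphs") -> str:
    """Replace characters outside Latin 1 with <img> tags; keep the rest as text."""
    parts = []
    for ch in text:
        if ord(ch) < 256:
            # Latin 1 range: assume the built-in fonts can draw it.
            parts.append(escape(ch))
        else:
            # Hypothetical file naming: one PNG per code point, e.g. glyphs/u4e2d.png
            parts.append(f'<img src="{glyph_dir}/u{ord(ch):04x}.png"/>')
    return "".join(parts)

print(to_image_markup("Kindle 中文"))
```

The Latin text stays as real, searchable characters, and everything else becomes a string of tiny images, which is exactly the ransom-note effect.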

The analogy for the first is a fax page - you can write whatever you want on a fax page - no encoding, no fonts, no plug-ins, etc. required. The analogy for the second is those weird ransom notes you see in books/movies from the 1980s - letters individually clipped from magazines and strung together.

(We HAve YouR ChiLd)

Next, as noted elsewhere, Topaz is Amazon's proprietary format. If you look long enough and hard enough at Topaz content, you might conclude it is a hybrid between mobi (characters encoded) and image-based data.

The site http://www.latenightcode.com/devblog...tions-part-ii/ seems to confirm this.

To summarize, a clever programmer may find a way to add fonts with glyphs for additional Unicode characters to the Kindle, and it might just work. Or, a clever programmer may create a Topaz content creation/conversion tool to display non-Latin characters, and that just might work too.

(Line layout, e.g. bi-directional text, ligatures, etc., needs to move to another thread.)