MobileRead Forums - View Single Post - UTF-16 forced to UTF-8

KevinH · 09-22-2015, 08:20 AM

Hi,

Sigil can read UTF-16 files (or almost any properly encoded text file) but will convert them to the de facto standard of UTF-8 when saving.

In short, UCS-2/UTF-16 became popular when people thought 16-bit numbers (65K chars) would be enough. It now takes 21 bits to fully encode all possible Unicode characters. So to be clear, UTF-16 requires multi-unit sequences (surrogate pairs) to represent all characters in Unicode, just as UTF-8 requires multi-byte sequences. The advantage of UTF-8 over UTF-16 is that it is not endian dependent (i.e. it is a stream of bytes), whereas UTF-16 values must be stored in either big-endian or little-endian byte order. Hence UTF-16's need for an extra BOM to tell the reader which order the two bytes that make up each value are in. Some CPUs (though few today) are natively big-endian (namely SPARC and PPC) while others are natively little-endian (Intel's x86 family); in addition, there is network endianness (generally big) to deal with when streaming files and/or serializing objects. The problem is that there are many UCS-2 routines masquerading as UTF-16 routines that will break things, but only for specific ranges of Unicode characters (those outside the first 65K). In addition, most of the early e-readers would not work with UTF-16.
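To make the surrogate-pair and byte-order points concrete, here is a small Python sketch. The emoji is just an arbitrary example of a code point above U+FFFF, and `naive_ucs2_length` is a hypothetical name for the kind of UCS-2-style routine that miscounts such characters.

```python
# One code point outside the first 65K (U+1F600), chosen only for illustration.
ch = "\U0001F600"

utf8 = ch.encode("utf-8")          # a plain byte stream, no byte order involved
utf16_be = ch.encode("utf-16-be")  # surrogate pair, big-endian
utf16_le = ch.encode("utf-16-le")  # same surrogate pair, bytes swapped

print(utf8.hex(" "))      # f0 9f 98 80
print(utf16_be.hex(" "))  # d8 3d de 00
print(utf16_le.hex(" "))  # 3d d8 00 de

# Hypothetical UCS-2-style routine: it assumes every 16-bit unit is one character.
def naive_ucs2_length(utf16_bytes: bytes) -> int:
    return len(utf16_bytes) // 2

print(len(ch))                      # 1 (actual number of characters)
print(naive_ucs2_length(utf16_le))  # 2 (the surrogate pair is counted twice)
```

Both encodings need four bytes for this character, but only the UTF-16 bytes change with byte order, which is why a BOM is needed there.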

Because UTF-8 can represent all Unicode chars, needs no BOM (i.e. is endian independent), generally results in smaller file sizes (as Doitsu rightly points out) except for some CJK uses, and works in all e-readers, it has become the de facto standard. And one that Sigil will not deviate from.

If you want UTF-16, simply create an output plugin for Sigil. Given Python's strong encoding capabilities, it can probably be done in just a handful of lines. The plugin simply needs to iterate over the application/xhtml+xml files, read each one in as UTF-8 text, replace utf-8 with utf-16 in the XML header, then encode the text as UTF-16, add the right BOM, and write it out as binary data.
Perfect use for a plugin.
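As a rough sketch of the per-file conversion step only (not a complete Sigil plugin; the plugin wiring that iterates over the files is left out), something like the following should do. The function name `xhtml_utf8_to_utf16` and the choice of little-endian output are my own assumptions for illustration.

```python
import codecs
import re

# Matches the encoding attribute of the XML declaration, e.g.
# <?xml version="1.0" encoding="utf-8"?>
_DECL_ENCODING = re.compile(
    r'(<\?xml[^>]*?encoding\s*=\s*["\'])utf-8(["\'])', re.IGNORECASE
)

def xhtml_utf8_to_utf16(data: bytes) -> bytes:
    """Hypothetical helper: convert one UTF-8 XHTML file to UTF-16 with a BOM."""
    # "utf-8-sig" also accepts input that starts with an optional UTF-8 BOM.
    text = data.decode("utf-8-sig")
    # Replace utf-8 with utf-16 in the XML header (first match only).
    text = _DECL_ENCODING.sub(r"\1utf-16\2", text, count=1)
    # Assumption: little-endian output; swap both names for BE if preferred.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

sample = b'<?xml version="1.0" encoding="utf-8"?>\n<html/>'
converted = xhtml_utf8_to_utf16(sample)
print(converted[:2].hex(" "))  # ff fe
```

The result is binary data, so it has to be written with the file opened in "wb" mode rather than as text.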

KevinH
