Old 11-21-2022, 04:15 PM   #2
KevinH
Sigil Developer
Composed characters (made up of two or more separate Unicode codepoints) are allowed according to the Unicode spec, even when a single-codepoint version exists. Unicode uses normalization to simplify ordering and comparison of these multi-codepoint sequences.

See https://unicode.org/reports/tr15/#Norm_Forms
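To see the canonical-equivalence issue concretely, here is a small Python sketch (the variable names are mine) showing that the precomposed and combining forms of "á" compare unequal as raw strings but identical after NFC normalization:

```python
import unicodedata

precomposed = "\u00e1"       # 'á' as a single codepoint
combining = "\u0061\u0301"   # 'a' followed by a combining acute accent

# Raw string comparison sees two different codepoint sequences.
print(precomposed == combining)   # False

# After NFC normalization the combining sequence collapses to the
# single precomposed codepoint, so the strings compare equal.
print(unicodedata.normalize("NFC", combining) == precomposed)   # True
```

This is exactly why a naive find-and-replace can miss text that looks identical on screen.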

From the HTML spec:
Quote:
One might think that this would be a serious problem. However, most software systems consistently use a single Unicode representation to represent most languages/scripts, even though multiple representations are theoretically possible in Unicode. This form is typically very similar to Unicode Normalization Form C (or "NFC"), in which as many combining marks as possible are combined with base characters to form a single code point (NFC also specifies the order in which combining marks that cannot be combined appear; Unicode normalization forms do not guarantee that there will be no combining marks, as some languages/scripts cannot be encoded at all except via the use of combining characters). As a result, few users encounter issues with Unicode canonical equivalence. A recent survey of the Web concluded that over 99% of all content is in NFC.
The EPUB 3 spec is silent on this topic, except to say that file paths in the OPF and URLs must be NFC-normalized.


So I would consider this a failure of your reading system for not fully supporting Unicode.

That said, one way to deal with this, instead of searching for every combination of composed characters, is to read in each file and NFC-normalize it before writing the file back out.


This is simple in Python and could easily be part of a plugin.

Code:
>>> import unicodedata
>>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
'\xe1'
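As a rough sketch of that normalize-on-write idea (the helper name and in-place rewrite strategy are my own choices here, not Sigil's plugin API):

```python
import unicodedata

def nfc_normalize_file(path):
    # Hypothetical helper: read a UTF-8 text file, NFC-normalize its
    # contents, and rewrite it only if normalization actually changed
    # anything. Returns True if the file was rewritten.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    normalized = unicodedata.normalize("NFC", text)
    if normalized != text:
        with open(path, "w", encoding="utf-8") as f:
            f.write(normalized)
    return normalized != text
```

A real plugin would run this over every XHTML file in the book, after which a plain search for the precomposed characters will match.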
And it is simple to do in C++/Qt as well, via QString's built-in normalization:

Code:
QString normalized(QString::NormalizationForm mode, QChar::UnicodeVersion version = QChar::Unicode_Unassigned) const
And some code editors, such as Emacs, have a way to normalize text files as well.

Last edited by KevinH; 11-21-2022 at 04:55 PM.