Old 11-21-2022, 04:15 PM   #2
KevinH
Sigil Developer
Composed characters (made up of two or more separate Unicode codepoints) are allowed according to the Unicode spec, even when a single-codepoint version exists. Unicode uses normalization to simplify ordering and comparison of these multi-codepoint sequences.

See https://unicode.org/reports/tr15/#Norm_Forms
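To see the canonical-equivalence issue concretely, here is a small Python sketch (the variable names are mine) showing that the precomposed and combining forms of "á" compare unequal as raw strings but identical after NFC normalization:

```python
import unicodedata

precomposed = "\u00e1"       # 'á' as a single codepoint
combining = "\u0061\u0301"   # 'a' followed by a combining acute accent

# Raw string comparison sees two different codepoint sequences.
print(precomposed == combining)   # False

# After NFC normalization the combining sequence collapses to the
# single precomposed codepoint, so the strings compare equal.
print(unicodedata.normalize("NFC", combining) == precomposed)   # True
```

This is exactly why a naive find-and-replace can miss text that looks identical on screen.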

From the HTML spec:
Quote:
One might think that this would be a serious problem. However, most software systems consistently use a single Unicode representation to represent most languages/scripts, even though multiple representations are theoretically possible in Unicode. This form is typically very similar to Unicode Normalization Form C (or "NFC"), in which as many combining marks as possible are combined with base characters to form a single code point (NFC also specifies the order in which combining marks that cannot be combined appear; Unicode normalization forms do not guarantee that there will be no combining marks, as some languages/scripts cannot be encoded at all except via the use of combining characters). As a result, few users encounter issues with Unicode canonical equivalence. A recent survey of the Web concluded that over 99% of all content is in NFC.
The EPUB 3 spec is silent on this topic, except to say that file paths in the OPF and URLs must be NFC-normalized.


So I would consider this a failure of your reading system for not fully supporting Unicode.

That said, one way to deal with this, instead of searching for every combination of composed characters, is to read in each file and NFC-normalize it before writing the file back out.


This is simple in Python and could easily be part of a plugin.

Code:
>>> import unicodedata
>>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
'\xe1'
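As a rough sketch of that normalize-on-write idea (the helper name and in-place rewrite strategy are my own choices here, not Sigil's plugin API):

```python
import unicodedata

def nfc_normalize_file(path):
    # Hypothetical helper: read a UTF-8 text file, NFC-normalize its
    # contents, and rewrite it only if normalization actually changed
    # anything. Returns True if the file was rewritten.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    normalized = unicodedata.normalize("NFC", text)
    if normalized != text:
        with open(path, "w", encoding="utf-8") as f:
            f.write(normalized)
    return normalized != text
```

A real plugin would run this over every XHTML file in the book, after which a plain search for the precomposed characters will match.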
And it is simple to do in C++/Qt as well, via QString's built-in normalization:

Code:
QString normalized(QString::NormalizationForm mode, QChar::UnicodeVersion version = QChar::Unicode_Unassigned) const
And some code editors, such as Emacs, have a way to normalize text files as well.

Last edited by KevinH; 11-21-2022 at 04:55 PM.