Kevin,
Thank you very much for taking the time to be so helpful, and for really trying to consider all viewpoints and think the issue through thoroughly. That is quite rare.
My issue is with characters that carry multiple diacritics. I attached a file in my previous message. One character in question appears in a recent EPUB that uses the pinyin term nǚ.
As far as I can tell, after converting to NFC there is no way to get back to the character as it was entered. Some people are concerned with data preservation and want the original representation kept. As far as I can tell, after conversion to NFC there is also no way to search for that character as entered by typing it in again. The same would likely be true in many, perhaps all, other editors, current or future.
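To illustrate what I mean, here is a minimal Python sketch (the three input forms are my assumed example of how nǚ might be typed; I'm not claiming this is how any particular editor handles it):

Code:
import unicodedata

# Three possible ways the syllable nǚ might be entered (assumed example):
precomposed = "n\u01DA"          # n + U+01DA (single precomposed code point)
partial     = "n\u00FC\u030C"    # n + U+00FC (ü) + U+030C COMBINING CARON
decomposed  = "nu\u0308\u030C"   # n + u + U+0308 COMBINING DIAERESIS + U+030C COMBINING CARON

for s in (precomposed, partial, decomposed):
    nfc = unicodedata.normalize("NFC", s)
    print([f"U+{ord(c):04X}" for c in s], "->", [f"U+{ord(c):04X}" for c in nfc])

# All three print the same NFC result (n, U+01DA), so after conversion the
# file no longer records which form was typed, and typing one of the other
# forms into a search box will not match unless the search also normalizes.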
I am aware of the BBEdit command. There is also one to decompose Unicode, and an expert preference to precompose Unicode when pasting. Yet as far as I can tell, none of them help with this specific issue.
Not everyone is going to use Sigil exclusively for their editing forever. Sigil might not be around forever, and people may use other editors; I use BBEdit.
Looking around, there seems to be a tendency not to convert text automatically. From what I recall, JetBrains at one point automatically converted text to NFC; a user complained through a bug report, others shared the same sentiment, and they backed the change out.
Trying various AI services such as Perplexity, I found a number of claimed issues, many of which I'm not sure apply:
Quote:
normalizing ≯ [U+003E GREATER-THAN SIGN + U+0338 COMBINING LONG SOLIDUS OVERLAY] to ≯ [U+226F NOT GREATER-THAN] can corrupt XML code
Quote:
some people use compatibility characters in their content without realizing it, like ¼ [U+00BC VULGAR FRACTION ONE QUARTER] or № [U+2116 NUMERO SIGN]. Normalizing this content may affect the look or readability
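For what it is worth, the first claim is easy to reproduce in Python (a small sketch, not tied to any particular editor), and as far as I can tell the ¼ and № changes only happen under the compatibility forms (NFKC/NFKD), not plain NFC:

Code:
import unicodedata

# '>' (U+003E) followed by U+0338 COMBINING LONG SOLIDUS OVERLAY composes under NFC:
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", "\u003E\u0338")])
# -> ['U+226F']  NOT GREATER-THAN; the literal '>' is gone, which is why
# normalizing already-serialized markup as raw text can corrupt it.

# Compatibility characters are left alone by NFC but rewritten by NFKC:
print(unicodedata.normalize("NFC",  "\u00BC \u2116"))   # ¼ № (unchanged)
print(unicodedata.normalize("NFKC", "\u00BC \u2116"))   # 1⁄4 No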
From a W3C document, Normalization in HTML and CSS:
Quote:
there may sometimes be good reasons to mix normalized forms.
Quote:
You should also try to avoid automatically converting content from one normalization form to another, as it may obliterate some important code point distinctions, such as in the carefully crafted examples of vilįg above, or in filenames or URLs, or text included in the page from elsewhere, etc.
The document also shows a screenshot of Dreamweaver, which had a setting to select none, NFC, or NFD.
https://www.w3.org/International/que...-normalization
Whether this applies or not, I'm not sure, but among other issues found, perhaps some people would create EPUBs containing such content:
Quote:
Some types of linguistic analysis or text mining projects may rely on or benefit from the distinctions that normalization would eliminate. The original representation might carry valuable information for these specialized use cases.
Quote:
Not normalizing allows the text to remain in its original form as entered by users. This could potentially be beneficial in some specialized linguistic or cultural contexts where the specific Unicode representation carries meaning.
Quote:
Some historical texts or languages use variant forms of characters that cannot be accurately handled by Unicode normalization. This can result in loss of nuance or meaning when normalizing text.
Quoted from a Node.js document:
Quote:
What happens when the Unicode standard advances to include a slightly different normalization algorithm (as has happened in the past)?
There was also this W3C document, Unicode in XML and other Markup Languages, which seems to have been withdrawn yet may still provide useful information about possible issues.
https://www.w3.org/TR/unicode-xml/
A document I found, though I haven't read it thoroughly yet, seems like it might be of use: Unicode® Standard Annex #15, Unicode Normalization Forms:
https://unicode.org/reports/tr15/
I think it is a good idea to keep investigating this issue thoroughly, and not to go just by the recommendations of a few users. As Sigil has a much smaller user base than other editors, it may be more difficult to get good feedback about all the possible issues.
There was also some mention of reading systems and diacritic-insensitive search. At least several readers seem to handle that fine, though it appears to take more work, and on newer, faster devices it may be less of an issue.
It seems that readers, and perhaps other systems, normalize text before searching. Maybe that is the better approach: leave the source text as is, and normalize both it and the search string only for the search operation.
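A rough Python sketch of that idea (assuming a plain substring search; real editor or reader internals would of course differ), including a diacritic-insensitive variant like the reading-system behaviour mentioned above:

Code:
import unicodedata

def found(text, query, form="NFC"):
    # Normalize throwaway copies of both sides for matching; the stored text is untouched.
    return unicodedata.normalize(form, query) in unicodedata.normalize(form, text)

def found_ignoring_diacritics(text, query):
    # Decompose, drop combining marks (category Mn), and case-fold before comparing.
    strip = lambda s: "".join(c for c in unicodedata.normalize("NFD", s)
                              if unicodedata.category(c) != "Mn")
    return strip(query).casefold() in strip(text).casefold()

source = "pinyin: nu\u0308\u030C"               # decomposed nǚ as it might sit in the file
print(found(source, "n\u01DA"))                 # True: a precomposed query still matches
print(found_ignoring_diacritics(source, "nu"))  # True: matches with the marks ignored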
At one point it seemed you thought it might be better to leave the text alone. I strongly think it should remain that way, with commands, as you suggested, to convert to NFC or NFD, plus possibly preferences for them. Any changes you have made to support NFC exclusively would, I think, be best left behind a preference, maybe on by default or maybe not; I'm not sure.