06-26-2024, 08:41 AM   #10
KevinH
Sigil Developer
Posts: 8,805
Karma: 6000000
Join Date: Nov 2009
Device: many
Hi Doitsu,

Thanks for all your help on this. I could not even tell which characters were changed without an object dump.

You are correct, nothing is lost in either representation in Sigil 2.2.x.

I just need to figure out how to 100% intercept text pasting into the xhtml file and into any replace fields, or force NFC when the find and replace buttons are pressed.
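
A rough sketch of the second option (forcing NFC when a search is run), written in Python rather than Sigil's C++/Qt just to show the idea. The names `to_nfc` and `find_all` are made up for illustration and are not Sigil code; the only real API used is `unicodedata.normalize` from the Python standard library.

```python
import unicodedata


def to_nfc(text):
    """Return text in NFC. Hypothetical hook for pasted text and find/replace fields."""
    return unicodedata.normalize("NFC", text)


def find_all(haystack, needle):
    """Return every match offset after normalizing both strings to NFC.

    Offsets refer to the NFC form of haystack, not to the original string.
    """
    haystack = to_nfc(haystack)
    needle = to_nfc(needle)
    hits = []
    if not needle:
        return hits
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits


# NOON + FATHA + SHADDA in the document, NOON + SHADDA + FATHA in the search field.
in_document = "\u0646\u064E\u0651"
in_search_field = "\u0646\u0651\u064E"
matches = find_all(in_document, in_search_field)
```

Without the normalization step, a plain substring search for `in_search_field` inside `in_document` would find nothing, even though both strings render the same.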

That really only leaves keyboard input methods that might generate decomposed forms to worry about, but as you said, it should not be a big deal.

Thank you!

KevinH

Quote:
Originally Posted by Doitsu
That difference is caused by Arabic letters with two combining diacritics. In this case:
U+0651 ّ d9 91 ّ ّ ARABIC SHADDA
U+064E َ d9 8e َ َ ARABIC FATHA

The first difference in the sample paragraph is in this word:

لِأَنَّهُ [li'annahu]

In terms of rendering, the actual order of these two diacritics makes no difference whatsoever.

NFD: لِأَنَّهُ
NFC: لِأَنَّهُ


(I also attached the strings as a plain-text file.)

FYI: here are the specs: UNICODE ARABIC MARK ORDERING ALGORITHM.

In short, it says that if an Arabic letter is combined with both ARABIC SHADDA and another diacritic, e.g. ARABIC FATHA/KASRA/DAMMA, ARABIC SHADDA should be saved last, because it has a higher canonical combining class (33, versus 30 for FATHA). This explains the differences that you found.
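
To see the reordering for yourself, here is a small Python sketch using the standard `unicodedata` module. The helper name `combining_classes` and the sample string are my own; the sample is just NOON with the two marks in the order a keyboard might produce them.

```python
import unicodedata

NOON = "\u0646"    # ARABIC LETTER NOON
SHADDA = "\u0651"  # ARABIC SHADDA
FATHA = "\u064E"   # ARABIC FATHA


def combining_classes(text):
    """List (code point, canonical combining class) for each character."""
    return [("U+%04X" % ord(ch), unicodedata.combining(ch)) for ch in text]


# Marks in typed order: SHADDA first, then FATHA.
typed = NOON + SHADDA + FATHA

# Canonical ordering sorts adjacent marks by ascending combining class,
# so FATHA moves in front of SHADDA. NFC and NFD agree for this string
# because there is no precomposed form to compose into.
nfc = unicodedata.normalize("NFC", typed)
nfd = unicodedata.normalize("NFD", typed)

print(combining_classes(typed))
print(combining_classes(nfc))
```

The second printed list shows the marks sorted by class, which is the byte-level difference that only shows up in an object dump.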

In real life it doesn't make any difference, because the strings are usually rendered exactly the same. Moreover, Arabic with multiple diacritics is primarily used in religious texts and some textbooks. In mass media, diacritics are rarely used and mostly only for disambiguation. I.e., this is mostly a cosmetic issue.
Since diacritics are also somewhat difficult to enter, some apps that support Arabic text allow users to search for strings with diacritics as if they didn't have any. I.e., the user could search for لأنه or لانه and the app would also find لِأَنَّهُ.

However, I don't know any EPUB app that has this option.
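
For what it's worth, a diacritic-insensitive search like that can be approximated in a few lines. This is only a sketch under my own assumptions: decompose to NFD, drop every nonspacing mark (general category `Mn`), and compare what is left. The helper names `strip_marks` and `loose_contains` are hypothetical and not taken from any app.

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFD and drop all nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def loose_contains(text, query):
    """True if query occurs in text once diacritics are ignored on both sides."""
    return strip_marks(query) in strip_marks(text)


# The fully vocalized word from the sample: li'annahu
vocalized = "\u0644\u0650\u0623\u064E\u0646\u0651\u064E\u0647\u064F"
with_hamza = "\u0644\u0623\u0646\u0647"  # no vowel marks, ALEF WITH HAMZA ABOVE
bare = "\u0644\u0627\u0646\u0647"        # no vowel marks, plain ALEF

found_with_hamza = loose_contains(vocalized, with_hamza)
found_bare = loose_contains(vocalized, bare)
```

Because ALEF WITH HAMZA ABOVE decomposes into ALEF plus a combining hamza, this approach also folds the hamza away, which is why both spellings match. A real implementation would probably want to be pickier about which marks it ignores.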



IMHO, having a dedicated NFC/NFD option would be overkill, since most Sigil users probably wouldn't know the difference between NFC and NFD anyway and there's no information loss and no rendering issues.



I totally agree with Kovid on this. Force-converting NFD to NFC is the easier solution.