MobileRead Forums - View Single Post - ALERT: Potential Issues with Sigil 2.2.X and rtl languages and Normalization Forms

Doitsu · 06-25-2024, 06:23 PM

I'm by no means a Unicode expert, but, AFAIK, most Windows apps do not perform Unicode normalizations when opening or saving files.
For example, when I entered the string Häagen-Daz in BabelPad (freeware) and saved it as a UTF-8 plain-text file with a BOM, it was shown in a hex editor as: EF BB BF 48 C3 A4 61 67 65 6E 2D 44 61 7A
As expected, the ä umlaut was encoded as C3 A4.
When I entered the same string in LibreOffice and saved it as a UTF-8 plain-text file with a BOM, I got the exact same result (+ 0D 0A at the end).
I'm not aware of any Windows app that has normalization options for accented characters.
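For anyone who wants to check which form a given string is in, here is a minimal Python sketch using only the standard unicodedata module. It is just an illustration of NFC vs. NFD for the same test string, not something Sigil or the apps above are known to do; the variable names are mine.

```python
import unicodedata

# The test string from above, written with an escape so the source
# file's own normalization form cannot affect the result.
text = "H\u00e4agen-Daz"

nfc = unicodedata.normalize("NFC", text)  # ä as one code point, U+00E4
nfd = unicodedata.normalize("NFD", text)  # a followed by U+0308 COMBINING DIAERESIS

# bytes.hex(sep) needs Python 3.8 or later.
nfc_hex = nfc.encode("utf-8").hex(" ").upper()
nfd_hex = nfd.encode("utf-8").hex(" ").upper()

print(nfc_hex)  # 48 C3 A4 61 67 65 6E 2D 44 61 7A
print(nfd_hex)  # 48 61 CC 88 61 67 65 6E 2D 44 61 7A
```

The NFC bytes match what BabelPad and LibreOffice wrote (minus the BOM). A file in NFD would show 61 CC 88 where the NFC file shows C3 A4.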

Based on my experience, Arabic letters are usually saved as letters from the 0600–06FF range (Arabic), even though many of the letters are actually rendered using glyphs from the FE70–FEFF range (Arabic Presentation Forms-B).
Take, for example, the Arabic word من (min = from). It consists of a small circle on the right, which is the initial form of the letter MEEM, and a semi-circle with a dot above it, which is the final form of the letter NOON.

When saved as a UTF-8 file, it was saved as D9 85 D9 86

MEEM U+0645 م d9 85
NOON U+0646 ن d9 86

Quote:

Originally Posted by KevinH

I do not even know how to input from a keyboard to produce Hebrew or Arabic so I am at a loss. I can only cut and paste from somewhere else but who knows if the text copied was generated in NFD or NFC form or unknown form.

There's an Arabic website that allows you to enter Arabic words phonetically (using Latin characters): Yamli.
Please visit it, enter the sample word marhaban = welcome, and click on the first suggestion (مرحباً). Then copy it to a Mac editor and save it.
On my machine it was saved as: D9 85 D8 B1 D8 AD D8 A8 D8 A7 D9 8B
I.e. the codes for MEEM, REH, HAH, BEH, ALEF, FATHATAN.
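If no hex editor is at hand, the saved bytes can also be decoded and named with a few lines of Python. This is only a sketch: the byte string is the one from my machine, and the variable names are made up for the example.

```python
import unicodedata

# Bytes as saved on my machine for the Yamli suggestion.
raw = bytes.fromhex("D9 85 D8 B1 D8 AD D8 A8 D8 A7 D9 8B")
word = raw.decode("utf-8")

# Official Unicode character names, one per code point.
names = [unicodedata.name(ch) for ch in word]

for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X} {name}")

# True if every code point is in the basic Arabic block (0600-06FF).
all_basic_arabic = all(0x0600 <= ord(ch) <= 0x06FF for ch in word)
print(all_basic_arabic)
```

If the copy saved on the Mac prints different code points, the difference was introduced somewhere between the browser and the editor.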

I don't know what problems the Mac user reported, but, IIRC, very old versions of InDesign and other DTP apps came with RTL plugins that replaced characters from the 0600–06FF range (Arabic) with presentation forms from the FE70–FEFF range (Arabic Presentation Forms-B).
Visually, the words would look exactly the same. Take again the word من: if you encode it using characters from the FE70–FEFF range, it would look like this: ﻣﻦ
You could only tell the difference if you saved the string and examined it with a hex editor.

INITIAL MEEM U+FEE3 ﻣ ef bb a3
FINAL NOON U+FEE6 ﻦ ef bb a6
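Presentation forms can also be detected and folded back to the plain letters programmatically. The sketch below uses Python's standard unicodedata module; uses_presentation_forms is a hypothetical helper written for this example, and it only checks the Forms-B range discussed here. Note that NFC leaves presentation forms alone; only the compatibility forms (NFKC/NFKD) map them back to the 0600–06FF letters.

```python
import unicodedata

presentation = "\uFEE3\uFEE6"  # INITIAL MEEM + FINAL NOON (presentation forms)
plain = "\u0645\u0646"         # MEEM + NOON (basic Arabic block)

def uses_presentation_forms(s):
    """Hypothetical helper: True if s contains Arabic Presentation Forms-B.

    The block spans FE70-FEFF, but the check stops at FEFC so that
    U+FEFF (the byte order mark) is not reported as an Arabic glyph.
    """
    return any(0xFE70 <= ord(ch) <= 0xFEFC for ch in s)

# NFKC applies compatibility mappings, which include the presentation forms.
folded = unicodedata.normalize("NFKC", presentation)
# NFC applies canonical mappings only, so the string is unchanged.
canonical = unicodedata.normalize("NFC", presentation)

print(uses_presentation_forms(presentation))  # True
print(uses_presentation_forms(plain))         # False
print(folded == plain)                        # True
print(canonical == presentation)              # True
```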

(In Hebrew, only five letters have a different final form.)

I.e., it's quite possible that the user who reported the RTL problem is using outdated software or software that can't handle RTL text.

In Sigil 2.2.1, my accented characters test file is still rendered correctly on my Windows 11 machine.
What does the book browser look like on a Mac?