06-25-2024, 05:23 PM   #3
Doitsu
I'm by no means a Unicode expert, but, AFAIK, most Windows apps do not perform Unicode normalization when opening or saving files.
For example, when I entered the string Häagen-Daz in BabelPad (freeware) and saved it as a UTF-8 plain-text file with a BOM, a hex editor showed it as: EF BB BF 48 C3 A4 61 67 65 6E 2D 44 61 7A
As expected, the ä umlaut was encoded as C3 A4.
When I entered the same string in LibreOffice and saved it as a UTF-8 plain-text file with a BOM, I got the exact same result (+ 0D 0A at the end).
I'm not aware of any Windows app that has normalization options for accented characters.
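
For anyone who wants to reproduce the byte check without a hex editor, here is a minimal Python sketch. It is only an illustration and says nothing about what BabelPad or LibreOffice do internally. It assumes Python 3.8+ and the standard unicodedata module; the helper name hex_bytes is made up for this example.

```python
import unicodedata

def hex_bytes(text, encoding="utf-8"):
    """Return the encoded bytes as space-separated uppercase hex (hypothetical helper)."""
    return text.encode(encoding).hex(" ").upper()

word = "H\u00e4agen-Daz"  # typed with the precomposed a-umlaut, U+00E4

nfc = unicodedata.normalize("NFC", word)  # composed form: one code point for the umlaut
nfd = unicodedata.normalize("NFD", word)  # decomposed form: 'a' followed by U+0308

# The "utf-8-sig" codec prepends the BOM (EF BB BF) when encoding.
print("NFC:", hex_bytes(nfc, "utf-8-sig"))
print("NFD:", hex_bytes(nfd, "utf-8-sig"))
```

If a file contained 61 CC 88 instead of C3 A4 for the umlaut, that would point to decomposed (NFD) text.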

Based on my experience, Arabic letters are usually saved as letters from the 0600–06FF range (Arabic), even though many of the letters are actually rendered using glyphs from the FE70–FEFF range (Arabic Presentation Forms-B).
Take, for example, the Arabic word من (min = from). It consists of a small circle on the right, which is the initial form of the letter MEEM, and a semi-circle with a dot above it, which is the final form of the letter NOON.

When I saved it as a UTF-8 file, the bytes were D9 85 D9 86:

MEEM U+0645 م d9 85
NOON U+0646 ن d9 86
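
The same table can be generated with a few lines of Python. This is just a sketch using the standard unicodedata module; the function name describe is an assumption made for this example.

```python
import unicodedata

word = "\u0645\u0646"  # the word "min", stored as base letters from the 0600-06FF block

def describe(text):
    """Return (code point, Unicode name, UTF-8 bytes as hex) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex(" "))
        for ch in text
    ]

for code_point, name, utf8_hex in describe(word):
    print(code_point, name, utf8_hex)
```

Note that the stored code points say nothing about initial or final shapes; choosing the shape is left to the text renderer.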

Quote:
Originally Posted by KevinH
I do not even know how to input from a keyboard to produce Hebrew or Arabic so I am at a loss. I can only cut and paste from somewhere else but who knows if the test copied was generated in NFD or NFC form or unknown form.
There's an Arabic website that allows you to enter Arabic words phonetically (using Latin characters): Yamli.
Please visit it, enter the sample word marhaban = welcome, and click on the first suggestion (مرحباً). Then copy it to a Mac editor and save it.
On my machine it was saved as: D9 85 D8 B1 D8 AD D8 A8 D8 A7 D9 8B
I.e., the codes for MEEM, REH, HAH, BEH, ALEF, FATHATAN.
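
To check pasted text of unknown origin, the hex dump can be decoded back into code points. The sketch below (standard library only, Python 3) decodes the bytes listed above and also compares the NFC and NFD forms. For this particular word the two forms are identical, because none of these six characters has a canonical decomposition.

```python
import unicodedata

dump = "D9 85 D8 B1 D8 AD D8 A8 D8 A7 D9 8B"  # bytes as shown in the hex editor

# bytes.fromhex ignores the spaces between the byte pairs.
text = bytes.fromhex(dump).decode("utf-8")

names = [unicodedata.name(ch) for ch in text]
for ch, name in zip(text, names):
    print(f"U+{ord(ch):04X}", name)

# Compare the normalization forms of this specific string.
same_in_both_forms = (
    unicodedata.normalize("NFC", text) == unicodedata.normalize("NFD", text) == text
)
print("NFC == NFD:", same_in_both_forms)
```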

I don't know what problems the Mac user reported, but, IIRC, very old versions of InDesign and other DTP apps came with RTL plugins that replaced characters from the 0600–06FF range with presentation forms from the FE70–FEFF range.
Visually, the words would look exactly the same. Take again the word من: if you encode it using characters from the FE70–FEFF range, it would look like this: ﻣﻦ
You could only tell the difference if you saved the string and examined it with a hex editor.

INITIAL MEEM U+FEE3 ﻣ ef bb a3
FINAL NOON U+FEE6 ﻦ ef bb a6
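
Instead of a hex editor, a short script can flag presentation forms. The following Python sketch is my own illustration, not something Sigil or InDesign does; the function name has_presentation_forms is hypothetical. It checks for code points in Arabic Presentation Forms-B and shows that NFKC (compatibility) normalization folds them back to the base letters, while NFC leaves them alone.

```python
import unicodedata

base = "\u0645\u0646"          # MEEM + NOON from the 0600-06FF block
presentation = "\ufee3\ufee6"  # INITIAL MEEM + FINAL NOON from the FE70-FEFF block

def has_presentation_forms(text):
    """True if any character is an Arabic Presentation Forms-B letter.

    The upper bound is U+FEFC so that U+FEFF (the BOM) is not reported.
    """
    return any(0xFE70 <= ord(ch) <= 0xFEFC for ch in text)

print(has_presentation_forms(base))          # base letters only
print(has_presentation_forms(presentation))  # presentation forms present

# NFC keeps presentation forms; NFKC maps them to the base letters.
print(unicodedata.normalize("NFC", presentation) == presentation)
print(unicodedata.normalize("NFKC", presentation) == base)
```

Keep in mind that NFKC also changes many non-Arabic compatibility characters (ligatures, full-width forms, etc.), so it is better suited for diagnosing a file than for blindly rewriting a whole book.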

(In Hebrew, only five letters have a different final form.)

I.e., it's quite possible that the user who reported the RTL problem is using outdated software or software that can't handle RTL text.

In Sigil 2.2.1, my accented characters test file is still rendered correctly on my Windows 11 machine.
What does the book browser look like on a Mac?
Attached thumbnail: rtl.png

Last edited by Doitsu; 06-25-2024 at 05:27 PM.