Old 06-25-2024, 12:33 PM   #1
KevinH
Sigil Developer
ALERT: Potential Issues with Sigil 2.2.X, RTL Languages, and Unicode Normalization Forms

Hi All,

I NEED YOUR HELP:

The original epub2 spec said that all Content documents must use Unicode Normalization Form C (NFC).

The epub3 spec now says that all file paths and URLs must use Normalization Form C.

Calibre, as far as I can tell, enforces NFC for every XHTML file it reads in.

So I need help from users who work with RTL languages and also LTR languages that use lots of accents.

Starting in Sigil 2.2.0, every file read in is converted to NFC.

This in turn has caused problems with RTL languages like Hebrew and Arabic, while fixing other issues for heavily accented languages and for search.

According to the spec at unicode.org, both Hebrew and Arabic use special combining character classes, and the same text should appear identical whether the input is NFC or NFD - but the two forms will *not* compare as identical, because normalization changes the order of combining marks and composes or decomposes accent and base character combinations.
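To make that concrete, here is a quick Python sketch (my own illustrative strings, not anything taken from Sigil; the Hebrew combining-class values are from the Unicode character data as I understand them):

Code:
import unicodedata

# "é" precomposed (NFC) vs base letter plus combining accent (NFD):
nfc = "\u00e9"        # single code point: LATIN SMALL LETTER E WITH ACUTE
nfd = "e\u0301"       # e + COMBINING ACUTE ACCENT
print(nfc, nfd)       # both render as "é"
print(nfc == nfd)     # False: visually identical, not byte-identical
print(unicodedata.normalize("NFC", nfd) == nfc)  # True once normalized

# Hebrew: normalization also reorders marks by combining class.
# bet + dagesh (ccc 21) + sheva (ccc 10), typed dagesh-first...
typed = "\u05d1\u05bc\u05b0"
fixed = unicodedata.normalize("NFC", typed)
print(typed == fixed)  # False: the marks get swapped into ccc order
print([hex(ord(c)) for c in fixed])  # ['0x5d1', '0x5b0', '0x5bc']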

Either way, mixing some text in NFC form with the same text elsewhere in NFD form will cause much pain and headaches for anyone trying to use Find and Replace, or for the end reader searching the finished epub.
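For example (again just an illustrative sketch), a plain substring search fails across forms until both sides are normalized to the same one:

Code:
import unicodedata

document = "caf\u00e9 au lait"   # document text stored in NFC
query = "cafe\u0301"             # user-typed query arriving in NFD

print(query in document)   # False: looks identical on screen, bytes differ

# Normalizing both sides to the same form makes the search reliable again.
nfc = lambda s: unicodedata.normalize("NFC", s)
print(nfc(query) in nfc(document))   # True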

So what I would like to know for *each* platform:

1. If you *type on your keyboard* to input Hebrew, Arabic, or any language with lots of accents, is that text stored in NFC form, NFD form, or some other mixed form inside the document you are editing (under Word, LibreOffice, Kate, emacs, or whatever text editor you use)?

2. Try the same text by copying it to the clipboard from your web browser and pasting it into a text editor: which Unicode normalization form does it use?

When running this test, just save the text directly to a file and post it as a zip here (in this thread), along with info on the language, platform, editor used, and source (typed on the keyboard vs copied and pasted from the clipboard). I can determine the form used by converting to UTF-8 and dumping the hex codes.
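If you want to check a sample yourself, something like this is roughly what I would run (the file name is a placeholder; unicodedata.is_normalized requires Python 3.8 or later):

Code:
import unicodedata

# Read a submitted sample (file name is just a placeholder).
with open("sample.txt", encoding="utf-8") as f:
    text = f.read()

# Python 3.8+ can test the normalization form directly:
print("NFC?", unicodedata.is_normalized("NFC", text))
print("NFD?", unicodedata.is_normalized("NFD", text))

# Or dump code points and their UTF-8 hex to inspect by eye:
for ch in text[:40]:
    print(f"U+{ord(ch):04X}  {ch.encode('utf-8').hex()}")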

For Arabic and Hebrew users of Sigil, I would recommend reverting to Sigil 2.1.0 until we find out what is going on with Unicode normalization and RTL languages, especially on macOS.

If anyone is an expert on Unicode Normalization Forms, especially how they are handled for RTL languages and whether normal keyboard input on each platform generates NFC or NFD form, I would love for you to post what you know here.

I do not even know how to input Hebrew or Arabic from a keyboard, so I am at a loss. I can only cut and paste from somewhere else, but who knows whether the text copied was generated in NFC form, NFD form, or something unknown.

The biggest issue here is NOT readability or lost text (no losses occur), but Find and Replace and end-user search. To the end user the text will appear correct, but if its normalization form differs from what they type on the keyboard or paste into the Find field, the search will come back empty.

Any help or guidance here would be greatly appreciated.

Thank you!
