Two characters look like one in Sigil, but wrong on the e-reader

repilo · 11-21-2022, 04:05 PM

Searching in Sigil for a text that looked strangely wrong in the e-reader I found the following character that looks like an accented e, but is actually two characters: é
When deleting it backwards, the accented e is not deleted as a single char, as it should be, but first only the accent is deleted.
I have only been able to find this rare accent character by using the wildcard \p{M}
I suppose there may be other variants with a character followed by this strange accent (though not in this epub).
Any suggestions for Find/Replace that could fix a generic case?
Would it be possible to consider adding to Sigil the search (if not fix) for these rare characters that, at least in some cases, do not reproduce well on e-readers?
In any case, thank you very much.

KevinH · 11-21-2022, 04:15 PM

Composed characters (made up of two or more separate Unicode codepoints) are allowed according to the Unicode spec, even when a single-codepoint version exists. Unicode uses normalization to simplify ordering of these multi-codepoint sequences.

See https://unicode.org/reports/tr15/#Norm_Forms

From the html spec:

Quote:

One might think that this would be a serious problem. However, most software systems consistently use a single Unicode representation to represent most languages/scripts, even though multiple representations are theoretically possible in Unicode. This form is typically very similar to Unicode Normalization Form C (or "NFC"), in which as many combining marks as possible are combined with base characters to form a single code point (NFC also specifies the order in which combining marks that cannot be combined appear; Unicode normalization forms do not guarantee that there will be no combining marks, as some languages/scripts cannot be encoded at all except via the use of combining characters). As a result, few users encounter issues with Unicode canonical equivalence. A recent survey of the Web concluded that over 99% of all content is in NFC.

The epub3 spec is silent on this topic except to say that file paths in the opf and urls must be NFC normalized.

So I would consider this a failure of your reading system for not fully supporting unicode.

That said, one way to deal with this instead of searching for all combinations of composed characters is to read in each file and be sure to NFC normalize it before writing the file out.

This is simple in Python and could easily be part of a plugin.

Code:

>>> import unicodedata
>>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
'\xe1'
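To make the "read in each file, normalize, write it out" idea concrete, here is a minimal sketch. The function name nfc_normalize_file is made up for illustration, and it assumes the files are UTF-8 and already extracted from the epub. It is not Sigil plugin code, just the core step such a plugin might wrap.

```python
import unicodedata
from pathlib import Path


def nfc_normalize_file(path: Path) -> bool:
    """Rewrite `path` in NFC form; return True if the content changed.

    Hypothetical helper: assumes the file is UTF-8 encoded.
    """
    # Work on bytes so line endings are preserved exactly as they were.
    original = path.read_bytes().decode("utf-8")
    normalized = unicodedata.normalize("NFC", original)
    if normalized == original:
        return False  # already NFC, leave the file untouched
    path.write_bytes(normalized.encode("utf-8"))
    return True
```

You could call this in a loop over every .xhtml file of an unzipped book and report which files changed. Writing only when something changed keeps untouched files byte-identical.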

And is simple to do in C++/Qt as well.

Code:

QString normalized(QString::NormalizationForm mode, QChar::UnicodeVersion version = QChar::Unicode_Unassigned) const

Some code editors, such as Emacs, have a way to normalize text files as well.

Tex2002ans · 11-21-2022, 10:41 PM

Quote:

Originally Posted by repilo

Searching in Sigil for a text that looked strangely wrong in the e-reader I found the following character that looks like an accented e, but is actually two characters: é

Yes, this is a combination of 2 characters:

e = U+0065 = LATIN SMALL LETTER E
́ = U+0301 = COMBINING ACUTE ACCENT

You also have a single-character version:

é = U+00E9 = LATIN SMALL LETTER E WITH ACUTE

I tend to always side with the 1-character-combined if possible.
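If you want to check for yourself which form a string is in, a small Python sketch like this lists the codepoints. The helper name describe is just something I made up for the example; it only uses the standard unicodedata module.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per codepoint in `text`."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT (2 codepoints)
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (1 codepoint)

print(describe(decomposed))
print(describe(precomposed))

# NFC normalization folds the 2-codepoint form into the 1-codepoint form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```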

I'll explain all the technical details far below. (See "!!!Technical Notes!!!".)

- - -

How Can You Spot These "Dangling Accents"?

My favorite way is:

Tools > Reports > Characters in HTML Files

If you scroll down the list, you can easily spot odd characters in your books.

If you double-click on a character, Sigil will also auto-search and jump you to the next version of that character.

Side Note: See my answers in:

to find Soft Hyphens + Grave Accents. Same exact logic applies.

Quote:

Originally Posted by repilo

I have only been able to find this rare accent character by using the wildcard \p{M}

Yes, that's one way to search: \p{M} matches any Unicode combining mark, so it catches these accents even if you don't know the exact codepoint.

But better to just use the Reports. WAY easier!
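For anyone who wants to hunt for these outside of Sigil, here is a rough Python equivalent of the \p{M} search. Python's built-in re module does not support \p{M}, so this sketch checks the Unicode category of each character instead. The function name find_combining_sequences is hypothetical.

```python
import unicodedata


def find_combining_sequences(text: str) -> list[tuple[int, str]]:
    """Return (index, sequence) for each character followed by combining marks."""
    hits = []
    i = 0
    while i < len(text):
        j = i + 1
        # Extend over any following mark characters (categories Mn, Mc, Me).
        while j < len(text) and unicodedata.category(text[j]).startswith("M"):
            j += 1
        if j > i + 1:
            hits.append((i, text[i:j]))
        i = j
    return hits


sample = "cafe\u0301 b\u0302\u0323"
for index, seq in find_combining_sequences(sample):
    print(index, [f"U+{ord(ch):04X}" for ch in seq])
```

Text that is already fully precomposed gives an empty list, so this also works as a quick "does this file have any dangling accents?" check.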

Side Note: What I do, on every book, is run the Reports + skim through it. If I see something very odd—like an EM QUAD—I take a much closer look.

If you want even more tricks you can do with Reports... scroll allllllll the way down to the bottom of this post:

2021: "Tables in ePub"

where I link to a lot more threads.

- - - - - - -

!!!Technical Notes!!!

Better to Use the 1 Character Version?

Yes. I'd say, if it's available:

It's always better to use the combined-into-one Unicode character.

1 character version = less buggy with things like:

search
highlighting
copy/pasting
spellchecking
different fonts
rendering
...

Theoretically, the letter+accents vs. combined version should look exactly the same—in reality, some programs have oddities.

So if it exists in Unicode as a single character, USE IT.
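The search problem is easy to reproduce. This is only a sketch with made-up sample text and a hypothetical helper called contains, but it shows why a plain search can miss a word that looks identical on screen, and how normalizing both sides first avoids that.

```python
import unicodedata


def nfc(text: str) -> str:
    """Shorthand for NFC normalization."""
    return unicodedata.normalize("NFC", text)


def contains(haystack: str, needle: str) -> bool:
    """Substring test that ignores composed vs. decomposed differences."""
    return nfc(needle) in nfc(haystack)


book_text = "un cafe\u0301 noir"   # 'é' stored as e + combining acute
query = "caf\u00e9"                # 'é' typed as a single codepoint

print(query in book_text)          # False: the codepoints differ
print(contains(book_text, query))  # True: both sides normalized first
```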

- - -

What is the Advantage of 2+ Character Version?

This allows you to:

represent any possible combination of characters + accents.
attach ONE OR MORE accents to the previous character.

So if the single-character version does not exist in Unicode, you can still display it.

For example, there are languages that use a letter:

â = a + circumflex
- Used in Romanian.
î = i + circumflex
- Used in Turkish.
ĵ = j + circumflex
- Used in Esperanto.

but there's no such language that has a letter:

b̂ = "b-hat"

BUT, this type of thing could be used in Statistics, Physics, or Maths, so the only way to write it in Unicode is:

b = U+0062 = LATIN SMALL LETTER B
̂ = U+0302 = COMBINING CIRCUMFLEX ACCENT

(Same with c-hat, p-hat, x-hat, or any other weird letter combos.)

Let's say you wanted a 'b' + circumflex + a single dot below. All you'd have to do is type:

b = U+0062 = LATIN SMALL LETTER B
̂ = U+0302 = COMBINING CIRCUMFLEX ACCENT
̣ = U+0323 = COMBINING DOT BELOW

and the 2 "combining accent characters" will latch on to the previous valid character.

You can then type a letter + any number of combining accents to create the symbols needed.
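A quick way to see which combinations have a single-character version, sketched in Python. The helper has_precomposed is my own invention for this example: it simply checks whether NFC normalization collapses the whole sequence to one codepoint. The second half shows that the order you type the accents in does not matter once the text is normalized.

```python
import unicodedata


def has_precomposed(sequence: str) -> bool:
    """True if NFC folds the whole sequence into a single codepoint."""
    return len(unicodedata.normalize("NFC", sequence)) == 1


print(has_precomposed("e\u0301"))  # True: U+00E9 exists
print(has_precomposed("b\u0302"))  # False: no single-codepoint b-hat

# Canonical ordering: marks below (class 220) sort before marks above (230),
# so both typing orders end up as the same sequence after normalization.
hat_then_dot = "b\u0302\u0323"
dot_then_hat = "b\u0323\u0302"
same = (unicodedata.normalize("NFD", hat_then_dot)
        == unicodedata.normalize("NFD", dot_then_hat))
print(same)  # True
print(unicodedata.combining("\u0323"), unicodedata.combining("\u0302"))  # 220 230
```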

- - -

Side Note: But, sometimes renderers get really buggy with these unexpected combinations:

[Image: b-hat.Plus.Dot.Below.png]

1st row = a Math font.
2nd row = a normal font.

You can see in the 2nd row, the second I added a dot below, the circumflex went crazy.

And in the 2nd image:

[Image: b-hat.in.Different.Fonts.png]

I used 4 different fonts and the circumflexes are all over the place. (That 4th font's accent even went flying to the bottom left corner!)

Side Note #2: You can even see some odd dotted/dotless letters only used in Gaelic/Irish:

2021: "free font with no dot on i"

Almost all fonts ARE NOT expecting such weird combos, so these things are very rarely tested.

- - -

Examples In Real-Life (Multiple Accents + Character Doesn't Exist In Unicode)

Right now, I'm in the process of an extremely long-term conversion of an old dictionary:

[Image: Dictionary.-.Multiple.Accents.Above.and.Below.png]

For pronunciation, they used all sorts of weird accents:

single dots above/below
double dots above/below
breve below

Even combining different accents at the same time!

This would allow me to represent anything possible, by just using a letter plus the:

̄ = U+0304 = COMBINING MACRON
̇ = U+0307 = COMBINING DOT ABOVE
̣ = U+0323 = COMBINING DOT BELOW
̮ = U+032E = COMBINING BREVE BELOW
̈ = U+0308 = COMBINING DIAERESIS
̤ = U+0324 = COMBINING DIAERESIS BELOW
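Typing these invisible-ish characters by hand is error-prone, so here is a small sketch of building such letters by name in Python. The helper with_marks is hypothetical, not something from my actual dictionary workflow. It relies on unicodedata.lookup, which turns an official Unicode character name into the character.

```python
import unicodedata


def with_marks(base: str, *mark_names: str) -> str:
    """Append the named combining characters to `base`, in the order given."""
    return base + "".join(unicodedata.lookup(name) for name in mark_names)


# 'a' with a macron above and a single dot below.
entry = with_marks("a", "COMBINING MACRON", "COMBINING DOT BELOW")
print([f"U+{ord(ch):04X}" for ch in entry])  # ['U+0061', 'U+0304', 'U+0323']
```

A misspelled name raises KeyError, which is handy: you find out right away instead of ending up with the wrong accent in the book.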

Quote:

Originally Posted by repilo

[...] rare characters that, at least in some cases, do not reproduce well on e-readers?

Yep, usually if you get weird � or '?' popping up, the font is missing that specific character.

Or you might be needing a "B-hat", but the fonts/renderers—like MobileRead—just aren't expecting such an odd combination:

See my response in:

2021: "Locking Fonts?"

referencing Hitch's + Jellby's B-hat (plus other weird combinations) in a Statistics book.

Most fonts just DO NOT handle that well. But a font designed for Maths/Science would probably make sure placement of circumflexes on arbitrary letters was tested MUCH more thoroughly.
