MobileRead Forums - View Single Post - Two characters look like one in Sigil, but wrong on the e-reader

Tex2002ans · 11-21-2022, 10:41 PM

Quote:

Originally Posted by repilo

Searching in Sigil for a text that looked strangely wrong in the e-reader I found the following character that looks like an accented e, but is actually two characters: é

Yes, this is a combination of 2 characters:

e = U+0065 = LATIN SMALL LETTER E
́ = U+0301 = COMBINING ACUTE ACCENT

You also have a single-character version:

é = U+00E9 = LATIN SMALL LETTER E WITH ACUTE

I tend to always side with the single-character (precomposed) version if possible.
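If you want to double-check this outside of Sigil, here is a minimal Python sketch using the standard `unicodedata` module. The `describe` helper is just a name I made up for illustration; it lists the codepoint and official Unicode name of every character in a string.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

two_chars = "e\u0301"  # e + COMBINING ACUTE ACCENT
one_char = "\u00e9"    # single precomposed character

print(describe(two_chars))
print(describe(one_char))

# The two strings render alike but are not equal when compared as-is.
print(two_chars == one_char)
```

This is why a plain search for the single-character version will not find the 2-character one.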

I'll explain all the technical details far below. (See "!!!Technical Notes!!!".)

- - -

How Can You Spot These "Dangling Accents"?

My favorite way is:

Tools > Reports > Characters in HTML Files

If you scroll down the list, you can easily spot odd characters in your books.

If you double-click on a character, Sigil will also auto-search and jump you to the next occurrence of that character.
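To get a rough idea of what such a report contains, here is a small Python sketch. This is NOT how Sigil implements its report; `character_report` is a hypothetical helper that just counts every character in a string and labels it with its Unicode name.

```python
from collections import Counter
import unicodedata

def character_report(text):
    """Return (codepoint, name, count) rows, sorted by code point."""
    counts = Counter(text)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in sorted(counts.items())
    ]

# Sample text mixing both forms of the accented e (made up for this example).
sample = "cafe\u0301 and caf\u00e9"
for code, name, n in character_report(sample):
    print(code, name, n)
```

In a list like this, a stray COMBINING ACUTE ACCENT row stands out right away.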

Side Note: See my answers in:

to find Soft Hyphens + Grave Accents. Same exact logic applies.

Quote:

Originally Posted by repilo

I have only been able to find this rare accent character by using the wildcard \p{M}

Yes, \p{M} is one way: it matches any Unicode combining mark. You can also search for one exact character... if you know the codepoint.

But better to just use the Reports. WAY easier!
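If you ever need the same kind of check in a script, here is a hedged Python sketch. Python's built-in `re` module does not understand \p{M}, so this version tests each character's Unicode general category instead (anything starting with "M" is a Mark). The function name `find_marks` is my own invention.

```python
import unicodedata

def find_marks(text):
    """Return (index, codepoint, name) for every character in a Mark category."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch).startswith("M")
    ]

print(find_marks("cafe\u0301"))  # 2-character version: one mark found
print(find_marks("caf\u00e9"))   # 1-character version: nothing to report
```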

Side Note: What I do, on every book, is run the Reports + skim through it. If I see something very odd—like an EM QUAD—I take a much closer look.

If you want even more tricks you can do with Reports... scroll allllllll the way down to the bottom of this post:

2021: "Tables in ePub"

where I link to a lot more threads.

- - - - - - -

!!!Technical Notes!!!

Better to Use the 1 Character Version?

Yes. I'd say, if it's available:

It's always better to use the combined-into-one Unicode character.

1 character version = less buggy with things like:

search
highlighting
copy/pasting
spellchecking
different fonts
rendering
...

Theoretically, the letter+accents vs. combined version should look exactly the same—in reality, some programs have oddities.

So if it exists in Unicode as a single character, USE IT.
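One way to do that conversion in bulk is Unicode normalization to NFC, which composes letter + accent pairs into the single character wherever Unicode defines one. A minimal Python sketch, assuming you have the book's text as a string (the helper name `to_single_characters` is hypothetical):

```python
import unicodedata

def to_single_characters(text):
    """Compose letter + combining accent sequences where a single character exists."""
    return unicodedata.normalize("NFC", text)

decomposed = "re\u0301sume\u0301"  # each accented e is stored as 2 characters
composed = to_single_characters(decomposed)

print(len(decomposed), len(composed))
print(composed == "r\u00e9sum\u00e9")
```

Combinations with no single-character version are left as letter + accent, so nothing is lost.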

- - -

What is the Advantage of 2+ Character Version?

This allows you to:

represent any possible combination of characters + accents.
attach ONE OR MORE accents to the previous character.

So if the single-character version does not exist in Unicode, you can still display it.

For example, there are languages that use a letter:

â = a + circumflex
- Used in Romanian.
î = i + circumflex
- Used in Turkish.
ĵ = j + circumflex
- Used in Esperanto.

but there's no such language that has a letter:

b̂ = "b-hat"

BUT, this type of thing could be used in Statistics, Physics, or Maths, so the only way to write it in Unicode is:

b = U+0062 = LATIN SMALL LETTER B
̂ = U+0302 = COMBINING CIRCUMFLEX ACCENT

(Same with p-hat, x-hat, or any other weird letter combos.)
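You can test whether a given letter + circumflex has a single-character version by normalizing it to NFC and seeing if it collapses to one character. A small Python sketch (the helper name is made up for illustration):

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT

def has_single_character_form(sequence):
    """True if NFC normalization collapses the sequence into one code point."""
    return len(unicodedata.normalize("NFC", sequence)) == 1

for base in ["a", "i", "j", "b", "p", "x"]:
    print(base, has_single_character_form(base + CIRCUMFLEX))
```

Here a, i, and j collapse to one character, while b, p, and x stay as letter + accent.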

Let's say you wanted a 'b' + circumflex + a single dot below. All you'd have to do is type:

b = U+0062 = LATIN SMALL LETTER B
̂ = U+0302 = COMBINING CIRCUMFLEX ACCENT
̣ = U+0323 = COMBINING DOT BELOW

and the 2 "combining accent characters" will latch on to the previous valid character.

You can then type a letter + any number of combining accents to create the symbols needed.
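The "latching on" can be sketched in Python like this. It is a simplification: it treats any character with a non-zero canonical combining class as an accent and glues it to the character before it. Full grapheme-cluster rules in Unicode are more involved, and `clusters` is just a hypothetical helper.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining accents that follow it."""
    groups = []
    for ch in text:
        # unicodedata.combining() returns 0 for ordinary (non-combining) characters.
        if unicodedata.combining(ch) and groups:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

b_hat_dot = "b" + "\u0302" + "\u0323"  # b + circumflex + dot below

print(len(b_hat_dot))            # number of code points stored
print(len(clusters(b_hat_dot)))  # number of visible "letters"
```

So the string holds 3 codepoints, but the reader should see a single letter.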

- - -

Side Note: But, sometimes renderers get really buggy with these unexpected combinations:

[Image: b-hat.Plus.Dot.Below.png]

1st row = a Math font.
2nd row = a normal font.

You can see in the 2nd row that the second I added a dot below, the circumflex went crazy.

And in the 2nd image:

[Image: b-hat.in.Different.Fonts.png]

I used 4 different fonts and the circumflexes are all over the place. (That 4th font's accent even went flying to the bottom left corner!)

Side Note #2: You can even see some odd dotted/dotless letters only used in Gaelic/Irish:

2021: "free font with no dot on i"

Almost all fonts ARE NOT expecting such weird combos, so these things are very rarely tested.

- - -

Examples In Real-Life (Multiple Accents + Character Doesn't Exist In Unicode)

Right now, I'm in the process of an extremely long-term conversion of an old dictionary:

[Image: Dictionary.-.Multiple.Accents.Above.and.Below.png]

For pronunciation, they used all sorts of weird accents:

single dots above/below
double dots above/below
breve below

Even combining different accents at the same time!

Combining characters allow me to represent any of these, by just using a letter plus the:

̄ = U+0304 = COMBINING MACRON
̇ = U+0307 = COMBINING DOT ABOVE
̣ = U+0323 = COMBINING DOT BELOW
̮ = U+032E = COMBINING BREVE BELOW
̈ = U+0308 = COMBINING DIAERESIS
̤ = U+0324 = COMBINING DIAERESIS BELOW
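As a rough sketch of how you could build these letters in a script, Python's `unicodedata.lookup` turns the official names above into the actual combining characters. The `with_marks` helper and the example letters are my own assumptions, not taken from the dictionary itself.

```python
import unicodedata

def with_marks(base, *mark_names):
    """Append the named combining characters to a base letter."""
    return base + "".join(unicodedata.lookup(name) for name in mark_names)

# Hypothetical pronunciation symbols, just to show the idea.
a_macron_dot_below = with_marks("a", "COMBINING MACRON", "COMBINING DOT BELOW")
u_breve_below = with_marks("u", "COMBINING BREVE BELOW")

print([f"U+{ord(ch):04X}" for ch in a_macron_dot_below])
print([f"U+{ord(ch):04X}" for ch in u_breve_below])
```

Using the names instead of pasting the raw accents keeps the source readable, since a lone combining character is hard to see in most editors.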

Quote:

Originally Posted by repilo

[...] rare characters that, at least in some cases, do not reproduce well on e-readers?

Yep, usually if you get weird � or '?' popping up, the font is missing that specific character.

Or you might be needing a "B-hat", but the fonts/renderers—like MobileRead—just aren't expecting such an odd combination:

See my response in:

2021: "Locking Fonts?"

referencing Hitch's + Jellby's B-hat (plus other weird combinations) in a Statistics book.

Most fonts just DO NOT handle that well. But a font designed for Maths/Science would probably make sure placement of circumflexes on arbitrary letters was tested MUCH more thoroughly.