11-21-2022, 10:41 PM   #3
Tex2002ans
Quote:
Originally Posted by repilo
Searching in Sigil for a text that looked strangely wrong in the e-reader I found the following character that looks like an accented e, but is actually two characters: é
Yes, this is a combination of 2 characters:
  • e = U+0065 = LATIN SMALL LETTER E
  • ́ = U+0301 = COMBINING ACUTE ACCENT

You also have a single-character version:
  • é = U+00E9 = LATIN SMALL LETTER E WITH ACUTE

I always side with the single combined (precomposed) character if one exists.
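
Side Note: If you want to see the difference outside Sigil, here is a minimal sketch using Python's standard unicodedata module. (This is only an illustration, not what Sigil does internally. The helper name describe is made up for this example.) It lists the codepoints in each form, then uses NFC normalization to fold the 2-character version into the 1-character version:

```python
import unicodedata

decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

def describe(text):
    """Return one 'U+XXXX NAME' string per codepoint in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# NFC composes letter+accent pairs into one codepoint when one exists.
normalized = unicodedata.normalize("NFC", decomposed)

print(describe(decomposed))
print(describe(normalized))
print(decomposed == precomposed)   # the raw strings differ
print(normalized == precomposed)   # after NFC they match
```

The two forms look identical on screen but compare as different strings until they are normalized, which is one reason search and spellcheck can trip over them.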

I'll explain all the technical details far below. (See "!!!Technical Notes!!!".)

- - -

How Can You Spot These "Dangling Accents"?

My favorite way is:
  • Tools > Reports > Characters in HTML Files

If you scroll down the list, you can easily spot odd characters in your books.

If you double-click on a character, Sigil will also auto-search and jump you to the next occurrence of that character.
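
If you ever need the same kind of list outside Sigil, a rough approximation is easy to sketch in Python. This is NOT Sigil's code, just an assumption-laden imitation of the idea: count every distinct character and print its codepoint and Unicode name. The function name character_report and the sample text are hypothetical:

```python
import unicodedata
from collections import Counter

def character_report(text):
    """Return (char, codepoint, name, count) rows for every distinct character."""
    rows = []
    for ch, count in sorted(Counter(text).items()):
        name = unicodedata.name(ch, "<unnamed>")  # control chars have no name
        rows.append((ch, f"U+{ord(ch):04X}", name, count))
    return rows

# Hypothetical sample: one "cafe" spelled with a combining accent,
# one spelled with the single precomposed character.
sample = "cafe\u0301 and caf\u00e9"
for row in character_report(sample):
    print(row)
```

In the output, the COMBINING ACUTE ACCENT row stands out the same way it does when skimming the Report.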

Side Note: See my answers in:

to find Soft Hyphens + Grave Accents. Same exact logic applies.

Quote:
Originally Posted by repilo
I have only been able to find this rare accent character by using the wildcard \p{M}
Yes, that's one way. \p{M} matches any character in Unicode's "Mark" category (combining accents and similar), so you don't even need to know the exact codepoint.

But better to just use the Reports. WAY easier!
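
Side Note: For scripting outside Sigil, here is a minimal sketch of the same "find every combining mark" idea in Python. Python's built-in re module does not support \p{M}, so this sketch checks each character's Unicode category instead (every Mark category starts with "M"). The function name find_marks is made up for this example:

```python
import unicodedata

def find_marks(text):
    """Yield (index, base_char, mark_name) for every combining mark in text."""
    base = ""  # most recent non-mark character seen
    for index, ch in enumerate(text):
        if unicodedata.category(ch).startswith("M"):
            yield index, base, unicodedata.name(ch, "<unnamed>")
        else:
            base = ch

for hit in find_marks("cafe\u0301"):
    print(hit)
```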

Side Note: What I do, on every book, is run the Reports + skim through it. If I see something very odd—like an EM QUAD—I take a much closer look.

If you want even more tricks you can do with Reports... scroll allllllll the way down to the bottom of this post:

where I link to a lot more threads.

- - - - - - -

!!!Technical Notes!!!

Better to Use the 1 Character Version?

Yes. I'd say, if it's available:
  • It's always better to use the combined-into-one Unicode character.

1 character version = less buggy with things like:
  • search
  • highlighting
  • copy/pasting
  • spellchecking
  • different fonts
  • rendering
  • ...

Theoretically, the letter+accents vs. combined version should look exactly the same—in reality, some programs have oddities.

So if it exists in Unicode as a single character, USE IT.

- - -

What is the Advantage of 2+ Character Version?

This allows you to:
  • represent any possible combination of characters + accents.
  • attach ONE OR MORE accents to the previous character.

So if the single-character version does not exist in Unicode, you can still display it.

For example, there are languages that use a letter:
  • â = a + circumflex
    • Used in Romanian.
  • î = i + circumflex
    • Used in Turkish.
  • ĵ = j + circumflex
    • Used in Esperanto.

but there's no such language that has a letter:
  • b̂ = "b-hat"

BUT, this type of thing could be used in Statistics, Physics, or Maths, so the only way to write it in Unicode is:
  • b = U+0062 = LATIN SMALL LETTER B
  • ̂ = U+0302 = COMBINING CIRCUMFLEX ACCENT

(Same with p-hat, x-hat, or any other weird letter combos. Note that c-hat does already exist as a single character, because Esperanto uses it: ĉ = U+0109.)
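
One hedged way to check whether a single-character version exists: ask Python's unicodedata module to NFC-normalize the letter + accent and see if it collapses to one codepoint. This is only a sketch (the helper name has_precomposed is made up), but it matches the examples above:

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT

def has_precomposed(base, mark):
    """True if NFC folds base+mark into a single codepoint."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1

for letter in "aijbpx":
    print(letter, has_precomposed(letter, CIRCUMFLEX))
```

The letters a, i, and j collapse to one character. The letters b, p, and x stay as 2 characters, because Unicode has no single-character version to compose them into.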

Let's say you wanted a 'b' + circumflex + a single dot below. All you'd have to do is type:
  • b = U+0062 = LATIN SMALL LETTER B
  • ̂ = U+0302 = COMBINING CIRCUMFLEX ACCENT
  • ̣ = U+0323 = COMBINING DOT BELOW

and the 2 "combining accent characters" will latch on to the previous valid character.

You can then type a letter + any number of combining accents to create the symbols needed.
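
Side Note: When several accents are stacked on one letter, Unicode normalization may reorder them. Marks carry a "canonical combining class" number, and normalization sorts marks with different classes into ascending order. A minimal sketch with Python's unicodedata module (an illustration only, the variable names are mine):

```python
import unicodedata

typed = "b\u0302\u0323"  # b + COMBINING CIRCUMFLEX ACCENT + COMBINING DOT BELOW

# NFD keeps the characters separate but sorts the marks by combining class.
nfd = unicodedata.normalize("NFD", typed)

for ch in nfd:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          "class", unicodedata.combining(ch))
```

The dot below (class 220) sorts ahead of the circumflex (class 230), so both typing orders normalize to the same sequence. That is worth knowing if you search for the exact codepoint sequence you typed.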

- - -

Side Note: But, sometimes renderers get really buggy with these unexpected combinations:

[Image: b-hat.Plus.Dot.Below.png]
  • 1st row = a Math font.
  • 2nd row = a normal font.

You can see in the 2nd row that, the moment I added a dot below, the circumflex went crazy.

And in the 2nd image:

[Image: b-hat.in.Different.Fonts.png]

I used 4 different fonts and the circumflexes are all over the place. (That 4th font's accent even went flying to the bottom left corner!)

Side Note #2: You can even see some odd dotted/dotless letters only used in Gaelic/Irish:

Almost all fonts ARE NOT expecting such weird combos, so these things are very rarely tested.

- - -

Examples In Real-Life (Multiple Accents + Character Doesn't Exist In Unicode)

Right now, I'm in the process of an extremely long-term conversion of an old dictionary:

[Image: Dictionary.-.Multiple.Accents.Above.and.Below.png]

For pronunciation, they used all sorts of weird accents:
  • single dots above/below
  • double dots above/below
  • breve below

Even combining different accents at the same time!

Combining characters allow me to represent any of these, by just using a letter plus the:
  • ̄ = U+0304 = COMBINING MACRON
  • ̇ = U+0307 = COMBINING DOT ABOVE
  • ̣ = U+0323 = COMBINING DOT BELOW
  • ̮ = U+032E = COMBINING BREVE BELOW
  • ̈ = U+0308 = COMBINING DIAERESIS
  • ̤ = U+0324 = COMBINING DIAERESIS BELOW
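
As a rough sketch of how such letters could be assembled in a script, Python's unicodedata.lookup turns the official names above into the actual characters. The helper with_marks is hypothetical, and this is not my actual conversion workflow, just an illustration:

```python
import unicodedata

def with_marks(base, *mark_names):
    """Return base followed by the combining marks named in mark_names."""
    marks = "".join(unicodedata.lookup(name) for name in mark_names)
    return base + marks

# A letter with a macron above and a dot below.
symbol = with_marks("a", "COMBINING MACRON", "COMBINING DOT BELOW")
print(symbol, [f"U+{ord(ch):04X}" for ch in symbol])

# A letter with a diaeresis above and a breve below.
other = with_marks("u", "COMBINING DIAERESIS", "COMBINING BREVE BELOW")
print(other, [f"U+{ord(ch):04X}" for ch in other])
```

Spelling the marks out by name keeps the source readable, since the raw combining characters are nearly invisible in most editors.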

Quote:
Originally Posted by repilo
[...] rare characters that, at least in some cases, do not reproduce well on e-readers?
Yep, usually if you get weird � or '?' popping up, the font is missing that specific character.

Or you might be needing a "B-hat", but the fonts/renderers—like MobileRead—just aren't expecting such an odd combination:

See my response in:

referencing Hitch's + Jellby's B-hat (plus other weird combinations) in a Statistics book.

Most fonts just DO NOT handle that well. But a font designed for Maths/Science would probably make sure placement of circumflexes on arbitrary letters was tested MUCH more thoroughly.

Last edited by Tex2002ans; 11-21-2022 at 11:05 PM.