Quote:
Originally Posted by repilo
Searching in Sigil for a text that looked strangely wrong in the e-reader I found the following character that looks like an accented e, but is actually two characters: é
|
Yes, this is a combination of 2 characters:
- e = U+0065 = LATIN SMALL LETTER E
- ́ = U+0301 = COMBINING ACUTE ACCENT
You also have a single-character version:
- é = U+00E9 = LATIN SMALL LETTER E WITH ACUTE
I tend to always side with the single combined character if possible.
I'll explain all the technical details far below. (See "!!!Technical Notes!!!".)
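Side Note: If you want to double-check this outside of Sigil, here is a minimal Python sketch using only the standard `unicodedata` module. The `describe` helper is just a name I made up for illustration; it lists the code point and official Unicode name of every character in a string.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT (2 code points)
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (1 code point)

print(describe(decomposed))
print(describe(precomposed))

# They look the same on screen, but a plain comparison says they differ.
print(decomposed == precomposed)  # False
```

That last line is why search and spellcheck can trip over the 2-character version.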
- - -
How Can You Spot These "Dangling Accents"?
My favorite way is:
- Tools > Reports > Characters in HTML Files
If you scroll down the list, you can easily spot odd characters in your books.
If you double-click on a character, Sigil will also auto-search and jump you to the next occurrence of that character.
Side Note: See my answers in:
to find Soft Hyphens + Grave Accents. Same exact logic applies.
Quote:
Originally Posted by repilo
I have only been able to find this rare accent character by using the wildcard \p{M}
|
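Yes, that's one way: \p{M} matches any Unicode combining mark, so you don't even need to know the exact codepoint. (If you do know the codepoint, you can also search for that one specific character.)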
But better to just use the Reports. WAY easier!
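If you ever need the same check outside of Sigil (say, in a batch script), here is a rough Python sketch of what \p{M} does. It assumes you already have the text as a string; `find_combining_marks` is a hypothetical helper name, not anything from Sigil. It flags every character whose Unicode general category starts with "M" (Mn, Mc, Me), which is the set \p{M} matches.

```python
import unicodedata

def find_combining_marks(text):
    """Return (index, previous character, codepoint) for each combining mark."""
    hits = []
    for i, ch in enumerate(text):
        # Categories Mn, Mc and Me are the "Mark" categories matched by \p{M}.
        if unicodedata.category(ch).startswith("M"):
            previous = text[i - 1] if i > 0 else ""
            hits.append((i, previous, f"U+{ord(ch):04X}"))
    return hits

# First "cafe" uses e + combining acute, second uses the single-character é.
sample = "cafe\u0301 and caf\u00e9"
print(find_combining_marks(sample))  # [(4, 'e', 'U+0301')]
```

Only the 2-character version gets reported; the single-character é is left alone.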
Side Note: What I do, on every book, is run the Reports + skim through it. If I see something very odd (like an EM QUAD), I take a much closer look.
If you want even more tricks you can do with Reports... scroll allllllll the way down to the bottom of this post:
where I link to a lot more threads.
- - - - - - -
!!!Technical Notes!!!
Better to Use the 1 Character Version?
Yes. I'd say, if it's available:
- It's always better to use the combined-into-one Unicode character.
1 character version = less buggy with things like:
- search
- highlighting
- copy/pasting
- spellchecking
- different fonts
- rendering
- ...
Theoretically, the letter+accents version and the combined version should look exactly the same. In reality, some programs have oddities.
So if it exists in Unicode as a single character, USE IT.
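Unicode calls this conversion "NFC normalization": wherever a single combined character exists, letter+accent sequences get replaced by it. Here is a minimal Python sketch, assuming you want to normalize a whole string at once. `to_single_characters` is just an illustrative name, and since NFC can change a few other characters too, I'd review the result before saving over a book.

```python
import unicodedata

def to_single_characters(text):
    """Compose letter + combining accent pairs into single characters (NFC)."""
    return unicodedata.normalize("NFC", text)

before = "re\u0301sume\u0301"          # "résumé" typed with combining accents
after = to_single_characters(before)

print(len(before), len(after))          # 8 6
print(after == "r\u00e9sum\u00e9")      # True
```

Combinations with no single-character version in Unicode are simply left as they are.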
- - -
What is the Advantage of 2+ Character Version?
This allows you to:
- represent any possible combination of characters + accents.
- attach ONE OR MORE accents to the previous character.
So if the single-character version does not exist in Unicode, you can still display it.
For example, there are languages that use a letter:
- â = a + circumflex
- î = i + circumflex
- ĵ = j + circumflex
but there's no such language that has a letter:
- b̂ = b + circumflex
BUT, this type of thing could be used in Statistics, Physics, or Maths, and the only way to write it in Unicode is:
- b = U+0062 = LATIN SMALL LETTER B
- ̂ = U+0302 = COMBINING CIRCUMFLEX ACCENT
(Same with c-hat, p-hat, x-hat, or any other weird letter combos.)
Let's say you wanted a 'b' + circumflex + a single dot below. All you'd have to do is type:
- b = U+0062 = LATIN SMALL LETTER B
- ̂ = U+0302 = COMBINING CIRCUMFLEX ACCENT
- ̣ = U+0323 = COMBINING DOT BELOW
and the 2 "combining accent characters" will latch on to the previous valid character.
You can then type a letter + any number of combining accents to create the symbols needed.
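To see the "latching on" for yourself, here is a small Python sketch (standard library only, variable names are mine). It shows that b + circumflex stays as 2 code points even after NFC, because there is no single-character b-hat, and that each accent carries a "combining class" which Unicode uses to treat different typing orders of above/below accents as the same text.

```python
import unicodedata

b_hat = "b\u0302"             # b + COMBINING CIRCUMFLEX ACCENT
b_hat_dot = "b\u0302\u0323"   # ... plus COMBINING DOT BELOW
other_order = "b\u0323\u0302" # same accents, typed in the other order

# No single-character b-circumflex exists, so NFC cannot shrink it.
print(len(unicodedata.normalize("NFC", b_hat)))  # 2

# Combining classes: 230 = attaches above, 220 = attaches below.
print(unicodedata.combining("\u0302"), unicodedata.combining("\u0323"))  # 230 220

# After normalization both typing orders become the same sequence.
same = (unicodedata.normalize("NFD", b_hat_dot)
        == unicodedata.normalize("NFD", other_order))
print(same)  # True
```

So the order you type an "above" accent and a "below" accent in should not matter to well-behaved software, even if some renderers disagree.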
- - -
Side Note: But, sometimes renderers get really buggy with these unexpected combinations:
- 1st row = a Math font.
- 2nd row = a normal font.
You can see in the 2nd row, the second I added a dot below, the circumflex went crazy.
And in the 2nd image:
I used 4 different fonts and the circumflexes are all over the place. (That 4th font's accent even went flying to the bottom left corner!)
Side Note #2: You can even see some odd dotted/dotless letters only used in Gaelic/Irish:
Almost all fonts ARE NOT expecting such weird combos, so these things are very rarely tested.
- - -
Examples In Real-Life (Multiple Accents + Character Doesn't Exist In Unicode)
Right now, I'm in the process of an extremely long-term conversion of an old dictionary:
For pronunciation, they used all sorts of weird accents:
- single dots above/below
- double dots above/below
- breve below
Even combining different accents at the same time!
Combining characters allow me to represent any of these, by just using a letter plus the:
- ̄ = U+0304 = COMBINING MACRON
- ̇ = U+0307 = COMBINING DOT ABOVE
- ̣ = U+0323 = COMBINING DOT BELOW
- ̮ = U+032E = COMBINING BREVE BELOW
- ̈ = U+0308 = COMBINING DIAERESIS
- ̤ = U+0324 = COMBINING DIAERESIS BELOW
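As a rough sketch of how you could build these pronunciation letters in a script instead of hunting for each accent by hand: the hypothetical `with_marks` helper below takes a base letter plus the official Unicode names from the list above and glues them together. It only uses Python's standard `unicodedata.lookup`; the example letters are made up, not taken from the dictionary.

```python
import unicodedata

def with_marks(letter, *mark_names):
    """Return letter followed by the combining marks named in mark_names."""
    marks = "".join(unicodedata.lookup(name) for name in mark_names)
    return letter + marks

# Illustrative combinations only.
a_macron_dot = with_marks("a", "COMBINING MACRON", "COMBINING DOT BELOW")
u_breve_below = with_marks("u", "COMBINING BREVE BELOW")

print(a_macron_dot, [f"U+{ord(c):04X}" for c in a_macron_dot])
print(u_breve_below, [f"U+{ord(c):04X}" for c in u_breve_below])
```

Whether the result actually looks right still depends on the font, as the screenshots above show.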
Quote:
Originally Posted by repilo
[...] rare characters that, at least in some cases, do not reproduce well on e-readers?
|
Yep, usually if you get weird � or '?' popping up, the font is missing that specific character.
Or you might be needing a "B-hat", but the fonts/renderers—like MobileRead—just aren't expecting such an odd combination:
See my response in:
referencing Hitch's + Jellby's B-hat (plus other weird combinations) in a Statistics book.
Most fonts just DO NOT handle that well. But a font designed for Maths/Science would probably make sure placement of circumflexes on arbitrary letters was tested MUCH more thoroughly.