11-21-2022, 04:05 PM | #1 |
Enthusiast
Posts: 43
Karma: 10
Join Date: Apr 2021
Location: Spain
Device: Kobo Libra 2
|
Two characters look like one in Sigil, but wrong on the e-reader
Searching in Sigil for a text that looked strangely wrong in the e-reader I found the following character that looks like an accented e, but is actually two characters: é
When deleting it backwards, the accented e is not deleted as a single char, as it should be; instead, only the accent is deleted first. I have only been able to find this rare accent character by using the wildcard \p{M}. I suppose there may be other variants with a character followed by this strange accent (though not in this epub).
Any suggestions for a Find/Replace that could fix the generic case? Would it be possible to consider adding to Sigil a search (if not a fix) for these rare characters that, at least in some cases, do not reproduce well on e-readers?
In any case, thank you very much.
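For the generic case asked about here, one possible approach (a sketch, not something Sigil ships): since \p{M} already works in Find, a pattern such as \P{M}\p{M}+ should match any non-mark character followed by one or more combining marks. The same check can be sketched outside Sigil in Python. `find_combining_sequences` is a hypothetical helper name, and only the standard `unicodedata` module is assumed.

```python
import unicodedata


def find_combining_sequences(text):
    """Return (index, sequence) pairs for each base character that is
    followed by one or more combining marks (Unicode category M*)."""
    results = []
    i = 0
    n = len(text)
    while i < n:
        j = i + 1
        # Extend the span over any combining marks that follow text[i].
        while j < n and unicodedata.category(text[j]).startswith("M"):
            j += 1
        base_is_mark = unicodedata.category(text[i]).startswith("M")
        if j - i > 1 and not base_is_mark:
            results.append((i, text[i:j]))
        i = j
    return results


# 'e' followed by U+0301 COMBINING ACUTE ACCENT is reported;
# the single code point U+00E9 is not.
for index, seq in find_combining_sequences("cafe\u0301 and caf\u00e9"):
    print(index, [f"U+{ord(ch):04X}" for ch in seq])
```

This only reports where the two-codepoint sequences are; it does not change the text.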
11-21-2022, 04:15 PM | #2 | |
Sigil Developer
Posts: 7,645
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Composed characters (made up of two or more separate Unicode codepoints) are allowed according to the Unicode spec, even when a single-codepoint version exists. Unicode uses normalization to simplify ordering of these multi-codepoint sequences.
See https://unicode.org/reports/tr15/#Norm_Forms From the html spec: Quote:
So I would consider this a failure of your reading system for not fully supporting Unicode.
That said, instead of searching for every combination of composed characters, one way to deal with this is to read in each file and NFC normalize it before writing the file out. This is simple in Python and could easily be part of a plugin.
Code:
>>> import unicodedata
>>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
'\xe1'
Code:
QString normalized(QString::NormalizationForm mode, QChar::UnicodeVersion version = QChar::Unicode_Unassigned) const
Last edited by KevinH; 11-21-2022 at 04:55 PM.
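The "read in each file, NFC normalize it, write it out" idea above could look roughly like this in plain Python. This is only a sketch under assumptions: `nfc_normalize_file` is a hypothetical helper name, the files are assumed to be UTF-8, and it does not use Sigil's plugin API at all.

```python
import unicodedata
from pathlib import Path


def nfc_normalize_file(path, encoding="utf-8"):
    """Rewrite the file in NFC form; return True if its content changed.

    Assumption: the file is text in the given encoding (UTF-8 by default).
    Bytes are read and written directly so line endings are left untouched.
    """
    p = Path(path)
    original = p.read_bytes().decode(encoding)
    normalized = unicodedata.normalize("NFC", original)
    if normalized == original:
        return False  # already NFC, leave the file alone
    p.write_bytes(normalized.encode(encoding))
    return True
```

Usage would be something like calling `nfc_normalize_file(f)` for each `.xhtml` file of an unpacked epub, on a copy of the book, and checking which calls return True.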
|
11-21-2022, 10:41 PM | #3 | |||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
You also have a single-character version:
I tend to always side with the 1-character-combined if possible. I'll explain all the technical details far below. (See "!!!Technical Notes!!!".)
- - -
How Can You Spot These "Dangling Accents"?
My favorite way is:
If you scroll down the list, you can easily spot odd characters in your books. If you double-click on a character, Sigil will also auto-search and jump you to the next occurrence of that character.
Side Note: See my answers in:
to find Soft Hyphens + Grave Accents. Same exact logic applies. Quote:
But better to just use the Reports. WAY easier!
Side Note: What I do, on every book, is run the Reports + skim through it. If I see something very odd (like an EM QUAD), I take a much closer look.
If you want even more tricks you can do with Reports... scroll allllllll the way down to the bottom of this post: where I link to a lot more threads.
- - - - - - -
!!!Technical Notes!!!
Better to Use the 1 Character Version?
Yes. I'd say, if it's available:
1 character version = less buggy with things like:
Theoretically, the letter+accents vs. combined version should look exactly the same, but in reality, some programs have oddities. So if it exists in Unicode as a single character, USE IT.
- - -
What is the Advantage of 2+ Character Version?
This allows you to:
So if the single-character version does not exist in Unicode, you can still display it. For example, there are languages that use a letter:
but there's no such language that has a letter:
BUT, this type of thing could be used in Statistics, Physics, or Maths, so the only way to write it in Unicode is:
(Same with c-hat, p-hat, x-hat, or any other weird letter combos.) Let's say you wanted a 'b' + circumflex + a single dot below. All you'd have to do is type:
and the 2 "combining accent characters" will latch on to the previous valid character. You can then type a letter + any number of combining accents to create the symbols needed.
- - -
Side Note: But, sometimes renderers get really buggy with these unexpected combinations:
You can see in the 2nd row, the second I added a dot below, the circumflex went crazy. And in the 2nd image: I used 4 different fonts and the circumflexes are all over the place. (That 4th font's accent even went flying to the bottom left corner!)
Side Note #2: You can even see some odd dotted/dotless letters only used in Gaelic/Irish:
Almost all fonts ARE NOT expecting such weird combos, so these things are very rarely tested.
- - -
Examples In Real-Life (Multiple Accents + Character Doesn't Exist In Unicode)
Right now, I'm in the process of an extremely long-term conversion of an old dictionary:
For pronunciation, they used all sorts of weird accents:
Even combining different accents at the same time! This would allow me to represent anything possible, by just using a letter plus the:
Quote:
Or you might be needing a "B-hat", but the fonts/renderers (like MobileRead) just aren't expecting such an odd combination:
See my response in: referencing Hitch's + Jellby's B-hat (plus other weird combinations) in a Statistics book.
Most fonts just DO NOT handle that well. But a font designed for Maths/Science would probably make sure placement of circumflexes on arbitrary letters was tested MUCH more thoroughly.
Last edited by Tex2002ans; 11-21-2022 at 11:05 PM.
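To make the "1 character vs. 2+ character" distinction above a bit more concrete, here is a small sketch that checks whether NFC can collapse a letter + combining accents into a single code point. `has_precomposed_form` is a hypothetical helper name; the sample sequences are only illustrations picked for this example, not taken from the books discussed in the thread.

```python
import unicodedata


def has_precomposed_form(sequence):
    """Return True if NFC collapses the whole sequence into one code point."""
    return len(unicodedata.normalize("NFC", sequence)) == 1


samples = {
    "e + acute": "e\u0301",                         # U+0301 COMBINING ACUTE ACCENT
    "x + circumflex": "x\u0302",                    # U+0302 COMBINING CIRCUMFLEX ACCENT
    "b + circumflex + dot below": "b\u0302\u0323",  # U+0323 COMBINING DOT BELOW
}

for label, seq in samples.items():
    nfc = unicodedata.normalize("NFC", seq)
    # Show the code points that remain after normalization.
    print(label, [f"U+{ord(ch):04X}" for ch in nfc], has_precomposed_form(seq))
```

So NFC normalization only helps where Unicode actually defines a single-codepoint version. An x-hat stays as letter + combining accent no matter what, and how it looks is then up to the font and renderer.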