MobileRead Forums - View Single Post - Using Regex Find/Change to change by Unicode

KevinH · 05-29-2024, 09:55 AM

So your InDesign files are improperly mixing Unicode (UTF-8) encoded text with what is most likely Latin-1 encoded text in the same file, to match an older font.
Mixing two different text encodings in the same file violates the XHTML spec. To Sigil on import, it would look like a pure UTF-8 encoded text file with some rogue encoded chars.

At least that means there is no breakage in Sigil's import epub code, which is what I was worried about.
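
For illustration only, here is a minimal Python sketch of the mixed-encoding scenario described above. The byte string and the helper name `inspect_bytes` are made up for the example; this is not how Sigil's importer works, just a way to see why a stray Latin-1 byte stands out in otherwise valid UTF-8.

```python
def inspect_bytes(raw: bytes) -> str:
    """Decode as UTF-8, marking undecodable bytes with U+FFFD."""
    try:
        # Strict decoding raises as soon as a byte sequence is not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Lenient decoding keeps the valid text and flags the rogue bytes.
        return raw.decode("utf-8", errors="replace")


# Hypothetical data: "ā" encoded as UTF-8, followed by "á" encoded as Latin-1.
mixed = "\u0101".encode("utf-8") + "\u00E1".encode("latin-1")
print(repr(inspect_bytes(mixed)))
```

The lone Latin-1 byte 0xE1 is not a complete UTF-8 sequence, so it comes back as the replacement character rather than as á.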

Quote:

Originally Posted by oston

The InDesign files where this happens are old files. The font families used for the text are all 8-bit fonts in which the diacriticals were coded at different positions than in the later Unicode fonts.
In the page that you saw in the capture, the top two lines of diacritics used TNR in the InDesign document, where ā is at 0101, the now-standard place for ā.
The lower two lines were in GaramondNo8BPS, a very old font family where the diacritics did not use standard codes; there, ā is placed at code 00E1.

I don't know if this answers your confusion, Kevin.

It would be helpful to be able to use the Sigil Regex find/replace with
Find: \x{00E1}
Replace: \x{0101}
in the same way that I do in InDesign.

But as I said at the beginning, this is NOT a big issue for me. It's more a matter of just wanting to know what I am doing wrong with the Regex search and replace using HEX char strings.
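
As a rough sketch of the 00E1 to 0101 substitution described in the quote, done outside Sigil with Python's `re` module: the mapping table and the function name `remap_legacy_chars` are assumptions for illustration, and only the ā pair comes from the thread. Note that Python's `re` writes code points as `\u00E1` rather than the `\x{00E1}` form, and that a blanket replacement like this will also change any genuine á in the text.

```python
import re

# Hypothetical table: legacy 8-bit font slot -> standard Unicode code point.
# Only the first pair is taken from the thread; add others as needed.
LEGACY_TO_UNICODE = {
    "\u00E1": "\u0101",  # slot 00E1 in the old font -> ā (U+0101)
}


def remap_legacy_chars(text: str, mapping=LEGACY_TO_UNICODE) -> str:
    """Replace every character found in `mapping` with its mapped value."""
    # Build one character class from all the legacy characters.
    pattern = re.compile("[" + "".join(re.escape(ch) for ch in mapping) + "]")
    return pattern.sub(lambda m: mapping[m.group(0)], text)


sample = "k\u00E1ma"  # text that was set in the old font
print(remap_legacy_chars(sample))
```

Running it over only the text that was set in the old font would avoid touching characters that are already correct.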