MobileRead Forums - View Single Post - Using Regex Find/Change to change by Unicode

KevinH · 05-29-2024, 10:28 AM

I still do not understand how a stray E1 byte, which is not valid UTF-8 on its own, got into the file in the first place. Either the original XHTML was cp1251 or Latin-1 encoded and did not declare that, so it could not be properly converted to UTF-8 when it was read in, or text copied from a cp1251 or Latin-1 source was pasted in without proper conversion.
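To make the E1 point concrete, here is a minimal Python sketch (not Sigil's code). The sample bytes and the helper name `try_decode` are made up for illustration. It shows that a lone E1 followed by ASCII is rejected by a UTF-8 decoder, while the same byte is a perfectly good character in Latin-1 or cp1251.

```python
# Hypothetical sample: the word "más" saved as Latin-1, so "á" is the single byte E1.
raw = b"m\xe1s"

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

as_utf8 = try_decode(raw, "utf-8")      # E1 starts a 3-byte sequence that never completes
as_latin1 = try_decode(raw, "latin-1")  # E1 is U+00E1 (a with acute)
as_cp1251 = try_decode(raw, "cp1251")   # E1 is U+0431 (Cyrillic be)

# Proper conversion: decode with the real source encoding, then re-encode as UTF-8.
fixed = raw.decode("latin-1").encode("utf-8")
```

If the source encoding is known, that last decode/encode round trip is all the conversion that is needed, and no find/replace pass should be required afterwards.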

Either way, the find/replace step should not be needed unless an earlier step broke somewhere.

The font used has nothing to do with reading in and properly decoding a text file. The problem typically comes from not declaring the file's original encoding near the top of the file. Without that declaration, Sigil's auto-detection code can sometimes guess the input encoding incorrectly. Telling Latin-x/cp125x apart from UTF-8 is actually quite hard from small snippets of text.
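A rough Python sketch of why the declaration matters follows. This is not how Sigil's detection works; `declared_encoding`, `plausible_encodings`, and the candidate list are hypothetical names chosen for the illustration. The first helper reads the encoding from an XML declaration at the top of the file, and the second shows how many encodings accept the same short byte string without error.

```python
import re

# Matches an XML declaration such as <?xml version="1.0" encoding="windows-1251"?>
DECL_RE = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

def declared_encoding(data):
    """Return the encoding named in the XML declaration, or None if there is none."""
    m = DECL_RE.match(data.lstrip())  # note: does not skip a byte order mark
    return m.group(1).decode("ascii") if m else None

def plausible_encodings(data, candidates=("utf-8", "cp1251", "latin-1")):
    """Return every candidate encoding that decodes the bytes without error."""
    ok = []
    for enc in candidates:
        try:
            data.decode(enc)
            ok.append(enc)
        except UnicodeDecodeError:
            pass
    return ok

with_decl = b'<?xml version="1.0" encoding="windows-1251"?>\n<html/>'
without_decl = b"<html/>"

# Two bytes that are one character in UTF-8 but two characters in cp1251 or Latin-1.
ambiguous = plausible_encodings(b"\xc3\xa9")
# A lone E1 rules out UTF-8 but still leaves two single-byte candidates.
still_ambiguous = plausible_encodings(b"m\xe1s")
```

With a declaration present there is nothing to guess. Without one, a detector has to pick among every encoding that happens to decode cleanly, and on a short snippet several of them usually do.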
