Old 12-24-2024, 08:51 AM   #6
Shohreh
Addict
Posts: 207
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Looking at the EPUB in a hex editor, I see that each accented character has been replaced with the bytes "EF BF BD".

Here's the explanation: "The sequence "ef bf bd" is UTF-8 for U+FFFD (REPLACEMENT CHARACTER), i.e., a special code that is shown as "�", as mentioned in your question. Therefore, something (Python?) must have replaced the original char with this code. So your terminal appears to be okay.

The 'é' character U+00E9 (LATIN SMALL LETTER E WITH ACUTE) in UTF-8 would read "c3 a9" instead.

It is conceivable that your original subtitle might be encoded as CP1252, where the 'é' is represented by code 0xe9. Since the next byte is 0x72 ('r'), your parser might have interpreted the 0xe9 as an incomplete UTF-8 sequence and therefore replaced the "e9" with "ef bf bd" (REPLACEMENT CHARACTER)." (source)
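The mechanism described in that quote can be reproduced with a minimal Python sketch. The sample word "opéra" and the helper hex_bytes are my own, purely for illustration, and I'm assuming the tool that damaged the file decoded the bytes as UTF-8 with a "replace" error policy, which is only a guess at what actually happened:

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes the way a hex editor shows them (hypothetical helper)."""
    return " ".join(f"{b:02x}" for b in data)

# Hypothetical sample word: 'é' followed by 'r', as in the quoted explanation.
original = "opéra"

# What the file would contain if it had been saved as CP1252.
cp1252_bytes = original.encode("cp1252")
print(hex_bytes(cp1252_bytes))

# Assumption: the damaging tool read those bytes as UTF-8 and replaced
# anything invalid. 0xe9 announces a multi-byte sequence, but 0x72 ('r')
# is not a continuation byte, so 0xe9 is swapped for U+FFFD.
decoded = cp1252_bytes.decode("utf-8", errors="replace")

# Writing the damaged text back out as UTF-8 produces the ef bf bd bytes.
print(hex_bytes(decoded.encode("utf-8")))

# For comparison, a correctly encoded 'é' in UTF-8.
print(hex_bytes("é".encode("utf-8")))
```

The second line printed contains "ef bf bd" where the first had "e9", which matches what the hex view of the EPUB shows.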

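So once a string or file has been corrupted by replacing each problematic character with "EF BF BD", there's no going back to the original data: every accented character ends up as the same three bytes, so the file no longer records which character was there. The only fix is to correct the errors manually (if you know the original language).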
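To illustrate why the damage can't be undone, here is a small Python sketch under the same assumption as before (CP1252 bytes decoded as UTF-8 with replacement). The function name damage and the two sample words are hypothetical:

```python
def damage(text: str) -> str:
    """Simulate the assumed corruption: CP1252 bytes misread as UTF-8."""
    return text.encode("cp1252").decode("utf-8", errors="replace")

# Two different French words that differ only in the accent.
a = damage("dés")
b = damage("dès")

# Both come out as the same string, so nothing in the damaged
# file tells you which accented letter was there originally.
print(a == b)
print(a.encode("utf-8"))
```

Since two different inputs give identical output, no program can map the result back reliably; at best a spell checker or dictionary could suggest candidates for a human to pick from.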

Last edited by Shohreh; 12-24-2024 at 09:02 AM.