MobileRead Forums - View Single Post - Making sense of faulty HTML to EPUB conversion

Shohreh · 12-24-2024, 09:51 AM

Looking at the EPUB in a hex editor, I see that each accented character has been replaced with "EF BF BD".
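The same check can be scripted instead of scrolling through a hex dump. This is only a rough sketch in Python: it assumes the EPUB is a regular zip archive and that the bytes show up in the files stored inside it (they are usually compressed, so a raw dump of the .epub itself may not show them). The helper name `count_replacement_chars` is made up for this example.

```python
import io
import zipfile

# UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER)
REPLACEMENT_BYTES = b"\xef\xbf\xbd"


def count_replacement_chars(epub):
    """Return {member name: number of EF BF BD sequences} for affected members.

    `epub` may be a file path or a binary file-like object.
    Hypothetical helper, not part of any EPUB tool.
    """
    hits = {}
    with zipfile.ZipFile(epub) as archive:
        for name in archive.namelist():
            count = archive.read(name).count(REPLACEMENT_BYTES)
            if count:
                hits[name] = count
    return hits


if __name__ == "__main__":
    # Tiny in-memory archive for demonstration only (not a valid EPUB).
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("good.xhtml", "<p>café</p>".encode("utf-8"))
        z.writestr("bad.xhtml", b"<p>caf\xef\xbf\xbd</p>")
    buf.seek(0)
    print(count_replacement_chars(buf))  # {'bad.xhtml': 1}
```

With a real book you would pass the path to the .epub file instead of the in-memory buffer.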

Here's the explanation: "The sequence "ef bf bd" is UTF-8 for U+FFFD (REPLACEMENT CHARACTER), i.e., a special code that is shown as "�", as mentioned in your question. Therefore, something (Python?) must have replaced the original char with this code. So your terminal appears to be okay.

The 'é' character U+00E9 (LATIN SMALL LETTER E WITH ACUTE) in UTF-8 would read "c3 a9" instead.

It is conceivable that your original subtitle might be encoded as CP1252, where the 'é' is represented by code 0xe9. Since the next byte is 0x72 ('r'), your parser might have interpreted the 0xe9 as an incomplete UTF-8 sequence and therefore replaced the "e9" with "ef bf bd" (REPLACEMENT CHARACTER)." (source)
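The scenario in the quote is easy to reproduce in Python. This is a sketch under the quote's own assumption, namely that the source was CP1252 and got decoded as UTF-8 with invalid bytes replaced. Whether the converter in this thread really did that is a guess. The word "chérie" is just a sample where 'é' is followed by 'r', and the function name is made up for this example.

```python
def decode_as_utf8_with_replacement(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for every invalid byte."""
    return raw.decode("utf-8", errors="replace")


original = "chérie"

# What the file would contain if it had been saved as CP1252.
cp1252_bytes = original.encode("cp1252")
print(cp1252_bytes.hex(" "))              # 63 68 e9 72 69 65

# 0xE9 announces a 3-byte UTF-8 sequence, but 0x72 ('r') is not a
# continuation byte, so the decoder substitutes U+FFFD for the 0xE9.
damaged_text = decode_as_utf8_with_replacement(cp1252_bytes)

# Writing the damaged text back out as UTF-8 gives the bytes seen in the EPUB.
damaged_bytes = damaged_text.encode("utf-8")
print(damaged_bytes.hex(" "))             # 63 68 ef bf bd 72 69 65

# For comparison, the same word correctly encoded as UTF-8.
print(original.encode("utf-8").hex(" "))  # 63 68 c3 a9 72 69 65
```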

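So once a string or file has been corrupted by replacing each problematic character with "EF BF BD", there's no going back to the original data: every damaged character ends up as the same three bytes, so nothing in the file says which character it used to be. The only option is to fix the errors manually (if you know the original language).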
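To see why the damage can't be undone automatically, here is a small Python sketch. It assumes the same CP1252-read-as-UTF-8 mishap as above, and the `damage` function is a made-up helper for illustration. Two different French words come out as identical bytes, so only a reader who knows the language and the context can tell which one was meant.

```python
def damage(text: str) -> bytes:
    """Simulate the corruption: CP1252 bytes decoded as UTF-8 with
    replacement, then written back out as UTF-8. Hypothetical helper."""
    cp1252_bytes = text.encode("cp1252")
    return cp1252_bytes.decode("utf-8", errors="replace").encode("utf-8")


a = damage("pécher")   # "to sin"
b = damage("pêcher")   # "to fish"

print(a.hex(" "))      # 70 ef bf bd 63 68 65 72
print(b.hex(" "))      # 70 ef bf bd 63 68 65 72
print(a == b)          # True
```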

Last edited by Shohreh; 12-24-2024 at 10:02 AM.