MobileRead Forums - View Single Post - Making sense of faulty HTML to EPUB conversion

Shohreh · 12-26-2024, 01:57 PM

Yes, data must be read in binary and with the right decoder.

The extension only supports utf-8 and doesn't throw an error if a web page uses another encoding, e.g. Latin1/iso-8859-1. It's the first time I've had the issue in the weeks I've been using it, so it's no biggie. It was an opportunity to understand how both encodings work.
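To illustrate the kind of silent failure described above, here's a minimal Python sketch. It is not the extension's code (its implementation isn't shown in this thread); the sample string "Café" and the variable names are just assumptions for the demo. It decodes the same Latin-1 bytes with the right decoder, with a lenient utf-8 decoder, and with a strict utf-8 decoder.

```python
# Bytes as they would arrive from a page served in ISO-8859-1 (Latin-1).
raw = "Café".encode("iso-8859-1")  # b'Caf\xe9'

# Right decoder: the text comes back intact.
correct = raw.decode("iso-8859-1")

# Wrong decoder, lenient mode: no error is raised, but the byte that is
# not valid utf-8 is swapped for U+FFFD (the replacement character).
lenient = raw.decode("utf-8", errors="replace")

# Wrong decoder, strict mode: the problem is reported instead of hidden.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(correct, lenient, strict_failed)
```

A lenient decoder is convenient, but it hides the mismatch; a strict one at least tells you the page isn't utf-8.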

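For the curious in the audience, here's how utf-8 treats each byte value:
1. If a byte is worth 0-127 (ASCII), it remains untouched: it's the same single byte in both encodings
2. If it's 128-191 (0x80-0xBF), it's a trailing byte; on its own it's invalid, and decoders usually replace it with U+FFFD, i.e. "�", which is encoded as 0xEFBFBD
3. If it's 194-244 (0xC2-0xF4), it's the leading byte of a two-, three- or four-byte combo; 0xC2 and 0xC3 start the two-byte combos that cover Latin-1's 128-255 range. 192, 193 and 245-255 never appear in valid utf-8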
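A rough way to check how utf-8 treats a given byte value is to classify it by its high bits, following the bit patterns in the UTF-8 description. The function name `utf8_byte_role` and its return labels are made up for this sketch.

```python
def utf8_byte_role(b: int) -> str:
    """Classify a single byte value by the role it can play in utf-8."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ascii"          # 0xxxxxxx: a complete character by itself
    if b <= 0xBF:
        return "continuation"   # 10xxxxxx: only valid after a leading byte
    if b in (0xC0, 0xC1) or b >= 0xF5:
        return "invalid"        # never appears in well-formed utf-8
    if b <= 0xDF:
        return "lead-2"         # 110xxxxx: starts a two-byte sequence
    if b <= 0xEF:
        return "lead-3"         # 1110xxxx: starts a three-byte sequence
    return "lead-4"             # 11110xxx: starts a four-byte sequence


for value in (0x41, 0x89, 0xC3, 0xE9, 0xFF):
    print(hex(value), utf8_byte_role(value))
```

This also shows why Latin-1 text read as utf-8 breaks: a Latin-1 "é" (0xE9) looks like the start of a three-byte sequence, and the bytes that follow it are almost never valid trailing bytes.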

For instance, "É" in ISO-8859-1 is 0xC9 or 11001001 in binary. To convert it to utf-8, the first two bits (11) go at the end of the leading byte, which has the form 110xxxxx (giving 11000011), and the other six bits (001001) go in the trailing byte, which has the form 10xxxxxx (giving 10001001) → 0xC389.
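The bit shuffling in this example can be sketched in a few lines of Python. The helper name `latin1_byte_to_utf8` is hypothetical; in practice Python's built-in codecs do the same job, and the loop at the end only checks the manual version against them.

```python
def latin1_byte_to_utf8(b: int) -> bytes:
    """Re-encode one ISO-8859-1 byte (0-255) as utf-8 by hand."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b < 0x80:
        return bytes([b])                  # ASCII: unchanged
    lead = 0b11000000 | (b >> 6)           # 110000xx: top two bits
    trail = 0b10000000 | (b & 0b00111111)  # 10xxxxxx: low six bits
    return bytes([lead, trail])


e_acute = latin1_byte_to_utf8(0xC9)
print(e_acute.hex())  # c389

# Cross-check every byte value against the built-in codecs.
for value in range(256):
    expected = bytes([value]).decode("iso-8859-1").encode("utf-8")
    assert latin1_byte_to_utf8(value) == expected
```

This works because the 256 Latin-1 values are also the first 256 Unicode code points, so no lookup table is needed.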

https://en.wikipedia.org/wiki/UTF-8#Description

Last edited by Shohreh; 12-26-2024 at 01:59 PM.