#1
Addict
Posts: 205
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Hello,

I'm using a browser extension to convert HTML pages into EPUB files. It works fine when the web page is in utf-8, but it doesn't like web pages encoded in iso-8859-1: accented characters are replaced with question marks (the attached screenshots show the EPUB file opened in SumatraPDF and Sigil). To make matters worse, the extension rewrites the encoding meta line as charset="iso-8859-1".

I'd like to understand why the accented characters are replaced with question marks. Is it a font issue? Or a problem with byte values?

Thank you.

---

Edit: I should have typed "the extension rewrites the encoding meta line as charset="utf-8"".

Last edited by Shohreh; 12-24-2024 at 08:51 AM.
#2
Sigil Developer
Posts: 8,434
Karma: 5702578
Join Date: Nov 2009
Device: many
If the extension writes the EPUB file and adds the meta charset="iso-8859-1", that means the text is encoded that way. It must be read in as iso-8859-1 and properly recoded to utf-8. Sigil can actually detect that and recode it to utf-8, but not if you change that charset meta info or wrongly add an xml header saying the file is utf-8 when it is not.

A better technique is to use a Python script to read each html file in as iso-8859-1 (sometimes called latin-1), recode it, and write it out as utf-8.
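A minimal sketch of that script, assuming the html files sit in the current directory and may be overwritten in place (both assumptions for illustration):

```python
# Re-encode every .html file in the current directory from
# iso-8859-1 (latin-1) to utf-8, overwriting each file in place.
from pathlib import Path

for path in Path(".").glob("*.html"):
    text = path.read_text(encoding="iso-8859-1")  # decode the legacy bytes
    path.write_text(text, encoding="utf-8")       # write back as utf-8
```

Since every byte value 0x00-0xFF is a valid iso-8859-1 character, the decode step can never fail; the damage only happens when the bytes are misread as utf-8.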
#3
Addict
Posts: 205
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Thanks. Until the extension can handle iso-8859-1 in addition to utf-8, I'll see if I can fix it by running the EPUB output through Python.

But why does it display accented characters as ?'s in the original EPUB output file? Editing the EPUB file in Sigil to replace "utf-8" with "iso-8859-1" doesn't solve the problem: now the question marks all turn into "�" :-/

I see that "é" is 0xE9 in ANSI and 0x00E9 in Unicode. So why can't the application display the original character, with just an empty byte before it (0x00)? Does the extension convert 0xE9 to another character that's available in Unicode but not in ANSI?

Last edited by Shohreh; 12-24-2024 at 06:51 AM.
#4
Fanatic
Posts: 515
Karma: 2268308
Join Date: Nov 2015
Device: none
Because it tries to interpret your document as UTF-8, finds invalid multibyte sequences, and translates them into the replacement character, which you then see as "�" when you interpret the document as Latin-1.

You should consider other tools to convert HTML into EPUB.
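The corruption is easy to reproduce (the sample word here is just an illustration): feed latin-1 bytes to a lenient utf-8 decoder and the invalid byte comes back as U+FFFD.

```python
# A latin-1 byte such as 0xE9 ("é") is not a valid UTF-8 sequence on
# its own, so a lenient decoder substitutes U+FFFD for it.
raw = "cybercafé".encode("iso-8859-1")           # b'cybercaf\xe9'
decoded = raw.decode("utf-8", errors="replace")  # 0xE9 is invalid UTF-8 here
print(decoded)  # cybercaf�
```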
#5
Addict
Posts: 205
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Thank you. I don't care about the tool; I wanted to understand why the problem occurs.

So here is what seems to happen under the hood when the extension reads the bytes as utf-8 (because it doesn't support other code pages):

1. Any byte in the range 128-255 is turned into a two-byte code point with a leading 00, so that, for instance, "é" goes from 0xE9 to 0x00E9
2. The reason for the question marks is that the application looks up that two-byte code point in the font instead of the original single-byte code point
#6
Addict
Posts: 205
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Looking at the EPUB in hex mode, I see that each accented character is replaced with "EF BF BD".

Here's the explanation: "The sequence "ef bf bd" is UTF-8 for U+FFFD (REPLACEMENT CHARACTER), i.e., a special code that is shown as "�", as mentioned in your question. Therefore, something (Python?) must have replaced the original char with this code. So your terminal appears to be okay. The 'é' character U+00E9 (LATIN SMALL LETTER E WITH ACUTE) in UTF-8 would read "c3 a9" instead. It is conceivable that your original subtitle might be encoded as CP1252, where the 'é' is represented by code 0xe9. Since the next byte is 0x72 ('r'), your parser might have interpreted the 0xe9 as an incomplete UTF-8 sequence and therefore replaced the "e9" with "ef bf bd" (REPLACEMENT CHARACTER)." (source)

So once a string/file has been corrupted by replacing each problematic character with "EF BF BD"… there's no going back to the original data other than fixing the errors manually (if you know the original language).

Last edited by Shohreh; 12-24-2024 at 09:02 AM.
#7
Sigil Developer
Posts: 8,434
Karma: 5702578
Join Date: Nov 2009
Device: many
Iso-8859-1 is a one-byte-per-char text encoding. It is incompatible with utf-8, which is a multibyte encoding, although the first 128 chars (0-127) do map byte for byte to utf-8. Chars above 127 do not.

Any text editor that opens an iso-8859-1 (latin-1) encoded file and guesses utf-8 will guess wrongly and create a one-way path to encoding hell. There is no way to recover from it without manual editing. Which is why in Python I would open and read the latin-1 file as binary data (bytes), then use Python's "decode" to convert it to a full Unicode string, which you can then encode back to utf-8 bytes and write out as a new binary file.

Last edited by KevinH; 12-24-2024 at 03:20 PM.
#8
Addict
Posts: 205
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Yes, the data must be read in binary mode and with the right decoder.

The extension only supports utf-8 and doesn't throw an error if a web page uses another encoding, e.g. Latin-1/iso-8859-1. It's the first time I've had the issue in the weeks I've been using it, so it's no biggie. It was an opportunity to understand how both encodings work.

For the curious in the audience, here's how a utf-8 decoder treats each byte:

1. If a byte is 0-127 (0x00-0x7F), it's a plain ASCII character and stands alone
2. If it's 128-191 (0x80-0xBF), it's a continuation byte, only valid after a lead byte; on its own it's an error, replaced with the sequence 0xEFBFBD, i.e. "�"
3. If it's 194-244 (0xC2-0xF4), it's the lead byte of a two-, three-, or four-byte sequence and must be followed by the right number of continuation bytes

For instance, "É" in iso-8859-1 is 0xC9, or 11001001 in binary. To encode it as utf-8, the top two bits (11) go into the lead byte (11000011) and the remaining six bits go into the continuation byte (10001001), giving 0xC389.

https://en.wikipedia.org/wiki/UTF-8#Description

Last edited by Shohreh; 12-26-2024 at 12:59 PM.
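The bit arithmetic in the "É" example above can be verified directly:

```python
# Check the two-byte utf-8 encoding of "É" (0xC9 in iso-8859-1 / U+00C9).
cp = 0xC9
lead = 0b11000000 | (cp >> 6)           # 110xxxxx lead byte  -> 0xC3
trail = 0b10000000 | (cp & 0b00111111)  # 10xxxxxx continuation -> 0x89
assert bytes([lead, trail]) == "É".encode("utf-8")  # b'\xc3\x89'
print(hex(lead), hex(trail))  # 0xc3 0x89
```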