#1
Addict
Posts: 205
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Hello,

I'm using a browser extension to convert HTML pages into EPUB files. It works fine when the web page is in utf-8, but it doesn't like web pages encoded in iso-8859-1: accented characters are replaced with question marks (the attached screenshots show the EPUB file opened in SumatraPDF and Sigil). To make matters worse, the extension rewrites the encoding meta line as charset="iso-8859-1".

I'd like to understand why the accented characters are replaced with question marks. Is it a font issue? Or a problem with byte values?

Thank you.

---

Edit: I should have typed "the extension rewrites the encoding meta line as charset="utf-8"".

Last edited by Shohreh; 12-24-2024 at 08:51 AM.
#2
Sigil Developer
Posts: 8,434
Karma: 5702578
Join Date: Nov 2009
Device: many
If the extension writes the EPUB file and adds the meta charset="iso-8859-1", that means the text is encoded that way. It must be read in as iso-8859-1 and properly recoded to utf-8. Sigil can actually detect that and recode it to utf-8, but not if you change that charset meta info or wrongly add an xml header saying the file is utf-8 when it is not.

A better technique is to use a Python script to read each html file in as iso-8859-1 (sometimes called latin-1), recode it, and write it out as utf-8.
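A minimal sketch of that script, assuming the html files sit in the current directory and may be overwritten in place (both assumptions for illustration):

```python
# Re-encode every .html file in the current directory from
# iso-8859-1 (latin-1) to utf-8, overwriting each file in place.
from pathlib import Path

for path in Path(".").glob("*.html"):
    text = path.read_text(encoding="iso-8859-1")  # decode the legacy bytes
    path.write_text(text, encoding="utf-8")       # write back as utf-8
```

Since every byte value 0x00-0xFF is a valid iso-8859-1 character, the decode step can never fail; the damage only happens when the bytes are misread as utf-8.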
#3
Addict
Posts: 205
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Thanks. Until the extension can handle iso-8859-1 in addition to utf-8, I'll see if I can fix it by running the EPUB output through Python.

But why does it display accented characters as ?'s in the original EPUB output file? Editing the EPUB file in Sigil to replace "utf-8" with "iso-8859-1" doesn't solve the problem: now the question marks all turn into "�" :-/

I see that "é" is 0xE9 in ANSI and 0x00E9 in Unicode. So why can't the application display the original character, with just an empty byte before it (0x00)? Does the extension convert 0xE9 to another character that's available in Unicode but not in ANSI?

Last edited by Shohreh; 12-24-2024 at 06:51 AM.
#4
Fanatic
Posts: 515
Karma: 2268308
Join Date: Nov 2015
Device: none
Because it tries to interpret your document as UTF-8, finds invalid multibyte sequences, and translates them into the replacement character, which you then see as "�" when you interpret the document as Latin-1.

You should consider other tools to convert HTML into EPUB.
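The corruption is easy to reproduce (the sample word here is just an illustration): feed latin-1 bytes to a lenient utf-8 decoder and the invalid byte comes back as U+FFFD.

```python
# A latin-1 byte such as 0xE9 ("é") is not a valid UTF-8 sequence on
# its own, so a lenient decoder substitutes U+FFFD for it.
raw = "cybercafé".encode("iso-8859-1")           # b'cybercaf\xe9'
decoded = raw.decode("utf-8", errors="replace")  # 0xE9 is invalid UTF-8 here
print(decoded)  # cybercaf�
```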
#5
Addict
Posts: 205
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Thank you. I don't care about the tool; I wanted to understand why the problem occurs.

So here is what seems to happen under the hood when the extension reads the bytes as utf-8 (because it doesn't support other code pages):

1. Any byte in the range 128-255 is turned into a two-byte code point with a leading 00, so that, for instance, "é" goes from 0xE9 to 0x00E9
2. The reason for the question marks is that the application looks up that two-byte code point in the font instead of the original single-byte code point
#6
Addict
Posts: 205
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Looking at the EPUB in hex mode, I see that each accented character is replaced with "EF BF BD".

Here's the explanation: "The sequence "ef bf bd" is UTF-8 for U+FFFD (REPLACEMENT CHARACTER), i.e., a special code that is shown as "�", as mentioned in your question. Therefore, something (Python?) must have replaced the original char with this code. So your terminal appears to be okay. The 'é' character U+00E9 (LATIN SMALL LETTER E WITH ACUTE) in UTF-8 would read "c3 a9" instead. It is conceivable that your original subtitle might be encoded as CP1252, where the 'é' is represented by code 0xe9. Since the next byte is 0x72 ('r'), your parser might have interpreted the 0xe9 as an incomplete UTF-8 sequence and therefore replaced the "e9" with "ef bf bd" (REPLACEMENT CHARACTER)." (source)

So once a string/file has been corrupted by replacing each problematic character with "EF BF BD"… there's no going back to the original data other than fixing the errors manually (if you know the original language).

Last edited by Shohreh; 12-24-2024 at 09:02 AM.
#7
Sigil Developer
Posts: 8,434
Karma: 5702578
Join Date: Nov 2009
Device: many
Iso-8859-1 is a one-byte-per-char text encoding. It is incompatible with utf-8, which is a multibyte encoding, although the first 128 chars (0-127) do map byte for byte to utf-8. Chars above 127 do not.

Any text editor that opens an iso-8859-1 (latin-1) encoded file and guesses utf-8 will guess wrongly and create a one-way path to encoding hell. There is no way to recover from it without manual editing. Which is why in Python I would open and read the latin-1 file as binary data (bytes), then use Python's "decode" to convert it to a full Unicode string, which you can then encode back to utf-8 bytes and write out as a new binary file.

Last edited by KevinH; 12-24-2024 at 03:20 PM.
#8
Addict
Posts: 205
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Yes, the data must be read in binary mode and with the right decoder.

The extension only supports utf-8 and doesn't throw an error if a web page uses another encoding, e.g. Latin-1/iso-8859-1. It's the first time I've had the issue in the weeks I've been using it, so it's no biggie. It was an opportunity to understand how both encodings work.

For the curious in the audience, here's how a utf-8 decoder treats each byte:

1. If a byte is 0-127 (0x00-0x7F), it's a plain ASCII character and stands alone
2. If it's 128-191 (0x80-0xBF), it's a continuation byte, only valid after a lead byte; on its own it's an error, replaced with the sequence 0xEFBFBD, i.e. "�"
3. If it's 194-244 (0xC2-0xF4), it's the lead byte of a two-, three-, or four-byte sequence and must be followed by the right number of continuation bytes

For instance, "É" in iso-8859-1 is 0xC9, or 11001001 in binary. To encode it as utf-8, the top two bits (11) go into the lead byte (11000011) and the remaining six bits go into the continuation byte (10001001), giving 0xC389.

https://en.wikipedia.org/wiki/UTF-8#Description

Last edited by Shohreh; 12-26-2024 at 12:59 PM.
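The bit arithmetic in the "É" example above can be verified directly:

```python
# Check the two-byte utf-8 encoding of "É" (0xC9 in iso-8859-1 / U+00C9).
cp = 0xC9
lead = 0b11000000 | (cp >> 6)           # 110xxxxx lead byte  -> 0xC3
trail = 0b10000000 | (cp & 0b00111111)  # 10xxxxxx continuation -> 0x89
assert bytes([lead, trail]) == "É".encode("utf-8")  # b'\xc3\x89'
print(hex(lead), hex(trail))  # 0xc3 0x89
```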