MobileRead Forums - View Single Post - Making sense of faulty HTML to EPUB conversion

KevinH · 12-24-2024, 03:03 PM

ISO-8859-1 is a one-byte-per-character text encoding. It is incompatible with UTF-8, which is a multibyte encoding, although the lower 128 characters (the ASCII range, 0 to 127) do map byte for byte to UTF-8. Characters above 127 do not: in UTF-8 each of them takes two bytes.
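To make the byte-level difference concrete, here is a small Python sketch. The sample strings and variable names are just illustrations of mine, not anything from the file under discussion.

```python
# Compare how the same text is stored in Latin-1 and in UTF-8.
ascii_text = "cafe"          # only characters below 128
accented_text = "caf\u00e9"  # ends in e-acute, code point 0xE9 (233)

ascii_latin1 = ascii_text.encode("latin-1")
ascii_utf8 = ascii_text.encode("utf-8")
accented_latin1 = accented_text.encode("latin-1")
accented_utf8 = accented_text.encode("utf-8")

# ASCII-only text is byte-for-byte identical in both encodings.
print(ascii_latin1 == ascii_utf8)

# The accented character is one byte in Latin-1 but two bytes in UTF-8.
print(accented_latin1)  # b'caf\xe9'
print(accented_utf8)    # b'caf\xc3\xa9'
```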

Opening an ISO-8859-1 (Latin-1) encoded file in a text editor that wrongly guesses UTF-8 creates a one-way path to encoding hell. There is no way to recover from it without manual editing.
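A rough illustration of why the damage is one way, assuming the editor behaves like a lenient UTF-8 decoder that substitutes the replacement character for bytes it cannot interpret. Real editors differ, so treat this as a sketch of the failure mode rather than a description of any particular program.

```python
latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'

# Strict UTF-8 decoding fails: 0xE9 announces a multibyte sequence
# that never arrives.
try:
    latin1_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# A lenient decoder swaps the bad byte for U+FFFD instead of failing.
damaged_text = latin1_bytes.decode("utf-8", errors="replace")

# Saving that text as UTF-8 writes the replacement character's bytes.
# The original 0xE9 is gone, so nothing can tell which letter was there.
resaved_bytes = damaged_text.encode("utf-8")
print(strict_decode_failed, repr(damaged_text), resaved_bytes)
```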

Which is why in Python I would open and read the Latin-1 file as binary data (bytes). Then use Python's "decode" to convert it to a full Unicode string, which you can then encode back to UTF-8 bytes and write out to the new file as binary.
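Those steps could look roughly like this. The function name and the paths passed to it are placeholders of mine, and the sketch assumes the input really is Latin-1 throughout.

```python
from pathlib import Path


def latin1_file_to_utf8(source_path, target_path):
    """Re-encode a Latin-1 file as UTF-8, touching it only as bytes."""
    # Read the raw bytes so nothing gets a chance to guess an encoding.
    raw_bytes = Path(source_path).read_bytes()

    # Every byte value is a valid Latin-1 character, so this cannot fail.
    text = raw_bytes.decode("latin-1")

    # Encode the Unicode string as UTF-8 and write it back as binary.
    Path(target_path).write_bytes(text.encode("utf-8"))
    return text
```

You would call it as something like latin1_file_to_utf8("old.html", "new.html"). If the file is HTML or XML that declares its own charset, that declaration still has to be changed to utf-8 separately; the sketch does not do that.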

Last edited by KevinH; 12-24-2024 at 04:20 PM.