MobileRead Forums - View Single Post

user_none · 07-25-2009, 03:00 PM

The file is encoded in cp1252. The original PML is in cp1252 and eReader2html preserves this. The problem is detecting the encoding. cp1252 is a superset of the printable Latin-1 characters, and both share the ASCII range with utf-8, which calibre uses internally. When detecting a file's encoding, only the first few bytes of the file are tested. The first few bytes of a book converted with eReader2html will be <html>. These are plain ASCII characters, valid in Latin-1, cp1252 and utf-8 alike, so the document is treated as utf-8.
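To illustrate why the first few bytes say nothing useful, here is a minimal Python sketch. The helper names (decodes_the_same, is_valid_utf8) and the sample bytes are made up for this post and are not taken from calibre; it only shows that <html> looks identical in all three encodings, while a cp1252 curly quote later in the file is not valid utf-8.

```python
def decodes_the_same(data, encodings):
    """Return True if data decodes to identical text under every encoding."""
    return len({data.decode(enc) for enc in encodings}) == 1


def is_valid_utf8(data):
    """Return True if data can be decoded as strict utf-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The prefix a detector would look at: pure ASCII, so no encoding stands out.
prefix = b"<html>"
print(decodes_the_same(prefix, ("latin-1", "cp1252", "utf-8")))

# Hypothetical body text with cp1252 curly quotes (bytes 0x93 and 0x94).
body = b"\x93quoted\x94"
print(is_valid_utf8(body))
print(body.decode("cp1252"))
```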

There is no good or easy way to determine the actual encoding of the file. There are two options: check for any of a number of cp1252-specific characters within the file, or try decoding with every codepage and see whether it succeeds. Both are time-consuming and wasteful.
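A rough Python sketch of combining the two options, to show what such a check would involve. This is not what calibre does; guess_encoding and has_cp1252_specific_bytes are hypothetical names, and the heuristic assumes the only candidates are utf-8, cp1252 and Latin-1. Note that it has to scan the whole file, which is the cost mentioned above.

```python
def has_cp1252_specific_bytes(data):
    """True if any byte falls in 0x80-0x9F, where cp1252 puts printable
    characters (curly quotes, dashes, ...) and Latin-1 has control codes."""
    return any(0x80 <= byte <= 0x9F for byte in data)


def guess_encoding(data):
    """Heuristic guess among utf-8, cp1252 and latin-1 (assumed candidates)."""
    # Try strict utf-8 first: non-ASCII text in another encoding rarely
    # happens to form valid utf-8 sequences.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Not utf-8: look for bytes that only make sense as cp1252.
    if has_cp1252_specific_bytes(data):
        return "cp1252"
    # latin-1 maps every byte value, so it always "succeeds" and goes last.
    return "latin-1"


print(guess_encoding(b"<html>"))
print(guess_encoding(b"<html>\x93quoted\x94</html>"))
```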

One other option would be to modify eReader2html to encode the file as utf-8.
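As a sketch of what that change amounts to, assuming the converter has the cp1252 bytes in hand before writing the HTML. The function names and paths are hypothetical and this is not eReader2html's actual code.

```python
from pathlib import Path


def cp1252_to_utf8(data):
    """Decode cp1252 bytes and re-encode the same text as utf-8."""
    return data.decode("cp1252").encode("utf-8")


def reencode_file(src, dst):
    """Read a cp1252 file at src and write a utf-8 copy to dst.
    Any charset declaration inside the HTML is left untouched here."""
    Path(dst).write_bytes(cp1252_to_utf8(Path(src).read_bytes()))
```

Once the file is written as utf-8, detection no longer matters for these books, since utf-8 is what gets assumed anyway.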

For the time being you will just have to specify --input-encoding="cp1252" when converting.
