07-25-2009, 02:00 PM   #5
user_none
Sigil & calibre developer
The file is encoded in cp1252. The original PML is in cp1252 and eReader2html preserves this. The problem is detecting the encoding. cp1252 is, for practical purposes, a superset of Latin-1, and both share the ASCII range with utf-8. calibre uses utf-8 internally. When detecting a file's encoding, the first few bytes of the file are tested. The first few bytes of a book converted with eReader2html will be <html>, which is plain ASCII and therefore valid in cp1252, Latin-1, and utf-8 alike, so the document is assumed to be encoded in utf-8.
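
To show why the first few bytes tell you nothing, here is a small Python sketch. The sample bytes and the decodes_as helper are made up for illustration and are not taken from calibre or eReader2html.

```python
# Illustration only: the sample bytes and helper name are hypothetical.
ENCODINGS = ("cp1252", "latin-1", "utf-8")

prefix = b"<html>"              # plain ASCII, as at the start of a converted book
body = b"<p>\x93Hello\x94</p>"  # 0x93 / 0x94 are curly quotes in cp1252


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes without error in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The ASCII prefix is valid in every candidate, so it cannot tell them apart.
prefix_results = {enc: decodes_as(prefix, enc) for enc in ENCODINGS}

# Only once a non-ASCII byte shows up does utf-8 get ruled out.
body_results = {enc: decodes_as(body, enc) for enc in ENCODINGS}
```

The prefix passes for all three encodings. The body still passes for cp1252 and Latin-1 but fails for utf-8, and that difference only appears further into the file.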

There is no good or easy way to determine the actual encoding of the file. The two options are to check for any of a number of cp1252-specific characters within the file, or to try decoding with every codepage and see which succeeds. Both are time consuming and wasteful.
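
Roughly, the two options could look like this in Python. This is only a sketch: the function names and the candidate list are assumptions for illustration, not how calibre actually does detection.

```python
# Hypothetical helpers; names and candidate order are assumptions.
# latin-1 goes last because it never fails to decode.
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def has_cp1252_specific_bytes(data: bytes) -> bool:
    """Option 1: look for bytes in 0x80-0x9F.

    These are printable in cp1252 (curly quotes, dashes, ...) but are
    control codes in Latin-1. Requires a scan of the whole file.
    """
    return any(0x80 <= b <= 0x9F for b in data)


def first_encoding_that_decodes(data: bytes, candidates=CANDIDATES):
    """Option 2: try each codepage in turn, return the first that works."""
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


sample = b"<html>\x93quoted\x94</html>"
plain = b"<html>plain</html>"
```

Neither check is conclusive on its own. Multi-byte utf-8 sequences also contain bytes in the 0x80-0x9F range, and a permissive codepage such as Latin-1 accepts any input, so a successful decode does not prove the guess was right.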

One other option would be to modify eReader2html to encode the file as utf-8.
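
The re-encoding step itself is small. A minimal sketch in Python, assuming the converted HTML is available as raw cp1252 bytes (the function name is made up, and this is not eReader2html's actual code):

```python
def transcode_cp1252_to_utf8(data: bytes) -> bytes:
    """Decode cp1252 bytes and re-encode the text as utf-8."""
    return data.decode("cp1252").encode("utf-8")


# cp1252 curly quotes (0x93 / 0x94) become three-byte utf-8 sequences.
converted = transcode_cp1252_to_utf8(b"<p>\x93Hi\x94</p>")
```

After this the output would be valid utf-8, so no encoding would need to be given when converting.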

For the time being you will just have to specify --input-encoding="cp1252" when converting.