Old 01-13-2017, 09:29 AM   #30
KevinH
Sigil Developer
KevinH ought to be getting tired of karma fortunes by now.
 
Posts: 8,939
Karma: 6361444
Join Date: Nov 2009
Device: many
@bravosx,
Thank you. The bug is therefore in how the plugin determines and handles the encoding. It seems to only work properly with Win1252.

@slowsmile - please do revamp your plugin to properly handle an encoding when one is declared in a meta element of the html file. Read the file in binary (getting bytes) and use re (on bytes) to check for a charset specifier. If one is found, try using that encoding to decode the bytes to a python3 str, and when outputting, encode it as utf-8 (after removing any now-incorrect charset specifiers). Alternatively, try using cchardet or chardet to detect and/or confirm your encoding guess. As I suspected, your approach of always reading a file in extended ascii with an error handler set to encode errors does not work.
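Something along these lines is what I have in mind. It is only a rough sketch, not the plugin's actual code: the function names (decode_html, fix_declared_charset, to_utf8) are made up for illustration, the regex is a simple guess at the common meta forms, and it rewrites the charset specifier to utf-8 rather than removing it. It assumes cchardet or chardet may or may not be installed and falls back to utf-8, then cp1252, if nothing else works. An encoding given only in an XML declaration is not handled here.

```python
import re

# Optional detector: use cchardet or chardet if either is installed.
try:
    import cchardet as chardet
except ImportError:
    try:
        import chardet
    except ImportError:
        chardet = None

# Matches charset=... in both <meta charset="x"> and
# <meta http-equiv="Content-Type" content="text/html; charset=x">.
_CHARSET = r'<meta\b[^>]*?charset\s*=\s*["\']?\s*'
_NAME = r'[A-Za-z0-9._:-]+'
META_CHARSET_BYTES = re.compile((_CHARSET + '(' + _NAME + ')').encode('ascii'),
                                re.IGNORECASE)
META_CHARSET_TEXT = re.compile('(' + _CHARSET + ')' + _NAME, re.IGNORECASE)


def decode_html(data):
    """Decode raw html bytes to str; returns (text, encoding_used)."""
    candidates = []
    # 1. Trust a charset declared in a meta element, if present.
    m = META_CHARSET_BYTES.search(data[:4096])
    if m:
        candidates.append(m.group(1).decode('ascii'))
    # 2. Otherwise (or if that fails) ask the detector for a guess.
    if chardet is not None:
        guess = chardet.detect(data).get('encoding')
        if guess:
            candidates.append(guess)
    # 3. utf-8 as a sensible default.
    candidates.append('utf-8')
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong guess, try the next one
    # Last resort: cp1252 with replacement characters never raises.
    return data.decode('cp1252', errors='replace'), 'cp1252'


def fix_declared_charset(text):
    """Point any meta charset specifier at utf-8 (assumed policy)."""
    return META_CHARSET_TEXT.sub(r'\1utf-8', text)


def to_utf8(data):
    """Bytes in any declared/detected encoding -> utf-8 bytes."""
    text, _ = decode_html(data)
    return fix_declared_charset(text).encode('utf-8')
```

You would read the file with open(path, 'rb'), pass the bytes to to_utf8(), and write the result back with open(path, 'wb'), so Python never gets a chance to guess the encoding on its own.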

Hope this helps,

KevinH