@bravosx,
Thank you. The bug is therefore in how the plugin determines and handles the encoding. It seems to only work properly with Win1252.
@slowsmile - please do revamp your plugin to properly handle encodings when one is provided in a meta element of the HTML file. Read the file in binary (getting bytes) and use re (on bytes) to check for a charset specifier. If one is found, try using that encoding to decode the bytes to a Python 3 str, and when outputting, encode it as utf-8 (after removing any now-incorrect charset specifiers). Alternatively, try using cchardet or chardet to detect and/or confirm your encoding guess. As I suspected, your approach of always reading a file as extended ASCII with an error handler set to encode errors does not seem to work.
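Something along these lines is a minimal sketch of what I mean, not a drop-in fix. The names decode_html and to_utf8_bytes, the regular expressions, and the utf-8 then cp1252 fallback order are all assumptions made for illustration. The sketch also rewrites the charset specifier to utf-8 instead of removing it, which is a simpler variant of the same idea. The chardet/cchardet confirmation step is left out.

```python
import re

# Hypothetical patterns: they match the charset value in either
# <meta charset="..."> or <meta http-equiv=... content="...; charset=...">.
CHARSET_BYTES_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE)
CHARSET_STR_RE = re.compile(
    r'(<meta[^>]*?charset\s*=\s*["\']?\s*)[A-Za-z0-9._:-]+', re.IGNORECASE)


def decode_html(data, fallback="cp1252"):
    """Decode raw HTML bytes to str; return (text, encoding_used)."""
    candidates = []
    match = CHARSET_BYTES_RE.search(data)
    if match:
        # The declared charset is tried first.
        candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")  # assumed second guess
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name, or bytes that are invalid in that codec.
            continue
    # Last resort (an assumption): never fail, mark undecodable bytes.
    return data.decode(fallback, errors="replace"), fallback


def to_utf8_bytes(data):
    """Decode as above, point any meta charset at utf-8, encode as utf-8."""
    text, _ = decode_html(data)
    text = CHARSET_STR_RE.sub(lambda m: m.group(1) + "utf-8", text)
    return text.encode("utf-8")
```

The file would be opened with open(path, "rb") so that data is bytes, and the result of to_utf8_bytes written back with mode "wb".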
Hope this helps,
KevinH