@Doitsu... Thanks for your advice. I've already incorporated locale.getpreferredencoding() into my encoding check function. It all really depends on which method of obtaining the encoding you trust the most. I trust the html meta tag method the most and the chardet method the least, and I use getpreferredencoding() as a fallback. The chardet encoding results are also terribly inaccurate, which is why I'll be testing UnicodeDammit/detwingle today as a possible substitute.
The check encoding function passes the discovered encoding to another function that converts the input file to utf-8. If the discovered encoding is wrong and the converter function fails because of it, that function throws an exception and the plugin app does not continue.
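For anyone following along, a converter of that kind could look roughly like the sketch below. This is not the plugin's actual code; the name convert_to_utf8 and its parameters are made up for illustration. It decodes strictly, so a wrong encoding surfaces as an exception instead of silently corrupting the text.

```python
def convert_to_utf8(src_path, dst_path, encoding):
    """Re-encode src_path as UTF-8 (hypothetical helper, not the plugin's code).

    Decoding is strict, so bytes that are invalid in `encoding` raise
    UnicodeDecodeError and an unknown encoding name raises LookupError.
    The caller can stop on either.
    """
    with open(src_path, 'rb') as src:
        raw = src.read()
    text = raw.decode(encoding)  # strict by default; raises on bad bytes
    # newline='' keeps the original line endings untouched
    with open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        dst.write(text)
    return dst_path
```

One caveat with relying on the exception: iso-8859-1 assigns a character to every byte value, so a wrong guess of iso-8859-1 never raises and just produces garbled text.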
My function (which I'm still testing) now looks like this:
Spoiler:
import locale

import chardet


def checkFileEncoding(file):
    html_encoding = None
    chardet_encoding = ''
    final_encoding = ''
    # get the encoding info from the html meta headers
    # (iso-8859-1 maps every byte, so this read cannot fail on any input)
    with open(file, 'rt', encoding='iso-8859-1', errors='surrogateescape') as f:
        text = f.read(2048).lower()
    if 'charset=windows-1252' in text:
        html_encoding = 'cp1252'
    elif 'charset=windows-1250' in text:
        html_encoding = 'cp1250'
    elif 'charset=windows-1253' in text:
        html_encoding = 'cp1253'
    elif 'charset=windows-1254' in text:
        html_encoding = 'cp1254'
    elif 'charset=windows-1251' in text:
        html_encoding = 'cp1251'
    elif 'charset=windows-1255' in text:
        html_encoding = 'cp1255'
    elif 'charset=windows-1256' in text:
        html_encoding = 'cp1256'
    elif 'charset=windows-1257' in text:
        html_encoding = 'cp1257'
    elif 'charset=us-ascii' in text:
        html_encoding = 'us-ascii'
    elif 'charset=ibm437' in text:
        html_encoding = 'cp437'
    elif 'charset=ibm850' in text:
        html_encoding = 'cp850'
    elif 'charset=ibm852' in text:
        html_encoding = 'cp852'
    elif 'charset=ibm855' in text:
        html_encoding = 'cp855'
    elif 'charset=iso-8859-1' in text:
        html_encoding = 'iso-8859-1'
    elif 'charset=iso-8859-2' in text:
        html_encoding = 'iso-8859-2'
    elif 'charset=iso-8859-4' in text:
        html_encoding = 'iso-8859-4'
    elif 'charset=utf-8' in text:
        html_encoding = 'utf-8'
    else:
        # get the locale encoding, if needed
        html_encoding = locale.getpreferredencoding()
    # now get the file encoding using chardet
    with open(file, 'rb') as f:
        rawdata = f.read(2048)
    result = chardet.detect(rawdata)
    # chardet reports None when it cannot decide; use '' so the
    # string operations below do not fail
    chardet_encoding = result['encoding'] or ''
    print(' >>> html enc...' + html_encoding)
    print(' >>> chardet enc...' + chardet_encoding)
    # compare the html and chardet encodings; the html/locale result wins
    final_encoding = chardet_encoding
    if html_encoding is not None and chardet_encoding.upper() != html_encoding.upper():
        final_encoding = html_encoding
    print(' -- Input file encoding is: ' + final_encoding.upper())
    return final_encoding
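One thing to watch with the substring checks above: they only match the exact form charset=name, so a quoted declaration such as charset="utf-8" falls through to the locale fallback. A possible alternative, offered only as a sketch, is to pull out whatever name follows charset= with a regex and let Python's codec registry resolve it. The names META_CHARSET_RE and sniff_meta_charset are made up for illustration.

```python
import codecs
import re

# Matches charset=NAME with optional quotes,
# e.g. charset=utf-8 or charset="windows-1252"
META_CHARSET_RE = re.compile(
    rb'charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE)


def sniff_meta_charset(path, limit=2048):
    """Return the Python codec name declared near the top of the file, or None."""
    with open(path, 'rb') as f:
        head = f.read(limit)
    match = META_CHARSET_RE.search(head)
    if match is None:
        return None
    declared = match.group(1).decode('ascii')
    try:
        # codecs.lookup resolves aliases such as windows-1252 to cp1252
        return codecs.lookup(declared).name
    except LookupError:
        return None  # declared name is not a codec Python knows
```

A None result would then be the point to fall back to locale.getpreferredencoding(), as the function above already does.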