Old 01-13-2017, 06:23 PM   #37
slowsmile
Witchman
@Doitsu...Thanks for your advice. I've already incorporated locale.getpreferredencoding() into my encoding check function. It all really depends on which method of obtaining the encoding you trust the most. I trust the html meta tag method the most and the chardet method the least, and I use getpreferredencoding() as a fallback. The chardet results are also terribly inaccurate, which is why I'll be testing UnicodeDammit/detwingle today as a possible substitute.

The encoding check function passes the discovered encoding to another function that converts the input file to utf-8. If the discovered encoding is wrong and the conversion fails, the converter function throws an exception and the plugin stops.
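For illustration, here is a minimal sketch of that kind of converter. The name convert_to_utf8 and its parameters are hypothetical and are not the plugin's actual code. It assumes the whole file fits in memory and relies on strict decoding to raise when the bytes don't fit the given encoding.

```python
def convert_to_utf8(src_path, dest_path, encoding):
    """Re-encode src_path as UTF-8 (hypothetical helper, for illustration)."""
    # errors='strict' raises UnicodeDecodeError on bytes that are invalid
    # in the given encoding; an unknown encoding name raises LookupError.
    # newline='' keeps the original line endings untouched.
    with open(src_path, 'rt', encoding=encoding,
              errors='strict', newline='') as src:
        text = src.read()

    with open(dest_path, 'wt', encoding='utf-8', newline='') as dest:
        dest.write(text)

    return dest_path
```

Note that an exception only catches some wrong guesses. A codec like iso-8859-1 maps every possible byte to a character, so decoding with it never fails even when the result is garbage.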

My function (which I'm still testing) now looks like this:

Spoiler:
import locale

import chardet


def checkFileEncoding(file):
    html_encoding = None
    chardet_encoding = ''
    final_encoding = ''

    # get the encoding info from the html meta headers
    # (iso-8859-1 can decode any byte, so this read never fails)
    with open(file, 'rt', encoding='iso-8859-1', errors='surrogateescape') as f:
        text = f.read(2048).lower()

    if 'charset=windows-1252' in text:
        html_encoding = 'cp1252'
    elif 'charset=windows-1250' in text:
        html_encoding = 'cp1250'
    elif 'charset=windows-1253' in text:
        html_encoding = 'cp1253'
    elif 'charset=windows-1254' in text:
        html_encoding = 'cp1254'
    elif 'charset=windows-1251' in text:
        html_encoding = 'cp1251'
    elif 'charset=windows-1255' in text:
        html_encoding = 'cp1255'
    elif 'charset=windows-1256' in text:
        html_encoding = 'cp1256'
    elif 'charset=windows-1257' in text:
        html_encoding = 'cp1257'
    elif 'charset=us-ascii' in text:
        html_encoding = 'us-ascii'
    elif 'charset=ibm437' in text:
        html_encoding = 'cp437'
    elif 'charset=ibm850' in text:
        html_encoding = 'cp850'
    elif 'charset=ibm852' in text:
        html_encoding = 'cp852'
    elif 'charset=ibm855' in text:
        html_encoding = 'cp855'
    elif 'charset=iso-8859-1' in text:
        html_encoding = 'iso-8859-1'
    elif 'charset=iso-8859-2' in text:
        html_encoding = 'iso-8859-2'
    elif 'charset=iso-8859-4' in text:
        html_encoding = 'iso-8859-4'
    elif 'charset=utf-8' in text:
        html_encoding = 'utf-8'
    else:
        # get the locale encoding, if needed
        html_encoding = locale.getpreferredencoding()

    # now get the file encoding using chardet
    with open(file, 'rb') as f:
        rawdata = f.read(2048)
    result = chardet.detect(rawdata)
    # chardet reports None when it cannot guess, so fall back to ''
    chardet_encoding = result['encoding'] or ''

    print(' >>> html enc...' + html_encoding)
    print(' >>> chardet enc...' + chardet_encoding)

    # compare the html and chardet encodings; html_encoding is always set
    # above, so the meta tag (or locale) value wins whenever the two differ
    final_encoding = chardet_encoding
    if html_encoding is not None and chardet_encoding.upper() != html_encoding.upper():
        final_encoding = html_encoding

    print(' -- Input file encoding is: ' + final_encoding.upper())
    return final_encoding
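
The meta tag check could probably be shortened with a regular expression plus codecs.lookup(), which resolves aliases such as windows-1252 to Python's own codec names. This is only a sketch under that assumption. The names CHARSET_RE and meta_charset are made up for illustration, and it accepts any charset Python knows rather than just the ones listed above.

```python
import codecs
import re

# Matches charset=utf-8, charset="utf-8" and charset='utf-8', any case.
# (Hypothetical pattern; it does not parse the HTML properly.)
CHARSET_RE = re.compile(r'charset\s*=\s*["\']?\s*([a-z0-9_.:-]+)', re.IGNORECASE)


def meta_charset(head_text):
    """Return the Python codec name declared in head_text, or None."""
    match = CHARSET_RE.search(head_text)
    if match is None:
        return None          # no charset declaration found
    try:
        # codecs.lookup() normalizes the name, e.g. windows-1252 -> cp1252
        return codecs.lookup(match.group(1)).name
    except LookupError:
        return None          # declared charset is unknown to Python


print(meta_charset('<meta charset="UTF-8">'))
print(meta_charset('content="text/html; charset=windows-1252"'))
```

When this returns None, the caller can fall back to locale.getpreferredencoding() as before. It also handles the quoted HTML5 form (charset="utf-8"), which a plain 'charset=utf-8' substring test would miss.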

Last edited by slowsmile; 01-13-2017 at 06:52 PM.