@KevinH... In my plugin, I've used the method Nick Coghlan recommends (see the section "Files in an ASCII compatible encoding, best effort is acceptable"). My function first opens the user's HTML file as a text file in latin-1 with the 'surrogateescape' error handler. This lets me later fix simple mixed-encoding errors in UTF-8 files, caused by stray windows-1252 or latin-1 characters outside the ASCII range.
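To make that first step concrete, here is a rough sketch of the idea, not the plugin's actual code. The function name `read_mixed_html` and the error-handler name `cp1252_fallback` are made up for illustration, and falling back to windows-1252 for bytes that are not valid UTF-8 is an assumption about what "fixing" means here.

```python
import codecs


def _cp1252_fallback(exc):
    # Hypothetical error handler: bytes that are not valid UTF-8 are
    # assumed to be windows-1252 and decoded as such.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return bad.decode("cp1252", errors="replace"), exc.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def read_mixed_html(path):
    # Open as latin-1 text: every byte maps to exactly one code point,
    # so the read never fails and no data is lost.
    with open(path, "r", encoding="latin-1",
              errors="surrogateescape", newline="") as f:
        raw_text = f.read()
    # Recover the original bytes, then decode as UTF-8, repairing the
    # stray non-UTF-8 bytes through the fallback handler.
    raw_bytes = raw_text.encode("latin-1", errors="surrogateescape")
    return raw_bytes.decode("utf-8", errors="cp1252_fallback")
```

Calling `read_mixed_html("page.html")` (a placeholder file name) returns a normal `str` that can then be written back out as UTF-8.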
The function then reads the HTML file with BeautifulSoup and writes it straight back out, replacing the original file in the working directory. This round trip converts the file to UTF-8: I rely on BeautifulSoup's built-in automatic encoding detection (UnicodeDammit) to work out the source encoding. I've checked the encoding both directly after running this function and after the complete conversion, and the file has always been in UTF-8.
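For anyone following along, the round trip looks roughly like the sketch below. It is only an illustration under a few assumptions: the function name `rewrite_as_utf8` is hypothetical, the file is read as raw bytes here so that the detection has something to work on, and the `html.parser` backend is an arbitrary choice.

```python
from bs4 import BeautifulSoup


def rewrite_as_utf8(path):
    # Read raw bytes so BeautifulSoup (via UnicodeDammit) can detect
    # the source encoding itself.
    with open(path, "rb") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # The encoding the detector settled on; may be None.
    detected = soup.original_encoding
    # Serialize the parsed document back to disk as UTF-8 bytes.
    with open(path, "wb") as f:
        f.write(soup.encode("utf-8"))
    return detected
```

The return value is handy for logging which encoding was detected before the file was rewritten.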
In the past I also tried using chardet with codecs to detect the file encoding, as you advised, but I found that method consistently inaccurate.
I must also confess that I never really considered HTML text containing different languages in my plugin, so I'm definitely willing to learn more about this.