View Single Post
Old 01-12-2017, 07:10 PM   #18
slowsmile
Witchman
slowsmile ought to be getting tired of karma fortunes by now.slowsmile ought to be getting tired of karma fortunes by now.slowsmile ought to be getting tired of karma fortunes by now.slowsmile ought to be getting tired of karma fortunes by now.slowsmile ought to be getting tired of karma fortunes by now.slowsmile ought to be getting tired of karma fortunes by now.slowsmile ought to be getting tired of karma fortunes by now.slowsmile ought to be getting tired of karma fortunes by now.slowsmile ought to be getting tired of karma fortunes by now.slowsmile ought to be getting tired of karma fortunes by now.slowsmile ought to be getting tired of karma fortunes by now.
 
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
@KevinH...In my plugin, I've used Nick Coghlan's advised method (see section: Files in an ASCII compatible encoding, best effort is acceptable). I use a function which initially opens the user html file as a text file in latin-1 using the 'surrogateescape' error handler. This allows me to later fix simple mixed encoding errors in utf-8 caused by windows-1232 or latin-1 that are beyond the ASCII range.

In the function, the html file is then read as a text file using BeautifulSoup and copied straight back out again to replace the original html file in the working directory. Doing this inherently converts the html file to unicode utf-8. I basically rely on BeautifulSoup's in-built and automatic encoding detection using UnicodeDammit to detect and change the html file encoding to unicode utf-8. I've also checked the encoding directly after running this function and the file is always in utf-8. I've also checked the html file after complete conversion and it is always in utf-8.

In the past I've also tried using chardet with codecs to detect file encoding as you have advised. But I have found this method to be consistently poor and inaccurate.

I also must confess that I never really considered html text with different languages being used in my plugin. So I'm definitely willing to learn more about this for sure.
slowsmile is offline   Reply With Quote