MobileRead Forums - View Single Post - [Plugin] OpenDocHTMLImport

slowsmile · 01-12-2017, 08:10 PM

@KevinH... In my plugin I've used the method Nick Coghlan recommends (see the section "Files in an ASCII compatible encoding, best effort is acceptable"). I use a function that first opens the user's html file as a text file in latin-1 with the 'surrogateescape' error handler. This lets me later fix simple mixed-encoding errors in utf-8 text, i.e. stray windows-1252 or latin-1 characters beyond the ASCII range.
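
As a rough illustration of the surrogateescape idea (not the plugin's actual code), here is a minimal sketch. It makes one assumption that differs from the description above: it decodes as utf-8 rather than latin-1, because latin-1 maps every possible byte to a character and so the 'surrogateescape' handler never fires with it. The function name repair_mixed_utf8 is hypothetical, and treating every stray byte as cp1252 (windows-1252) is a best-effort guess.

```python
def repair_mixed_utf8(raw: bytes) -> str:
    """Decode mostly-UTF-8 bytes, treating stray bytes as cp1252 (an assumption)."""
    # Bytes that are not valid UTF-8 are smuggled through as lone
    # surrogates in the range U+DC80..U+DCFF instead of raising an error.
    text = raw.decode("utf-8", errors="surrogateescape")

    repaired = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Recover the original byte and reinterpret it as cp1252.
            # Bytes that cp1252 leaves undefined become U+FFFD.
            original = bytes([code - 0xDC00])
            repaired.append(original.decode("cp1252", errors="replace"))
        else:
            repaired.append(ch)
    return "".join(repaired)


if __name__ == "__main__":
    # Valid UTF-8 followed by cp1252 bytes (0xEF, 0x93, 0x94) in the same file.
    sample = "café ".encode("utf-8") + b"na\xefve \x93quoted\x94"
    print(repair_mixed_utf8(sample))
```

This only handles the simple case where a file is mostly valid utf-8 with a few single-byte strays. It cannot tell cp1252 apart from other single-byte encodings.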

In the function, the html file is then parsed with BeautifulSoup and written straight back out again, replacing the original html file in the working directory. Doing this converts the html file to utf-8 as a side effect: I basically rely on BeautifulSoup's built-in automatic encoding detection (UnicodeDammit) to detect the file's encoding and convert it. I've checked the encoding directly after running this function, and again after the complete conversion, and the file is always in utf-8.
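
A check of the kind described above can be sketched with the standard library alone. This is only an illustration under my own assumptions, not the plugin's code, and the function name is_valid_utf8 is hypothetical. Note that it only confirms the bytes are well-formed utf-8. A file whose encoding was guessed wrongly and then re-saved will still pass, because mis-decoded characters are written out as perfectly valid utf-8.

```python
def is_valid_utf8(path: str) -> bool:
    """Return True if the file's bytes decode cleanly as strict UTF-8."""
    with open(path, "rb") as fh:
        data = fh.read()
    try:
        # Strict decoding raises on any byte sequence that is not UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Pure ASCII files also pass this check, since ASCII is a subset of utf-8.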

In the past I've also tried using chardet with codecs to detect the file encoding, as you advised, but I found its detection to be consistently inaccurate.

I must also confess that I never really considered html text containing several different languages being used with my plugin, so I'm definitely willing to learn more about this.
