On my Windows 8 system, both LibreOffice and OpenOffice export to HTML with the following meta header:
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=windows-1252">
But quite how LO and OO export and encode HTML on other OS platforms like Linux and OS X is unknown to me. I'm guessing that they probably export as utf-8 on Linux and utf-16 on OS X, but I'm not really sure. When I researched UnicodeDammit, I found that it would identify windows-1252, latin-1, ISO/IEC 8859-2 and utf-8 without much trouble. I also found out that UnicodeDammit from bs4 uses the chardet and cchardet modules (if they are installed) as well as the codecs module in its routines.
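To make the idea concrete, here is a rough sketch of reading the declared charset from a meta header like the one above and decoding with a few fallbacks. It uses only the standard library, not UnicodeDammit, so it is a simplified stand-in and not what bs4 does internally. The function names `sniff_declared_charset` and `decode_html` and the fallback order are my own assumptions for illustration.

```python
import codecs
import re

# Looks for charset=... inside a <meta ...> tag (case-insensitive).
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def sniff_declared_charset(raw):
    """Return the charset named in the first matching meta tag, or None."""
    match = _META_CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii").lower() if match else None

def decode_html(raw, fallbacks=("utf-8", "windows-1252")):
    """Decode HTML bytes; returns (text, encoding_used).

    Hypothetical helper: checks for a BOM first, then the declared charset,
    then each fallback in order, and finally latin-1 (which never fails).
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig"), "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16"), "utf-16"
    declared = sniff_declared_charset(raw)
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next one
    return raw.decode("latin-1"), "latin-1"
```

A declared charset can of course be wrong, which is why a statistical detector such as chardet is still useful as a second opinion.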
I also take your point about checking for other weird encodings besides the ones that I've mentioned already. Will look into that and try to implement a fix soon.
I would also completely agree with you about zip supporting utf-8 for file contents. But I was really talking about zip file names. For zip file names, I think you'll find that only the DOS Latin US charset (code page 437) is allowed.
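One way to check what a given archive actually does with its file names is to look at each entry's general purpose flags. As far as I know, bit 11 (0x800) marks a name as UTF-8, and names without it are treated as the legacy DOS charset, though not every zip tool sets the bit. Below is a minimal sketch using Python's zipfile module. The helper names `name_encodings` and `build_sample` are made up for this example.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # bit 11 of the general purpose bit flag

def name_encodings(zip_bytes):
    """Map each member name to 'utf-8' or 'cp437/legacy' based on its flag."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: (
                "utf-8" if info.flag_bits & UTF8_NAME_FLAG else "cp437/legacy"
            )
            for info in zf.infolist()
        }

def build_sample():
    """Build a small in-memory zip with one ASCII and one non-ASCII name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("plain.txt", "ascii name")
        zf.writestr("caf\u00e9.txt", "non-ascii name")
    return buf.getvalue()

if __name__ == "__main__":
    for name, encoding in name_encodings(build_sample()).items():
        print(name, encoding)
```

Running the same check on archives made with 7-Zip or WinRar should show whether those tools set the flag for non-ASCII names with the settings in use.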
I'm mainly using 7-Zip and WinRar for the zip files.
Last edited by slowsmile; 01-13-2017 at 12:44 AM.