Old 01-12-2017, 11:52 PM   #25
slowsmile
Witchman
slowsmile ought to be getting tired of karma fortunes by now.
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
On my Windows 8 system, both LibreOffice and OpenOffice export to HTML with the following meta header:

<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=windows-1252">

How LO and OO export and encode HTML on other OS platforms like Linux and OS X is unknown to me, though. I'm guessing that they probably export as utf-8 on Linux and utf-16 on OS X, but I'm not really sure. When I researched UnicodeDammit, I found that it would identify windows-1252, latin-1, ISO/IEC 8859-2 and utf-8 without much trouble. While researching UnicodeDammit from bs4, I also found out that it uses the chardet and cchardet modules, as well as the codecs module, in its routines.
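
To make the idea concrete, here is a rough stdlib-only sketch of the kind of check I have in mind. It is not UnicodeDammit and not the actual plugin code: the helper names `sniff_declared_charset` and `decode_html` are made up for illustration, and the fallback order (declared charset, then utf-8, then windows-1252, then latin-1) is just my assumption about what is most likely for LO/OO exports.

```python
import re

# Hypothetical helpers for illustration only (not from bs4 or the plugin).
# Matches the charset=... part of a <meta> header like the one LO/OO write.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE)


def sniff_declared_charset(raw: bytes):
    """Return the charset named near the top of the HTML, or None if absent."""
    match = META_CHARSET.search(raw[:2048])
    return match.group(1).decode("ascii").lower() if match else None


def decode_html(raw: bytes):
    """Decode raw HTML bytes and return (text, encoding_used)."""
    candidates = []
    declared = sniff_declared_charset(raw)
    if declared:
        candidates.append(declared)
    # Assumed fallback order; adjust for the encodings you expect to see.
    candidates += ["utf-8", "windows-1252"]
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or a charset name Python does not know: try the next.
            continue
    # latin-1 maps every byte to a character, so this last resort cannot fail.
    return raw.decode("latin-1"), "latin-1"


sample = (b'<META HTTP-EQUIV="CONTENT-TYPE" '
          b'CONTENT="text/html; charset=windows-1252"><p>caf\xe9</p>')
text, used = decode_html(sample)
```

A detector like UnicodeDammit does a lot more than this (it can fall back on chardet-style statistical guessing), so treat the above only as a picture of the declared-charset-then-fallback logic.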

I also take your point about checking for other weird encodings besides the ones that I've mentioned already. Will look into that and try to implement a fix soon.

I would also completely agree with you about zip supporting utf-8 for file contents. But I was really talking about zip file names. For zip file names, I think you'll find that the original format only allows the DOS Latin US charset (code page 437), unless the archiver sets the newer UTF-8 file name flag.
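
If anyone wants to check how a particular archiver stored its file names, something like the sketch below should do it. It relies on Python's `zipfile` module, which reads a name as cp437 unless bit 11 (0x800) of the entry's general purpose flag marks it as UTF-8. The function name `describe_zip_names` is my own invention, and the archive here is built in memory just to keep the example self-contained.

```python
import io
import zipfile


def describe_zip_names(zip_bytes: bytes):
    """Return (filename, name_encoding) for each entry in a zip archive."""
    report = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            # Bit 11 (0x800) of the general purpose flag marks a UTF-8 name;
            # without it the name is treated as DOS Latin US (cp437).
            encoding = "utf-8" if info.flag_bits & 0x800 else "cp437"
            report.append((info.filename, encoding))
    return report


# Build a small test archive in memory with one plain and one accented name.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("chapter1.html", "<p>plain</p>")
    zf.writestr("caf\u00e9.html", "<p>accented</p>")

names = describe_zip_names(buffer.getvalue())
```

To test a real file made with 7-Zip or WinRar, read it with `open(path, "rb").read()` and pass the bytes in. I have not checked here what those tools actually write, so the result may differ between versions and settings.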

I'm mainly using 7-Zip and WinRar for the zip files.

Last edited by slowsmile; 01-13-2017 at 12:44 AM.