Quote:
Originally Posted by slowsmile
... I'm guessing that they probably export as utf-8 on Linux and utf-16 on OSX but not really sure.
|
Mac OS X uses utf-8 for terminals, paths, etc. unless the user has forced it to something else.
Quote:
|
When I researched UnicodeDammit I found that it would identify windows-1252, latin-1, ISO/IEC 8859-2 and utf-8 without much trouble. And while researching UnicodeDammit from bs4 I found out that it also uses the chardet and cchardet modules, as well as the codecs module, in its routines.
|
And as I remember, it will also try to detect the charset meta info if it exists. Try opening the file in python3 in binary mode ('rb') and passing the bytes to UnicodeDammit to see if it properly detects the encodings. It should.
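Something along these lines is what I have in mind. It is only a rough sketch: decode_bytes and decode_file are names I made up for illustration, the candidate list is just an assumption about which encodings you expect, and the plain codecs fallback is only there so the snippet still runs if bs4 is not installed.

```python
def decode_bytes(data, candidates=("utf-8", "windows-1252")):
    """Return (text, encoding_name) for raw bytes.

    Hypothetical helper: the candidate encodings are an assumption,
    adjust them to whatever your files are likely to contain.
    """
    try:
        from bs4 import UnicodeDammit
    except ImportError:
        UnicodeDammit = None  # bs4 not installed, use the fallback below

    if UnicodeDammit is not None:
        # The encodings passed here are tried first, in order,
        # before UnicodeDammit falls back to its own guessing.
        dammit = UnicodeDammit(data, list(candidates))
        if dammit.unicode_markup is not None:
            return dammit.unicode_markup, dammit.original_encoding

    # Fallback without bs4: try each candidate strictly, in order.
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue

    # Last resort: keep going, but mark undecodable bytes.
    return data.decode("utf-8", errors="replace"), None


def decode_file(path):
    """Read a file as raw bytes ('rb') and hand the bytes to decode_bytes."""
    with open(path, "rb") as handle:
        return decode_bytes(handle.read())
```

The important part is that the file is opened as 'rb', so python3 never tries to decode it with the platform default before the detection code sees the bytes.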
Quote:
|
I would also completely agree with you about zip supporting utf-8 for file contents. But I was really talking about zip file names. For zip file names I think you'll find that only the DOS Latin US charset is allowed.
|
No, zip has a flag for the file name encoding as well (again, the name can always be viewed as a sequence of bytes, like on Linux). Using python3's built-in zipfile module should handle all of this automatically, fwiw.
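A quick way to check this for yourself, using only the standard library. The flag in question is bit 11 (0x800) of the general purpose bit flags. The function name and the sample file names are made up for illustration, and the archive is built in memory so nothing touches the disk.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # bit 11 of the zip general purpose bit flags


def utf8_flag_roundtrip(name):
    """Write one entry called `name`, read it back, and report
    (stored_name, utf8_flag_is_set). Hypothetical helper for testing."""
    buffer = io.BytesIO()

    # zipfile encodes non-ASCII names as utf-8 and sets the flag itself.
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(name, "hello")

    # On reading, zipfile checks the flag to decide how to decode the name.
    with zipfile.ZipFile(buffer) as archive:
        info = archive.infolist()[0]
        return info.filename, bool(info.flag_bits & UTF8_NAME_FLAG)


if __name__ == "__main__":
    print(utf8_flag_roundtrip("résumé.txt"))
    print(utf8_flag_roundtrip("plain.txt"))
```

The non-ASCII name should come back unchanged with the flag set, while the plain ASCII name is stored without it.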
Take care,
KevinH