Hi Kovid,
Yes, I read that PEP about variable-size storage for the new strings and looked at their data structures for storing strings under the new formats. It looks like an engineer's nightmare. They have fields for latin-1, utf-8, utf-16, and ucs-4, all stored in two different string structures depending on the size of the largest character, and they use bitfields to store info, etc.
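To make the "depends on the largest character" part concrete, here is a rough sketch. It assumes CPython 3.3 or later (other implementations may behave differently), and bytes_per_char is just a name I made up for illustration. It estimates how many bytes each character costs by comparing the sizes of two strings built from the same character.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage for a string made only of `ch`.

    Assumption: CPython 3.3+ flexible string storage. Subtracting the sizes
    of two strings of the same kind cancels out the fixed object header.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

if __name__ == "__main__":
    for label, ch in [("ASCII", "a"), ("Latin-1", "\u00e9"),
                      ("BMP", "\u20ac"), ("non-BMP", "\U0001F600")]:
        print(label, bytes_per_char(ch))
```

So a single wide character anywhere in a string is enough to change the width used for every character in it.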
And their interface routine decisions are a joke. According to some e-mails, string manipulation slowed down by over 30% whereas storage really wasn't much better. I read an article that said that after simple compression, utf-8 (even for non-BMP text) takes up less space than ucs-4 due to the degree of byte repetition for non-BMP languages. So they could have stuck with utf-8 for Linux/Unix and used little-endian utf-16 on Windows, with surrogate pairs for characters that need the full ucs-4 range. Or even moved all platforms to utf-16. Imagine debugging a buggy program with gdb. You would need gdb macros just to figure out what the string data said!!
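If anyone wants to check the size argument on their own text, something like the following would do it. This is only a sketch: encoded_sizes and the sample string are made up for illustration, zlib stands in for "simple compression", and the numbers will depend heavily on the text you feed it.

```python
import zlib

def encoded_sizes(text):
    """Return {encoding: (raw_bytes, zlib_compressed_bytes)} for `text`."""
    sizes = {}
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        raw = text.encode(enc)
        sizes[enc] = (len(raw), len(zlib.compress(raw)))
    return sizes

# Hypothetical sample: ASCII, Greek (BMP) and one non-BMP character, repeated.
sample = "Hello, \u03ba\u03cc\u03c3\u03bc\u03b5 \U0001F600 " * 200

if __name__ == "__main__":
    for enc, (raw, packed) in encoded_sizes(sample).items():
        print(f"{enc:10s} raw={raw:6d} zlib={packed:5d}")
```

A repeated sample like this one compresses unrealistically well, so a real comparison should use real documents in the languages you care about.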
I must say that I am very unimpressed with many of the developer decisions. But they don't seem to see how silly they are being and how many future bugs and nightmares they are creating with such nonsense. They have forgotten the KISS principle of all good engineers.
As I said before, if a large organization said they would fork Python 2, fix the many longstanding bugs, and make their own new releases, I would stay with Python 2 and forget Python 3. As it stands, I can just hedge my bets by supporting both with one codebase and seeing what happens.
Take care,
KevinH
Quote:
Originally Posted by kovidgoyal
@KevinH: I see you've started discovering the joys of Python 3. Be glad you don't have to port any C extension modules. In Python 2 strings are internally always UTF-16 (except on Linux), which is great because the external libraries (the Windows API, ICU, etc.) all use UTF-16. As of Python 3.3 a Python string can be any of ASCII, UCS2 or UCS4, depending on its contents. So now every time you call any external API function with a Python string, you have to inspect and convert it. Joy, joy, joy.
And if you thought that dealing with binary file formats was bad, think about all the network-facing code -- all network protocols are binary. I really don't know what the Python 3 devs were smoking. Thank heavens Python is open source and I can continue using Python 2 for a long, long time. Hopefully, I can retire before it becomes necessary to port calibre from Python 2.