MobileRead Forums - View Single Post - Unicode & mi.set_user_metadata('#customxxx', custcol)

kovidgoyal · 07-23-2014, 12:03 PM

You're conflating two different issues. First, utf-8 is completely irrelevant if your data is already a unicode object. If it is a unicode object, then you don't need to do anything; calling set() will work.

If it is not working, either u_genre is not what you think it is, or your column type is not what you think it is.

Use print(repr(u_genre)) to see what u_genre is. Custom columns are perfectly capable of displaying unicode characters from the entire BMP.
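A minimal sketch of that check, in case it helps. The describe() helper and the sample value are made up for illustration; they are not part of calibre. It is written so that it also runs on Python 3, where the unicode type is simply called str.

```python
def describe(value):
    """Say which kind of string `value` really is, together with its repr."""
    if isinstance(value, bytes):
        # Raw bytes: must be decoded before being used as text.
        return "bytes: %r" % (value,)
    if isinstance(value, type(u"")):
        # type(u"") is `unicode` on Python 2 and `str` on Python 3.
        return "unicode: %r" % (value,)
    return "other (%s): %r" % (type(value).__name__, value)


# Hypothetical genre value, standing in for the u_genre from the question.
u_genre = u"Science-fiction fran\u00e7aise"

print(describe(u_genre))                  # reported as unicode
print(describe(u_genre.encode("utf-8")))  # the same text as raw bytes
```

If the first word printed is "bytes", the value still needs a decode() before it is handed to set().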

And now for a small lecture on strings in python.

Python has two string types.

1) The bytes object. This, as its name implies, is simply an array of raw bytes (numbers between 0 and 255). This array may or may not be in any definite encoding. If you plan to use it as text, you should always decode it first by calling decode(whatever_encoding_you_think_it_is_in).

2) The unicode object. This is an array of unicode codepoints, each of which is a number from 0 to 65535. Each of these numbers represents a single character from the BMP (Basic Multilingual Plane) of the unicode standard. (There are actually more complications for non-BMP unicode text, diacritics, glyphs, etc. that I won't get into here.) Internally, the unicode object is represented as either UTF-16 or UCS-4 (also known as UTF-32), depending on the compilation options of python. On windows it is always UTF-16.
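To make the difference between the two types concrete, here is a small sketch. The sample word and the choice of utf-8 are assumptions for the example; your data may be in some other encoding.

```python
# The word "café" as raw UTF-8 bytes: 4 characters, but 5 bytes.
raw = b"caf\xc3\xa9"

# Decoding turns the array of bytes into an array of codepoints.
text = raw.decode("utf-8")

byte_values = list(bytearray(raw))     # numbers between 0 and 255
codepoints = [ord(ch) for ch in text]  # one number per character

print(len(raw), byte_values)
print(len(text), codepoints)

# Guessing the wrong encoding does not fail here; it silently gives
# different text (5 characters instead of 4).
wrong = raw.decode("latin-1")
print(len(wrong), repr(wrong))
```

The last three lines show why the encoding you pass to decode() matters: latin-1 accepts any byte sequence, so a wrong guess produces garbage rather than an error.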

Everywhere you deal with text in calibre, you must ensure you always use unicode objects, not bytestrings (the bytes objects from (1)).
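One way to follow that rule is to decode once, at the point where the data enters your code. The sketch below is only an illustration: to_unicode() is a hypothetical helper, not a calibre function, and the utf-8 default is an assumption about where the bytes came from.

```python
def to_unicode(value, encoding="utf-8"):
    """Return `value` as a unicode object, decoding it only if it is bytes."""
    if isinstance(value, bytes):
        # "replace" substitutes U+FFFD for undecodable bytes instead of raising.
        return value.decode(encoding, "replace")
    # Already unicode: nothing to do.
    return value


genre_from_file = b"Po\xc3\xa9sie"          # hypothetical bytes read from disk
genre_from_code = u"Po\u00e9sie"            # already a unicode object

# Both end up as the same unicode object.
print(to_unicode(genre_from_file) == to_unicode(genre_from_code))
```

Once everything past that point is unicode, no further encoding or decoding is needed before calling set().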

Last edited by kovidgoyal; 07-23-2014 at 01:11 PM.