You're conflating two different issues. First, UTF-8 is completely irrelevant if your data is already a unicode object. If it is a unicode object, you don't need to do anything; calling set() will work.
If it is not working, either u_genre is not what you think it is, or your column type is not what you think it is.
Use print(repr(u_genre)) to see what u_genre is. Custom columns are perfectly capable of displaying unicode characters from the entire BMP.
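As a rough illustration of that check, the sketch below prints the repr and the kind of two stand-in values. The sample values and the helper name describe() are made up for the example; only the name u_genre comes from your code. In Python 3 the unicode type is spelled str, so the helper compares against type(u'') to work on either version.

```python
def describe(value):
    """Return a short label saying whether value is text or raw bytes."""
    if isinstance(value, type(u'')):   # unicode on Python 2, str on Python 3
        return 'unicode'
    if isinstance(value, bytes):       # bytes is an alias of str on Python 2
        return 'bytes'
    return type(value).__name__

# Hypothetical stand-ins for u_genre: the same word as text and as UTF-8 bytes.
as_text = u'Po\xe9sie'
as_bytes = b'Po\xc3\xa9sie'

for u_genre in (as_text, as_bytes):
    print('%s -> %s' % (repr(u_genre), describe(u_genre)))
```

If the second kind of line shows up for your real u_genre, the value is a bytestring and needs decoding before you call set().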
And now for a small lecture on strings in python.
Python has two string types.
1) The bytes object. This, as its name implies, is simply an array of raw bytes (numbers between 0 and 255). This array may or may not be in any definite encoding. If you plan to use it as text, you should always decode it first by calling decode(whatever_encoding_you_think_it_is_in).
2) The unicode object. This is an array of unicode codepoints, that is, numbers from 0 to 65535. Each of these numbers represents a single character from the BMP (Basic Multilingual Plane) of the unicode standard. (There are actually more complications for non-BMP unicode text, diacritics, glyphs, etc. that I won't get into here.) Internally, the unicode object is represented as either UTF-16 or UCS-4 (also known as UTF-32), depending on the compilation options of python. On Windows it is always UTF-16.
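To make the difference between the two types concrete, here is a small sketch. The sample word and the choice of UTF-8 as the encoding are assumptions for illustration; the b'' and u'' style literals used here behave the same way on Python 2.6+ and Python 3.3+.

```python
raw = b'caf\xc3\xa9'          # bytes object: five raw bytes, here UTF-8 encoded
text = raw.decode('utf-8')    # unicode object: four characters

byte_values = list(bytearray(raw))      # the raw numbers, each between 0 and 255
codepoints = [ord(ch) for ch in text]   # the unicode codepoints

print(byte_values)   # [99, 97, 102, 195, 169]
print(codepoints)    # [99, 97, 102, 233]
```

Note that the single character with codepoint 233 takes two bytes (195, 169) in UTF-8, which is why the two objects have different lengths.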
Everywhere you deal with text in calibre, you must ensure you always use unicode objects, not bytestrings (the bytes objects from (1)).
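One way to follow that rule is to normalise values at the point where they enter your code. The helper below is only a sketch: ensure_unicode() is a hypothetical name, not a calibre API, and the UTF-8 default is a guess that you should replace with whatever encoding your data really uses.

```python
def ensure_unicode(value, encoding='utf-8'):
    """Return value as a unicode object, decoding it if it is a bytestring."""
    if isinstance(value, bytes):
        # 'replace' substitutes U+FFFD for undecodable bytes instead of raising,
        # which keeps going if the encoding guess is wrong.
        return value.decode(encoding, 'replace')
    return value

# Hypothetical input: a genre that arrived as a bytestring.
u_genre = ensure_unicode(b'Science Fiction')
print(repr(u_genre))
```

Values that are already unicode objects pass through unchanged, so it is safe to call this on anything you are about to hand to set().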
Last edited by kovidgoyal; 07-23-2014 at 01:11 PM.