Old 07-23-2014, 12:03 PM   #14
kovidgoyal
creator of calibre
You're conflating two different issues. First, UTF-8 is completely irrelevant if your data is already a unicode object. If it is a unicode object, then you don't need to do anything; calling set() will work.

If it is not working, either u_genre is not what you think it is, or your column type is not what you think it is.

Use print(repr(u_genre)) to see what u_genre is. Custom columns are perfectly capable of displaying unicode characters from the entire BMP.
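
A minimal sketch of that check. The sample values and the is_text() helper are made up for illustration; in a real plugin u_genre comes from your own code. The b'' and u'' prefixes are there so it runs on both Python 2 and Python 3 (where the unicode type is simply called str).

```python
# Hypothetical sample values standing in for u_genre.
as_bytes = b'Po\xc3\xa9sie'   # UTF-8 encoded bytes, not text yet
as_text = u'Po\xe9sie'        # a unicode object

def is_text(value):
    # True only for the unicode type (unicode on Python 2, str on Python 3).
    return isinstance(value, type(u''))

for u_genre in (as_bytes, as_text):
    # repr() shows the raw form, so you can tell bytes and unicode apart.
    print(repr(u_genre))
    print(is_text(u_genre))
```

If the repr shows escaped bytes like \xc3\xa9 where you expected a single character, the value is probably still a bytestring and needs decoding before you pass it on.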


And now for a small lecture on strings in python.

Python has two string types.

1) The bytes object. This, as its name implies, is simply an array of raw bytes (numbers between 0 and 255). This array may or may not be in any definite encoding. If you plan to use it as text, you should always decode it first by calling decode(whatever_encoding_you_think_it_is_in).
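
To illustrate why the encoding you pass to decode() matters, here is a small sketch with a made-up byte string. The bytes happen to be UTF-8; decoding them as Latin-1 does not fail, it just silently gives the wrong text.

```python
raw = b'caf\xc3\xa9'            # five raw bytes, UTF-8 for "cafe" with an accented e

right = raw.decode('utf-8')     # correct guess: four characters
wrong = raw.decode('latin-1')   # wrong guess: no error, but five characters of garbage

print(len(raw))
print(repr(right))
print(repr(wrong))
```

The bytes object has no way of knowing which guess is right. Only you know where the data came from.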

2) The unicode object. This is an array of unicode codepoints, each a number from 0 to 65535. Each of these numbers represents a single character from the BMP (Basic Multilingual Plane) of the unicode standard. (There are actually more complications for non-BMP unicode text, diacritics, glyphs, etc. that I won't get into here.) Internally, the unicode object is represented as either UTF-16 or UCS-4 (also known as UTF-32), depending on the compilation options of python. On Windows it is always UTF-16.
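
A short sketch of the codepoint view, using an arbitrary two-character string (a euro sign followed by the digit 5). It also shows that the same codepoints turn into different bytes depending on which encoding you pick, which is why the internal representation is not something you should rely on.

```python
text = u'\u20ac5'                        # euro sign, then the digit 5

codepoints = [ord(ch) for ch in text]    # one number per character
in_bmp = all(cp <= 0xFFFF for cp in codepoints)

as_utf8 = text.encode('utf-8')           # 3 bytes for the euro sign, 1 for the digit
as_utf16 = text.encode('utf-16-le')      # 2 bytes per BMP character

print(codepoints)
print(in_bmp)
print(repr(as_utf8))
print(repr(as_utf16))
```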

Everywhere you deal with text in calibre, you must ensure you always use unicode objects, not bytestrings (the bytes objects from (1)).
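
One way to follow that rule is to convert at the point where data enters your code. The helper below is a hypothetical sketch; force_text is a name invented here, not a calibre function, and the utf-8 default is an assumption you should replace with whatever encoding your data is really in.

```python
def force_text(value, encoding='utf-8'):
    # Hypothetical helper, not part of calibre.
    # Bytestrings are decoded; unicode objects are returned unchanged.
    if isinstance(value, bytes):
        # Raises UnicodeDecodeError if the encoding guess is wrong for this data.
        return value.decode(encoding)
    return value

from_file = force_text(b'Fantasy')       # bytes in, unicode out
already_text = force_text(u'Fantasy')    # unicode in, same object out

print(repr(from_file))
print(from_file == already_text)
```

Decoding strictly (the default for decode()) means a bad encoding guess fails loudly instead of storing damaged text.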

Last edited by kovidgoyal; 07-23-2014 at 01:11 PM.