Quote:
Originally Posted by kovidgoyal
You're conflating two different issues. First, utf-8 is completely irrelevant if your data is already a unicode object. If it is a unicode object, then you don't need to do anything, calling set() will work.
If it is not working, either u_genre is not what you think it is, or your column type is not what you think it is.
Use a print (repr(u_genre)) to see what u_genre is. Custom columns are perfectly capable of displaying unicode characters from the entire BMP.
And now for a small lecture on strings in python.
Python has two string types.
1) The bytes object. This, as its name implies, is simply an array of raw bytes. This array may or may not be in any definite encoding. If you plan to use it as text, you should always decode it by using decode(whatever_encoding_you_think_it_is_in)
2) The unicode object. This is an array of unicode codepoints. That is a number from 0 to 65535. Each of these numbers represents a single character from the BMP (Basic Multilingual Plane) of the unicode standard. (there are actually more complications for non-BMP unicode text that I won't get into here). Internally, the unicode object is represented as either UTF-16 or UCS-4 (also known as UTF-32), depending on the compilation options of python. On windows it is always UTF-16.
Everywhere you deal with text in calibre, you must ensure you always use unicode objects, not bytestrings (the bytes objects from (1))
Dear Kovid,
I assiduously follow the Python 2.7 best practices of "decode early, unicode everywhere, encode late".
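In concrete terms, here is a minimal sketch of what I mean by that; the file name and the code are illustrative, not my actual plugin code:
Code:
# -*- coding: utf-8 -*-
# Minimal "decode early, unicode everywhere, encode late" sketch (Python 2.7).
# The file name 'genres.csv' is illustrative, not my real file.
import io

# decode early: io.open() decodes the raw bytes to unicode as it reads
with io.open('genres.csv', 'r', encoding='utf-8') as f:
    rows = [line.rstrip(u'\r\n').split(u',') for line in f]

# unicode everywhere: every cell is a unicode object from here on
assert all(isinstance(cell, unicode) for row in rows for cell in row)

# encode late: only encode back to UTF-8 bytes at the output boundary
with io.open('genres_out.csv', 'w', encoding='utf-8') as f:
    for row in rows:
        f.write(u','.join(row) + u'\n')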
Thanks for the lecture. Very well received.
The data comes from a file encoded in "Unicode-Subset-UTF8" and remains in "Unicode-Subset-UTF8" throughout. See the attached image showing the file's encoding.
Also attached is an image of "Unicode-Subset-UTF8" data read from the .csv file just above and inserted immediately and directly into metadata.db. Note the non-ASCII characters, such as those in não-ficção; ASCII cannot represent them, only Unicode can.
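As a quick sanity check (using a throwaway in-memory table, not calibre's real metadata.db schema), the same non-ASCII value survives a SQLite round trip intact:
Code:
# -*- coding: utf-8 -*-
# Sanity check: não-ficção survives a SQLite round trip (Python 2.7).
# Throwaway in-memory table, NOT calibre's real metadata.db schema.
import sqlite3

u_genre = u'N\xe3o-fic\xe7\xe3o'          # não-ficção as a unicode object

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute(u'CREATE TABLE genres (value TEXT)')
cur.execute(u'INSERT INTO genres (value) VALUES (?)', (u_genre,))

cur.execute(u'SELECT value FROM genres')
fetched = cur.fetchone()[0]
print isinstance(fetched, unicode)        # True: sqlite3 hands back unicode
print fetched == u_genre                  # True: nothing was mangled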
I am using "Unicode-Subset-UTF8" because UTF-8 is by far the most common Unicode encoding. (To be precise, UTF-8 can encode every Unicode code point, all one million plus of them; the roughly 65,000-character limit belongs to the BMP, and everything I am dealing with sits comfortably inside the BMP anyway.) UTF-8 on disk is still "unicode" once decoded, and it returns True from "if isinstance(xxxxx, unicode)".
An image of the custom column configuration in Calibre > Preferences is also attached. It is simply Text.
An image of sqlite table custom_columns within metadata.db is also attached, showing exactly what metadata.db has for the custom column.
The "if isinstance(xxxxx, unicode) test you had me do several posts ago proves beyond a shadow of a doubt that the data is unicode. It is also intuitively obvious that u'N\xe3o-fic\xe7\xe3o' is unicode, anyway, given its structure. It prints everywhere that can handle virtually any code page above ascii as não-ficção. Python 2.7 IDLE prints it properly, of course.
I know that the Calibre gui handles "Unicode-Subset-UTF8", because I have copied and pasted não-ficção into the Tag field for a few test ebooks. Ditto for book Title and book Author. See the attached image.
I know that SQLite handles "Unicode-Subset-UTF8". See the attached image that proves that.
Per your request, I did a repr(u_genre) in IDLE. See the attached image. It returned: "u'N\\xe3o-fic\\xe7\\xe3o'" (repr wrapped it in double quotes).
Per your request, I also did a repr(u_genre) in my plugin in ui.py. See the attached image. There, the output is the value wrapped in the standard u'xxxxxxxxxxxx' form: u'N\xe3o-fic\xe7\xe3o'.
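If I understand repr() correctly, the doubled backslashes in IDLE come from the interactive prompt echoing the repr of repr's own return value; printing it directly gives the single-backslash form I see in the plugin:
Code:
>>> u_genre = u'N\xe3o-fic\xe7\xe3o'
>>> repr(u_genre)            # the prompt echoes the repr of this result, hence the doubling
"u'N\\xe3o-fic\\xe7\\xe3o'"
>>> print repr(u_genre)      # printed directly, the backslashes are single
u'N\xe3o-fic\xe7\xe3o'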
I tried wrapping the utf8 variable in double quotes prior to passing it to mi.set_user_metadata('#genre', custcol), but all that did was produce a unicode object wrapped in double quotes. Better than what repr() did, though, since there was only a single u'.
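For context, and assuming I have the API right, this is roughly how my plugin makes the call; mi is the book's Metadata object from the plugin context, and I am assuming (as my plugin does) that the value travels in the '#value#' slot of the column's metadata dict:
Code:
# Roughly what ui.py does; 'mi' comes from the plugin context.
u_genre = u'N\xe3o-fic\xe7\xe3o'

custcol = mi.get_user_metadata('#genre', make_copy=True)   # copy of the column's metadata dict
custcol['#value#'] = u_genre

print repr(custcol['#value#'])      # u'N\xe3o-fic\xe7\xe3o' immediately before the call
mi.set_user_metadata('#genre', custcol)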
So, in short, I have pure "Unicode-Subset-UTF8" data that started out as UTF-8, was INSERTed into metadata.db by SQLite, and was then fetched back from metadata.db by SQLite and put into a Python dictionary exactly as it arrived. That dictionary is the source of u_genre and the book it belongs to, and as the "if isinstance(u_genre, unicode)" test proves, the values were still unicode objects when they were handed to mi.set_user_metadata.
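Sketched out, the fetch side looks roughly like this; the query and table name below are simplified stand-ins, not calibre's real custom-column schema:
Code:
# Simplified sketch of building the dictionary; the query is a stand-in.
import sqlite3

conn = sqlite3.connect('metadata.db')                  # path is illustrative
cur = conn.cursor()
cur.execute('SELECT book, value FROM my_genre_view')   # hypothetical table/view name
genre_by_book = dict(cur.fetchall())

# every text value sqlite3 returns is already a unicode object
print all(isinstance(v, unicode) for v in genre_by_book.itervalues())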
What does this all mean? It appears that mi.set_user_metadata is turning a unicode object into a byte string before it is stored in metadata.db. See the attached image of the raw SQLite data in metadata.db. The values there are identical to the debug print output in ui.py immediately prior to the mi.set_user_metadata statement.
There is a lot of objective proof here that mi.set_user_metadata is not unicode/UTF-8 friendly, and UTF-8 is by far the most common Unicode encoding, at least according to the folks at https://docs.python.org/2/howto/unicode.html , who say:
UTF-8 has several convenient properties:
* It can handle any Unicode code point.
* A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
* A string of ASCII text is also valid UTF-8 text.
* UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
* If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.
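As a quick check of the first property in that list, the value in question (and even a code point outside the BMP) round-trips through UTF-8 unchanged:
Code:
# -*- coding: utf-8 -*-
u_genre = u'N\xe3o-fic\xe7\xe3o'
raw = u_genre.encode('utf-8')         # bytes on disk: 'N\xc3\xa3o-fic\xc3\xa7\xc3\xa3o'
print raw.decode('utf-8') == u_genre  # True
# UTF-8 also handles code points beyond the BMP:
print u'\U0001F600'.encode('utf-8').decode('utf-8') == u'\U0001F600'   # True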
In summary, there is a lot of objective proof attached to this post that mi.set_user_metadata likely should be enhanced to handle unicode/UTF-8 data correctly.