Unicode & mi.set_user_metadata('#customxxx', custcol)

DaltonST · 07-23-2014, 07:05 AM

Class Metadata in src>calibre>ebooks>metadata>book>base.py has nothing special regarding utf-8 encoding of new values for custom columns updated via custcol['#value#'] = u_xxxxxx and then mi.set_user_metadata('#customxxx', custcol). My dictionary used for updating custcol['#value#'] = u_xxxxxx is pure utf-8. However, after updating, the gui shows everything as utf-8 literals instead of the equivalent of what would be displayed if it were to be printed. Example: u_xxxxxx might be u'N\xe3o-fic\xe7\xe3o' , which should appear to the user in the gui as Não-ficção. Instead, it appears to the user as u'N\xe3o-fic\xe7\xe3o' . Is there any special encoding or method() or other syntax required for custcol['#value#'] = u_xxxxxx in order to force mi.set_user_metadata to update itself in a manner suitable for display?

Thanks in advance for your help.

kovidgoyal · 07-23-2014, 07:20 AM

If your custom column stores multiple values like tags, you need to set the value to a list of values not a string.

And if you want to set the value of a field, use set() not set_user_metadata(). Like this:

mi.set('#custcol', ['value1', 'value2'])

DaltonST · 07-23-2014, 07:56 AM

Kovid,

No, single value in single custom column per single book in the gui. My question is not how to update a custom column for a single value. It is how to provide it the correct format using the correct syntax so it displays the new value in human format, not utf-8 format. The mi.set_user_metadata('#customxxx', custcol) updates perfectly except for showing u'N\xe3o-fic\xe7\xe3o' instead of Não-ficção in the gui.

kovidgoyal · 07-23-2014, 08:05 AM

I showed you the syntax above; for a single-valued column you use

mi.set('#custcol', u'N\xe3o-fic\xe7\xe3o')

DaltonST · 07-23-2014, 08:17 AM

Kovid,

There was no difference in output on the display screen. The gui still shows the same unicode literal u'N\xe3o-fic\xe7\xe3o' after changing to using mi.set(xxx instead of mi.set_user_metadata(xxx. Both update the custom column perfectly. Identical output. Again, my question is not about how to get it updated in general, but rather how to get it updated so the human being sitting at the Calibre gui does not see u'N\xe3o-fic\xe7\xe3o', but instead sees the "real" word of Não-ficção. The gui shows the unicode string, not the human-readable string. Is there special syntax to make Calibre show the value in human readable form instead of the original unicode string?

Please see the attached image.

kovidgoyal · 07-23-2014, 08:38 AM

Do this:

mi.set('#custcol', [u'N\xe3o-fic\xe7\xe3o'])

DaltonST · 07-23-2014, 08:50 AM

Now nothing is updated, or perhaps a null was updated. mi.set_user_metadata('#genre', custcol) at least updates the unicode literal, as undesirable as that may be. This is the exact code as of this moment:

for book, genre in book_new_genre_final.items(): #dictionary utf-8
    mi = Metadata(_('Unknown'))
    u_book = book #still in utf8
    u_genre = genre #still in utf8
    n_book = int(u_book)
    if u_book > 0 and u_genre > ' ':
        custcol = custom_columns['#genre']
        custcol['#value#'] = u_genre
        mi.set('#genre', [u_genre])
        # mi.set_user_metadata('#genre', custcol)
        books_updated.append(n_book)
        id_map[n_book] = mi
payload = books_updated
edit_metadata_action = self.gui.iactions['Edit Metadata']
edit_metadata_action.apply_metadata_changes(id_map , callback=self._finish_displaying_results(payload))
else:
xxxxxxxx

kovidgoyal · 07-23-2014, 09:20 AM

There is nothing special you need to do to set unicode values. unicode objects in python are not utf-8, they are either utf-16 or ucs-4 depending on compile time options.

Where are you getting u_genre from?

If you want to set a unicode value pass in a unicode string not a bytestring. That means,

Code:

if isinstance(u_genre, bytes):
  u_genre = u_genre.decode('utf-8')
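
A minimal sketch of the check above, wrapped as a reusable helper. The name `ensure_text` is hypothetical and is not part of calibre, and the sketch assumes that any byte input is UTF-8 encoded, which the thread does not confirm.

```python
def ensure_text(value, encoding='utf-8'):
    """Return value as a text (unicode) object, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        # Assumption: byte input is encoded with `encoding` (UTF-8 by default).
        return value.decode(encoding)
    # Already text (or some other type): hand it back unchanged.
    return value


raw = b'N\xc3\xa3o-fic\xc3\xa7\xc3\xa3o'   # the word as UTF-8 bytes
text = u'N\xe3o-fic\xe7\xe3o'               # the same word as a text object
print(ensure_text(raw) == ensure_text(text))
```

Calling the helper once, at the point where data enters the plugin, keeps every later step working on text objects only.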

DaltonST · 07-23-2014, 09:28 AM

u_genre is pure utf-8. That is why it is showing up as a utf-8 "string", u'N\xe3o-fic\xe7\xe3o'. It never was a byte string, ever.

I have reverted to using mi.set_user_metadata('#genre', custcol) , and the gui shows u'N\xe3o-fic\xe7\xe3o' again instead of a null. That was my motivation for starting this thread in the first place.

u_genre was never not a unicode string. It was never anything except utf-8. It was imported via a .csv file encoded as Unicode(utf-8). I use a drop-in replace for Python 2.x csv dictreader that totally supports utf-8.

The header of my .py files all say : # -*- coding: utf-8 -*-

All of the plugins I have seen use # -*- coding: utf-8 -*-.

Are you saying it should be # -*- coding: utf-16 -*- ?

kovidgoyal · 07-23-2014, 09:30 AM

Again, unicode objects in python are not utf-8, they are either utf-16 or ucs-4 depending on compile time options.

Do this:

Code:

if isinstance(u_genre, bytes):
  u_genre = u_genre.decode('utf-8')

before you use u_genre anywhere.

And from the sound of it "totally supports utf-8" is totally incorrect, if it is yielding bytestrings from the csv file instead of unicode objects.
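
As a rough illustration of decoding at the boundary when importing a CSV file, here is a sketch written for Python 3, whose csv module works on text (the Python 2.7 csv module discussed in this thread does not accept unicode input). The function name `read_genres` and the column names `book` and `genre` are hypothetical; the actual file layout is not shown in the thread.

```python
import csv
import io


def read_genres(raw_bytes, encoding='utf-8'):
    """Decode CSV bytes once, then parse, so every field is a text object."""
    # Decode at the boundary; newline='' lets the csv module handle line endings.
    text_stream = io.StringIO(raw_bytes.decode(encoding), newline='')
    # 'book' and 'genre' are hypothetical column names for this sketch.
    return {row['book']: row['genre'] for row in csv.DictReader(text_stream)}


sample = u'book,genre\r\n12,N\xe3o-fic\xe7\xe3o\r\n'.encode('utf-8')
genres = read_genres(sample)
print(all(isinstance(g, str) for g in genres.values()))
```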

DaltonST · 07-23-2014, 09:50 AM

My native Python 2.7 IDLE accepts my utf-8 u'N\xe3o-fic\xe7\xe3o' when I copy it in, and then prints it correctly as Não-ficção on my screen.

I do not use any byte strings. Pure utf-8. I use temp tables in metadata.db, and they show the utf-8 unicode strings properly as Não-ficção on my pc display using a SQLite management application. So metadata.db has the pure utf-8 data, and I get it back from there to update it in mi.set_user_metadata('#customxxx', custcol).

According to https://docs.python.org/2/howto/unicode.html, "Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled". That doesn't mean that it does not support utf-8.

The same source also says: "UTF-8 is one of the most commonly used encodings. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit numbers are used in the encoding. (There’s also a UTF-16 encoding, but it’s less frequently used than UTF-8.) "

Calibre's personal copy of Python 2.7x apparently does not support utf-8, although SQLite does. Otherwise, metadata.db would not be updated correctly.

Kovid, thanks again for your help.

kovidgoyal · 07-23-2014, 09:54 AM

Again,

Do this:

Code:

if isinstance(u_genre, bytes):
  u_genre = u_genre.decode('utf-8')

DaltonST · 07-23-2014, 10:05 AM

Attached to this is a screen snippet of the output of the following code in my ui.py:

id_map = {}
for book, genre in book_new_genre_final.items():
    #print("book, genre : ", book, genre)
    mi = Metadata(_('Unknown'))
    u_book = book #still in utf8
    u_genre = genre #still in utf8
    n_book = int(u_book)

    if isinstance(u_genre, bytes):
        print("u_genre is bytes")
    else:
        if isinstance(u_genre, unicode):
            print("u_genre is utf8")
        else:
            print("u_genre is something else...")

    if n_book > 0 and u_genre > ' ':

All of the output said "u_genre is utf8".

Please look at the screen snippet from metadata.db, which is also attached. Note the complex characters that only unicode can support, such as German umlauts and the Portuguese word.

kovidgoyal · 07-23-2014, 12:03 PM

You're conflating two different issues. First, utf-8 is completely irrelevant if your data is already a unicode object. If it is a unicode object, then you don't need to do anything, calling set() will work.

If it is not working, either u_genre is not what you think it is, or your column type is not what you think it is.

Use a print (repr(u_genre)) to see what u_genre is. Custom columns are perfectly capable of displaying unicode characters from the entire BMP.
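
To illustrate what repr() can reveal, here is a sketch (written for Python 3) comparing a text object that holds the word itself with one that holds the characters of its source literal. Whether the second case is what happened in this thread is an assumption; nothing in the thread confirms it.

```python
real = u'N\xe3o-fic\xe7\xe3o'            # the word itself
literal = "u'N\\xe3o-fic\\xe7\\xe3o'"    # the source text of that literal, stored as data

for name, value in (('real', real), ('literal', literal)):
    # repr() makes literal backslashes and quote characters visible.
    print(name, type(value).__name__, len(value), repr(value))

# A value containing a backslash followed by 'x' was escaped somewhere upstream.
looks_escaped = '\\x' in literal and '\\x' not in real
print(looks_escaped)
```

Both values are text objects, so an isinstance() check alone cannot tell them apart; the length and the repr() output can.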

And now for a small lecture on strings in python.

Python has two string types.

1) The bytes object. This, as its name implies, is simply an array of raw bytes (numbers between 0 and 255). This array may or may not be in any definite encoding. If you plan to use it as text, you should always decode it by using decode(whatever_encoding_you_think_it_is_in)

2) The unicode object. This is an array of unicode codepoints. That is a number from 0 to 65535. Each of these numbers represents a single character from the BMP (Basic Multilingual Plane) of the unicode standard. (there are actually more complications for non-BMP unicode text, diacritics, glyphs, etc. that I won't get into here). Internally, the unicode object is represented as either UTF-16 or UCS-4 (also known as UTF-32), depending on the compilation options of python. On windows it is always UTF-16.

Everywhere you deal with text in calibre, you must ensure you always use unicode objects, not bytestrings (the bytes objects from (1))
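
The distinction between the two types can be sketched as follows. This is only an illustration using the example word from the thread; it is written to run on Python 3, where the text type is called str rather than unicode.

```python
text = u'N\xe3o-fic\xe7\xe3o'

# A text object is a sequence of code points, independent of any encoding.
code_points = [ord(ch) for ch in text]

# Encoding turns the same text into different byte sequences.
as_utf8 = text.encode('utf-8')
as_utf16 = text.encode('utf-16-le')

print(len(text), len(as_utf8), len(as_utf16))

# Decoding with the matching codec recovers the original text.
assert as_utf8.decode('utf-8') == text
assert as_utf16.decode('utf-16-le') == text
```

The byte strings differ in length and content, while the text object they decode to is the same, which is why the encoding only matters at the point where bytes are read or written.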

DaltonST · 07-23-2014, 01:47 PM

Dear Kovid,

I assiduously follow the Python 2.7 best practices of "decode early, unicode everywhere, encode late".

Thanks for the lecture. Very well received.

The data is coming from a file encoded in "Unicode-Subset-UTF8". It is remaining in "Unicode-Subset-UTF8". See the attached image proving its encoding.

Also attached is an image of "Unicode-Subset-UTF8" data read from the .csv file just above and inserted immediately and directly into metadata.db. Note the Unicode complex characters, such as in não-ficção. ASCII cannot do that. Only Unicode.

I am using "Unicode-Subset-UTF8" because that is the most common variant of Unicode. UTF-8 can only handle about 60,000 unicode symbols. Full unicode has over 1 million. However, 60,000 is enough for most uses. UTF-8 may be a subset of full Unicode, but it is still "unicode", and will return a True in "if isinstance(xxxxx, unicode)".

An image of the custom column configuration in Calibre > Preferences is also attached. It is simply Text.

An image of sqlite table custom_columns within metadata.db is also attached, showing exactly what metadata.db has for the custom column.

The "if isinstance(xxxxx, unicode)" test you had me do several posts ago proves beyond a shadow of a doubt that the data is unicode. It is also intuitively obvious that u'N\xe3o-fic\xe7\xe3o' is unicode, anyway, given its structure. It prints everywhere that can handle virtually any code page above ascii as não-ficção. Python 2.7 IDLE prints it properly, of course.

I know that the Calibre gui handles "Unicode-Subset-UTF8", because I have copied and pasted não-ficção into the Tag field for a few test ebooks. Ditto for book Title and book Author. See the attached image.

I know that SQLite handles "Unicode-Subset-UTF8". See the attached image that proves that.

Per your request, I did a repr(u_genre) in IDLE. See the attached image. It returned: "u'N\\xe3o-fic\\xe7\\xe3o'" Repr wrapped it in double quotes.

Per your request, I did a repr(u_genre) in my plugin in ui.py . See the attached image. Repr made a unicode object out of a unicode object. It wrapped u'N\xe3o-fic\xe7\xe3o' in the standard u'xxxxxxxxxxxx' form. See the attached.

I tried wrapping the utf8 variable in double quotes prior to updating it in mi.set_user_metadata('#genre', custcol) , but all that did was turn a unicode object into a unicode object wrapped in double quotes. Better than what repr() did, though, since there was only a single u'.

So, in short, I have pure "Unicode-Subset-UTF8" data that came from Unicode(UTF-8), was INSERTED into metadata.db by SQLite, then was fetched from metadata.db by SQLite and put into a Python dictionary just the way it arrived from metadata.db via SQLite. That Python dictionary is the source of u_genre and the book to which it belongs, and as the "if isinstance(u_genre, unicode)" test proves, they were still "Unicode-Subset-UTF8" objects when given to mi.set_user_metadata to update.

What does this all mean? mi.set_user_metadata is turning a "Unicode-Subset-UTF8" object into a byte string before storing it in metadata.db. See the attached image of the raw SQLite data in metadata.db. The values are identical to the debug print output in ui.py immediately prior to the mi.set_user_metadata statement.

There is a lot of objective proof here that mi.set_user_metadata is not "Unicode-Subset-UTF8" friendly. "Unicode-Subset-UTF8" is by far the most common subset, at least according to the folks at https://docs.python.org/2/howto/unicode.html . They say:

UTF-8 has several convenient properties:
* It can handle any Unicode code point.
* A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
* A string of ASCII text is also valid UTF-8 text.
* UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
* If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.
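
The properties quoted in this list can be checked directly. The sketch below is an illustration only, written for Python 3; the sample characters are arbitrary choices and do not come from the thread's data.

```python
samples = [u'A', u'\xe3', u'\u20ac', u'\U0001F600']  # ASCII, Latin-1, BMP, beyond the BMP

for ch in samples:
    encoded = ch.encode('utf-8')
    # No encoded character here contains a zero byte.
    assert 0 not in bytearray(encoded)
    print('U+%04X -> %d byte(s)' % (ord(ch), len(encoded)))

# ASCII text is unchanged by UTF-8 encoding.
assert u'calibre'.encode('utf-8') == b'calibre'
```

The last sample lies outside the Basic Multilingual Plane and still encodes and decodes cleanly, consistent with the first property in the quoted list.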

In summary, there is a lot of objective proof attached to this post that mi.set_user_metadata likely should be enhanced to be "Unicode-Subset-UTF8" friendly.
