MobileRead Forums > E-Book Software > Calibre > Development

07-23-2014, 07:05 AM   #1
DaltonST
Deviser
Posts: 2,265
Karma: 2090983
Join Date: Aug 2013
Location: Texas
Device: none
Unicode & mi.set_user_metadata('#customxxx', custcol)

The Metadata class in src/calibre/ebooks/metadata/book/base.py does nothing special regarding utf-8 encoding of new values for custom columns that are updated via custcol['#value#'] = u_xxxxxx followed by mi.set_user_metadata('#customxxx', custcol). The dictionary I use for custcol['#value#'] = u_xxxxxx is pure utf-8. However, after updating, the gui shows everything as utf-8 literals instead of what would be displayed if the value were printed.

Example: u_xxxxxx might be u'N\xe3o-fic\xe7\xe3o', which should appear to the user in the gui as Não-ficção. Instead, it appears to the user as u'N\xe3o-fic\xe7\xe3o'.

Is there any special encoding, method, or other syntax required for custcol['#value#'] = u_xxxxxx in order to force mi.set_user_metadata to store the value in a form suitable for display?

Thanks in advance for your help.
07-23-2014, 07:20 AM   #2
kovidgoyal
creator of calibre
Posts: 43,844
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
If your custom column stores multiple values, like tags, you need to set the value to a list of values, not a string.

And if you want to set the value of a field, use set() not set_user_metadata(). Like this:

mi.set('#custcol', ['value1', 'value2'])
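
The difference between the two value shapes can be sketched without calibre itself. The helper below is hypothetical (`coerce_for_column` and its `is_multiple` flag are not part of calibre's API); it only illustrates the shape described above: a list for a tags-like column, a single string otherwise.

```python
def coerce_for_column(value, is_multiple):
    """Return `value` in the shape described above for mi.set().

    Hypothetical helper, not a calibre function: `is_multiple` stands in
    for whatever tells you the column holds several values, like tags.
    """
    if is_multiple:
        # Multi-valued column: always hand over a list of strings.
        if isinstance(value, (list, tuple)):
            return list(value)
        return [value]
    # Single-valued column: hand over one plain string.
    if isinstance(value, (list, tuple)):
        if len(value) != 1:
            raise ValueError('single-valued column needs exactly one value')
        return value[0]
    return value


print(coerce_for_column('value1', is_multiple=True))
print(coerce_for_column(['value1', 'value2'], is_multiple=True))
print(coerce_for_column(['value1'], is_multiple=False))
```

The result would then be what gets passed as the second argument to `mi.set('#custcol', ...)`.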
07-23-2014, 07:56 AM   #3
DaltonST
Deviser
Kovid,

No, this is a single value in a single custom column, per book, in the gui. My question is not how to update a custom column with a single value. It is how to supply the value in the correct format, using the correct syntax, so that the gui displays it in human format, not utf-8 format. The mi.set_user_metadata('#customxxx', custcol) call updates perfectly, except that the gui shows u'N\xe3o-fic\xe7\xe3o' instead of Não-ficção.
07-23-2014, 08:05 AM   #4
kovidgoyal
creator of calibre
I showed you the syntax above. For a single-valued column you use

mi.set('#custcol', u'N\xe3o-fic\xe7\xe3o')
07-23-2014, 08:17 AM   #5
DaltonST
Deviser
Unicode Literally Appearing On Gui Display

Kovid,

There was no difference in output on the display screen. The gui still shows the same unicode literal u'N\xe3o-fic\xe7\xe3o' after changing from mi.set_user_metadata(...) to mi.set(...). Both update the custom column perfectly, with identical output. Again, my question is not about how to get it updated in general, but rather how to get it updated so that the human being sitting at the Calibre gui does not see u'N\xe3o-fic\xe7\xe3o' but instead sees the "real" word, Não-ficção. The gui shows the unicode string, not the human-readable string. Is there special syntax to make Calibre show the value in human-readable form instead of the original unicode string?

Please see the attached image.
Attached image: Capture2.JPG

Last edited by DaltonST; 07-23-2014 at 08:35 AM. Reason: Added Screen Snippet
07-23-2014, 08:38 AM   #6
kovidgoyal
creator of calibre
Do this:

mi.set('#custcol', [u'N\xe3o-fic\xe7\xe3o'])
07-23-2014, 08:50 AM   #7
DaltonST
Deviser
Now nothing is updated, or perhaps a null was updated. mi.set_user_metadata('#genre', custcol) at least updates the unicode literal, as undesirable as that may be. This is the exact code as of this moment:

for book, genre in book_new_genre_final.items(): #dictionary utf-8
    mi = Metadata(_('Unknown'))
    u_book = book #still in utf8
    u_genre = genre #still in utf8
    n_book = int(u_book)
    if u_book > 0 and u_genre > ' ':
        custcol = custom_columns['#genre']
        custcol['#value#'] = u_genre
        mi.set('#genre', [u_genre])
        # mi.set_user_metadata('#genre', custcol)
        books_updated.append(n_book)
        id_map[n_book] = mi
payload = books_updated
edit_metadata_action = self.gui.iactions['Edit Metadata']
edit_metadata_action.apply_metadata_changes(id_map , callback=self._finish_displaying_results(payload))
else:
xxxxxxxx

Last edited by DaltonST; 07-23-2014 at 08:57 AM.
07-23-2014, 09:20 AM   #8
kovidgoyal
creator of calibre
There is nothing special you need to do to set unicode values. Unicode objects in python are not utf-8; they are either utf-16 or ucs-4, depending on compile-time options.

Where are you getting u_genre from?

If you want to set a unicode value pass in a unicode string not a bytestring. That means,

Code:
if isinstance(u_genre, bytes):
  u_genre = u_genre.decode('utf-8')
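
A slightly fuller sketch of the same check, written so that it also runs on Python 3 (where the `unicode` type of Python 2 is simply `str`). The function name `ensure_text` is illustrative only and is not a calibre API.

```python
def ensure_text(value, encoding='utf-8'):
    """Return `value` as a text string, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b'N\xc3\xa3o-fic\xc3\xa7\xc3\xa3o'   # UTF-8 encoded bytes
text = ensure_text(raw)

print(type(text).__name__)
print(text == u'N\xe3o-fic\xe7\xe3o')
print(ensure_text(text) is text)   # text passes through untouched
```

Calling it once, at the point where the value enters the program, keeps everything after that point working with text only.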

Last edited by kovidgoyal; 07-23-2014 at 09:28 AM.
07-23-2014, 09:28 AM   #9
DaltonST
Deviser
u_genre is pure utf-8. That is why it is showing up as a utf-8 "string", u'N\xe3o-fic\xe7\xe3o'. It never was a byte string, ever.

I have reverted to using mi.set_user_metadata('#genre', custcol) , and the gui shows u'N\xe3o-fic\xe7\xe3o' again instead of a null. That was my motivation for starting this thread in the first place.

u_genre was never not a unicode string. It was never anything except utf-8. It was imported via a .csv file encoded as Unicode(utf-8). I use a drop-in replacement for the Python 2.x csv DictReader that totally supports utf-8.

The header of my .py files all say : # -*- coding: utf-8 -*-

All of the plugins I have seen use # -*- coding: utf-8 -*-.

Are you saying it should be # -*- coding: utf-16 -*- ?
07-23-2014, 09:30 AM   #10
kovidgoyal
creator of calibre
Again, unicode objects in python are not utf-8, they are either utf-16 or ucs-4 depending on compile time options.

Do this:

Code:
if isinstance(u_genre, bytes):
  u_genre = u_genre.decode('utf-8')
before you use u_genre anywhere.

And from the sound of it "totally supports utf-8" is totally incorrect, if it is yielding bytestrings from the csv file instead of unicode objects.
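
Whether a CSV reader yields text or bytestrings can be checked directly. The sketch below uses only the standard library and is written for Python 3, where `csv.DictReader` reads from a text stream; the sample data and the column names `book` and `genre` are made up for illustration. On Python 2.7 the stock `csv` module returns bytestrings, so each value would need the `decode('utf-8')` step shown above.

```python
import csv
import io

# Made-up sample: a UTF-8 encoded CSV file held in memory as raw bytes.
raw_csv = u'book,genre\n12,N\xe3o-fic\xe7\xe3o\n'.encode('utf-8')

# Decode while reading, so every value arrives as text rather than bytes.
stream = io.TextIOWrapper(io.BytesIO(raw_csv), encoding='utf-8', newline='')
rows = list(csv.DictReader(stream))

for row in rows:
    for key, value in row.items():
        # Anything still of type bytes here would need value.decode('utf-8').
        print(key, type(value).__name__, isinstance(value, bytes))
```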
07-23-2014, 09:50 AM   #11
DaltonST
Deviser
My native Python 2.7 IDLE accepts my utf-8 u'N\xe3o-fic\xe7\xe3o' when I copy it in, and then prints it correctly as Não-ficção on my screen.

I do not use any byte strings. Pure utf-8. I use temp tables in metadata.db, and they show the utf-8 unicode strings properly as Não-ficção on my pc display using a SQLite management application. So metadata.db has the pure utf-8 data, and I get it back from there to update it in mi.set_user_metadata('#customxxx', custcol).

According to https://docs.python.org/2/howto/unicode.html, "Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled". That doesn't mean that it does not support utf-8.

The same source also says: "UTF-8 is one of the most commonly used encodings. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit numbers are used in the encoding. (There’s also a UTF-16 encoding, but it’s less frequently used than UTF-8.) "

Calibre's personal copy of Python 2.7x apparently does not support utf-8, although SQLite does. Otherwise, metadata.db would not be updated correctly.

Kovid, thanks again for your help.
07-23-2014, 09:54 AM   #12
kovidgoyal
creator of calibre
Again,

Do this:
Code:
if isinstance(u_genre, bytes):
  u_genre = u_genre.decode('utf-8')
07-23-2014, 10:05 AM   #13
DaltonST
Deviser
Attached to this is a screen snippet of the output of the following code in my ui.py:

id_map = {}
for book, genre in book_new_genre_final.items():
    #print("book, genre : ", book, genre)
    mi = Metadata(_('Unknown'))
    u_book = book #still in utf8
    u_genre = genre #still in utf8
    n_book = int(u_book)

    if isinstance(u_genre, bytes):
        print("u_genre is bytes")
    else:
        if isinstance(u_genre, unicode):
            print("u_genre is utf8")
        else:
            print("u_genre is something else...")


    if n_book > 0 and u_genre > ' ':

All of the output said "u_genre is utf8".

Please look at the screen snippet from metadata.db, which is also attached. Note the complex characters that only unicode can support, such as German umlauts and the Portuguese word.
Attached images: Capture_utf8_proof.JPG, utf8_words_in_metadata.JPG

Last edited by DaltonST; 07-23-2014 at 10:46 AM. Reason: Changed code to show latest version
07-23-2014, 12:03 PM   #14
kovidgoyal
creator of calibre
You're conflating two different issues. First, utf-8 is completely irrelevant if your data is already a unicode object. If it is a unicode object, then you don't need to do anything; calling set() will work.

If it is not working, either u_genre is not what you think it is, or your column type is not what you think it is.

Use print(repr(u_genre)) to see what u_genre is. Custom columns are perfectly capable of displaying unicode characters from the entire BMP.
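
A minimal sketch of what that check can reveal. It contrasts a genuine text value with a string that merely contains the Python literal for it, which is what would end up on screen if a `repr()` had been stored somewhere along the way. That cause is an assumption for illustration, not something confirmed in this thread. The code is written for Python 3, where `ascii()` produces the escaped form that `repr()` produces on Python 2.

```python
real_text = u'N\xe3o-fic\xe7\xe3o'

# A *different* string: the literal text of a repr, backslashes included.
# (Hypothetical value, built here only to show the contrast.)
stored_repr = "u" + ascii(real_text)

print(real_text == stored_repr)
print(len(real_text), len(stored_repr))
print(stored_repr)   # plain ASCII: the escaped literal, quotes and all
```

If the value about to be stored looks like the second string when printed without `repr()`, the escaping happened before the value ever reached the column.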


And now for a small lecture on strings in python.

Python has two string types.

1) The bytes object. This, as its name implies, is simply an array of raw bytes (numbers between 0 and 255). This array may or may not be in any definite encoding. If you plan to use it as text, you should always decode it by using decode(whatever_encoding_you_think_it_is_in).

2) The unicode object. This is an array of unicode codepoints, that is, numbers from 0 to 65535. Each of these numbers represents a single character from the BMP (Basic Multilingual Plane) of the unicode standard. (There are actually more complications for non-BMP unicode text, diacritics, glyphs, etc. that I won't get into here.) Internally, the unicode object is represented as either UTF-16 or UCS-4 (also known as UTF-32), depending on the compilation options of python. On windows it is always UTF-16.

Everywhere you deal with text in calibre, you must ensure you always use unicode objects, not bytestrings (the bytes objects from (1)).
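
The two types can be seen side by side in a few lines. This sketch is written for Python 3, where the text type is called `str` rather than `unicode`; the encodings listed are just examples.

```python
text = u'N\xe3o-fic\xe7\xe3o'          # text: a sequence of code points

# Each character is one code point, whatever encoding is used later.
code_points = [ord(ch) for ch in text]
print(len(text), code_points[1])

# Encoding turns text into bytes; the result depends on the encoding chosen.
for encoding in ('utf-8', 'utf-16-le', 'utf-32-le'):
    data = text.encode(encoding)
    # Decoding with the same encoding gives back the original text.
    assert data.decode(encoding) == text
    print(encoding, type(data).__name__, len(data))
```

The text is the same in all three cases; only the byte representation changes, which is why an encoding name such as utf-8 describes bytes and not the text object itself.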

Last edited by kovidgoyal; 07-23-2014 at 01:11 PM.
07-23-2014, 01:47 PM   #15
DaltonST
Deviser
Dear Kovid,

I assiduously follow the Python 2.7 best practices of "decode early, unicode everywhere, encode late".

Thanks for the lecture. Very well received.

The data is coming from a file encoded in "Unicode-Subset-UTF8". It is remaining in "Unicode-Subset-UTF8". See the attached image proving its encoding.

Also attached is an image of "Unicode-Subset-UTF8" data read from the .csv file just above and inserted immediately and directly into metadata.db. Note the Unicode complex characters, such as in não-ficção. ASCII cannot do that. Only Unicode.

I am using "Unicode-Subset-UTF8" because that is the most common variant of Unicode. UTF-8 can only handle about 60,000 unicode symbols. Full unicode has over 1 million. However, 60,000 is enough for most uses. UTF-8 may be a subset of full Unicode, but it is still "unicode", and will return True for "if isinstance(xxxxx, unicode)".

An image of the custom column configuration in Calibre > Preferences is also attached. It is simply Text.

An image of sqlite table custom_columns within metadata.db is also attached, showing exactly what metadata.db has for the custom column.

The "if isinstance(xxxxx, unicode)" test you had me do several posts ago proves beyond a shadow of a doubt that the data is unicode. It is also intuitively obvious that u'N\xe3o-fic\xe7\xe3o' is unicode anyway, given its structure. Everywhere that can handle virtually any code page above ascii, it prints as não-ficção. Python 2.7 IDLE prints it properly, of course.

I know that the Calibre gui handles "Unicode-Subset-UTF8", because I have copied and pasted não-ficção into the Tag field for a few test ebooks. Ditto for book Title and book Author. See the attached image.

I know that SQLite handles "Unicode-Subset-UTF8". See the attached image that proves that.

Per your request, I did a repr(u_genre) in IDLE. See the attached image. It returned: "u'N\\xe3o-fic\\xe7\\xe3o'" Repr wrapped it in double quotes.

Per your request, I did a repr(u_genre) in my plugin in ui.py . See the attached image. Repr made a unicode object out of a unicode object. It wrapped u'N\xe3o-fic\xe7\xe3o' in the standard u'xxxxxxxxxxxx' form. See the attached.

I tried wrapping the utf8 variable in double quotes prior to updating it in mi.set_user_metadata('#genre', custcol) , but all that did was turn a unicode object into a unicode object wrapped in double quotes. Better than what repr() did, though, since there was only a single u'.

So, in short, I have pure "Unicode-Subset-UTF8" data that came from Unicode(UTF-8), was INSERTED into metadata.db by SQLite, then was fetched from metadata.db by SQLite and put into a Python dictionary just the way it arrived from metadata.db via SQLite. That Python dictionary is the source of u_genre and the book to which it belongs, and as the "if isinstance(u_genre, unicode)" test proves, they were still "Unicode-Subset-UTF8" objects when given to mi.set_user_metadata to update.

What does this all mean? mi.set_user_metadata is turning a "Unicode-Subset-UTF8" object into a byte string before storing it in metadata.db. See the attached image of the raw SQLite data in metadata.db. The values are identical to the debug print output in ui.py immediately prior to the mi.set_user_metadata statement.


There is a lot of objective proof here that mi.set_user_metadata is not "Unicode-Subset-UTF8" friendly, even though "Unicode-Subset-UTF8" is by far the most common subset, at least according to the folks at https://docs.python.org/2/howto/unicode.html. They say:

UTF-8 has several convenient properties:
* It can handle any Unicode code point.
* A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
* A string of ASCII text is also valid UTF-8 text.
* UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
* If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.
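
The properties quoted above can be checked directly. The sample characters below are arbitrary choices for illustration, and the code is written for Python 3.

```python
samples = {
    'A': u'A',                    # ASCII
    'a-tilde': u'\xe3',           # Latin-1 range
    'euro': u'\u20ac',            # elsewhere in the BMP
    'emoji': u'\U0001F600',       # outside the BMP
}

# UTF-8 uses between one and four bytes per code point.
lengths = {name: len(ch.encode('utf-8')) for name, ch in samples.items()}
print(lengths)

# ASCII text is already valid UTF-8, byte for byte.
assert u'calibre'.encode('utf-8') == u'calibre'.encode('ascii')

# No zero bytes appear unless the text itself contains U+0000.
assert b'\x00' not in u'N\xe3o-fic\xe7\xe3o'.encode('utf-8')

# Every code point survives a round trip through UTF-8.
for ch in samples.values():
    assert ch.encode('utf-8').decode('utf-8') == ch
```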


In summary, there is a lot of objective proof attached to this post that mi.set_user_metadata likely should be enhanced to be "Unicode-Subset-UTF8" friendly.
Attached images: unicode_author_title_tag_genre_in_Calibre.JPG, custom_column_configuration.JPG, custom_column_table_metadata_db.JPG, csv_in_unicode_utf8_encoding.JPG, utf8_in_calibre_metadata_db.JPG, IDLE_repr_u_genre_proof.JPG, custom_column_9_table_in_metadata_db.JPG, debug_output_of_repr_in_ui_py.JPG

Last edited by DaltonST; 07-23-2014 at 01:50 PM.
Tags
custcol['#value#'], set_user_metadata, unicode



All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.