Unicode issues

Rev. Bob · 07-06-2014, 04:59 AM

For reference, there's more info about this issue over in the "Modify ePub" thread. Here's the gist of it, though...

I've added a pair of very similar routines to the Modify ePub plugin, both related to removing excess gunk from EPUB books. However, in some circumstances, I'm running into a Unicode translation issue that I've been unable to solve. I've tried explicitly converting both items to Unicode during the comparison, along with a few similar approaches, but nothing seems to work and I've hit a wall. Can anyone lend a hand, if only by simply pointing me in the right direction?

There's a bug report here*, and my first link above is to a post with the latest version of the code. That's not the version I'm currently using, though; I've reverted to this version until I can find a working Unicode fix.

* As successive posts make clear, the bug reporter is not actually trying to "remove kobo drm," but rather to remove bloated code left behind in ex-kepubs after removal of the Kobo DRM by another tool - which, of course, shall remain nameless. The Modify ePub plugin itself does not remove DRM, but the whole point of one of the new routines is to clean up ex-kepubs and get them closer to being true EPUBs, just as the non-beta version of the plugin already does for the META tags left behind by ADEPT DRM.

kovidgoyal · 07-06-2014, 05:09 AM

When you compare a unicode string to a bytestring in Python, Python will try to autoconvert the bytestring into unicode for the comparison, using a "default" encoding that is system dependent.

You should almost always manually decode strings that come from text files before doing anything with them. There are routines to help you do that in the calibre.ebooks.chardet module.

Or see the implementation of the decode() method http://manual.calibre-ebook.com/poli...ntainer.decode
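To illustrate the advice above, here is a minimal sketch of decoding file bytes explicitly before comparing them to a unicode string. It uses only the standard library, not the calibre routines mentioned in the post. The helper name to_text and the candidate encoding list are assumptions made for this example. It is written so that it also runs on Python 3, where bytes and text never compare equal at all.

```python
def to_text(raw, encodings=('utf-8', 'cp1252')):
    """Return text for raw bytes, trying each candidate encoding in turn.

    Hypothetical helper: the name and the default encoding list are
    assumptions for this sketch, not part of calibre or the plugin.
    """
    if not isinstance(raw, bytes):
        return raw  # already text, nothing to decode
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes as U+FFFD.
    return raw.decode(encodings[0], 'replace')


# Bytes as they would come out of a file read in binary mode.
file_bytes = u'n\xe3o-fic\xe7\xe3o'.encode('utf-8')
wanted = u'n\xe3o-fic\xe7\xe3o'

# Compare text with text; no implicit conversion is involved.
matches = to_text(file_bytes) == wanted
```

The point of the design is that the decode happens once, at the boundary where bytes enter the program, so every later comparison is text against text.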

DaltonST · 07-20-2014, 03:26 PM

Re: Unicode Issues, Comparing Strings, ISO Encoding, SQL and Other Maladies

After a multitude of Unicode issues, both when comparing strings between UTF-8 and ISO-8859-15 and when building SQL statements from variables whose values originated in metadata.db (which caused runtime failures), I stumbled upon this little gem, which I would like to share with the forum. I have not seen this syntax in any other documentation. See: http://stackoverflow.com/questions/2...15-with-python

>>> a = 'ü'
>>> a.decode('utf8') # terminal is configured to use UTF-8 by default
u'\xfc'
>>> a.decode('utf8').encode('iso8859-15')
'\xfc'

So, the secret to keeping Python 2 from "covertly" decoding to ASCII before it re-encodes (or tries to) to iso8859-15, and hence failing on non-ASCII characters such as those in 'não-ficção', is to use this syntax:

>>>>>> a.decode('utf8').encode('iso8859-15') <<<<<<<<<<
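For anyone who wants the round trip above as a reusable function, here is a small sketch. The function name transcode and its default arguments are my own choices for illustration. It starts from a bytes literal, so it behaves the same on Python 2 and Python 3.

```python
def transcode(raw, source='utf-8', target='iso8859-15'):
    """Decode bytes from `source`, then encode the text to `target`.

    Hypothetical helper for this sketch; the name and defaults are
    assumptions, not taken from the linked answer.
    """
    text = raw.decode(source)   # bytes -> text, with an explicit codec
    return text.encode(target)  # text -> bytes in the new encoding


# UTF-8 bytes for the example word from the post.
utf8_bytes = b'n\xc3\xa3o-fic\xc3\xa7\xc3\xa3o'
latin9_bytes = transcode(utf8_bytes)
```

Note that the encode step raises UnicodeEncodeError if the text contains a character that the target encoding cannot represent, so callers may want to catch that or pass an error handler.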
