07-06-2014, 04:59 AM | #1 |
Wizard
Posts: 1,760
Karma: 9918418
Join Date: Feb 2013
Location: Here on the perimeter, there are no stars
Device: Kobo H2O, iPad mini 3, Kindle Touch
|
Unicode issues
For reference, there's more info about this issue over in the "Modify ePub" thread. Here's the gist of it, though...
I've added a pair of very similar routines to the Modify ePub plugin, both related to removing excess gunk from EPUB books. However, in some circumstances, I'm running into a Unicode translation issue that I've been unable to solve. I've tried explicitly converting both items to Unicode during the comparison, along with a few similar approaches, but nothing seems to work and I've hit a wall.
Can anyone lend a hand, if only by simply pointing me in the right direction?
There's a bug report here*, and my first link above is to a post with the latest version of the code. That's not the version I'm currently using, though; I've reverted to this version until I can find a working Unicode fix.
* As successive posts make clear, the bug reporter is not actually trying to "remove kobo drm," but rather to remove bloated code left behind in ex-kepubs after removal of the Kobo DRM by another tool, which, of course, shall remain nameless. The Modify ePub plugin itself does not remove DRM, but the whole point of one of the new routines is to clean up ex-kepubs and get them closer to being true EPUBs, just as the non-beta version of the plugin already does for the META tags left behind by ADEPT DRM. |
07-06-2014, 05:09 AM | #2 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
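When you compare a unicode string to a bytestring in Python 2, Python will try to autoconvert the bytestring into unicode for the comparison, using a "default" encoding that is system dependent. If that conversion fails, the two strings are simply treated as unequal (Python only emits a UnicodeWarning), which is why the comparison can fail silently on non-ASCII text.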
You should almost always manually decode strings that come from text files before doing anything with them. There are routines to help you do that in the calibre.ebooks.chardet module. Or see the implementation of the decode() method http://manual.calibre-ebook.com/poli...ntainer.decode |
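To make the "decode first, then compare" advice concrete, here is a minimal sketch. It does not use the calibre routines mentioned above; `to_text` is a hypothetical helper, and the fallback encoding list (`utf-8`, then `cp1252`) is an assumption, not what calibre does. It is written for Python 3, where bytes and text never compare equal at all; in Python 2 the same explicit decode step applies.

```python
def to_text(value, encodings=("utf-8", "cp1252")):
    """Return text for value, decoding bytes explicitly.

    Hypothetical helper: tries each encoding in turn and falls back to
    UTF-8 with replacement characters if none of them fits.
    """
    if isinstance(value, bytes):
        for enc in encodings:
            try:
                return value.decode(enc)
            except UnicodeDecodeError:
                continue
        return value.decode("utf-8", errors="replace")
    return value  # already text, nothing to decode


# "não-ficção", written with escapes so the source file stays ASCII
wanted = "n\u00e3o-fic\u00e7\u00e3o"
raw = wanted.encode("utf-8")  # stands in for bytes read from a file

# Comparing bytes to text directly is never equal in Python 3
direct = (raw == wanted)

# Decoding both sides first gives the comparison that was intended
same = (to_text(raw) == to_text(wanted))
```

The point of routing both operands through one helper is that the encoding decision is made in a single, visible place instead of being left to the interpreter's default.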
07-20-2014, 03:26 PM | #3 |
Deviser
Posts: 2,265
Karma: 2090983
Join Date: Aug 2013
Location: Texas
Device: none
|
Re: Unicode Issues, Comparing Strings, ISO Encoding, SQL and Other Maladies
After having a multitude of Unicode issues dealing with comparison of strings between utf-8 and iso8859-15, and generally with creating SQL statements using variables whose values originated in metadata.db and were causing runtime failures, I stumbled upon this little gem which I would like to share with the forum. I have not seen this syntax in any other documentation anywhere.
See: http://stackoverflow.com/questions/2...15-with-python
>>> a = 'ü'
>>> a.decode('utf8')  # terminal is configured to use UTF-8 by default
u'\xfc'
>>> a.decode('utf8').encode('iso8859-15')
'\xfc'
So, the secret to keeping Python 2 from "covertly" decoding to ascii before it re-encodes (or tries to) to iso8859-15, and hence failing on all the non-ascii characters in the process, such as those in 'não-ficção', is to use this syntax:
>>>>>> a.decode('utf8').encode('iso8859-15') <<<<<<<<<< |
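The decode-then-encode step from the post above can be wrapped up as a small function. This is only a sketch: `transcode` is a hypothetical name, and it assumes the input really is UTF-8. It is written so that it runs on Python 3, where the input must be a bytes object; the same two calls are what the Python 2 session above uses.

```python
def transcode(data, source="utf-8", target="iso8859-15"):
    """Convert encoded bytes from one encoding to another.

    Hypothetical helper: decodes explicitly with the source encoding,
    then encodes with the target, so no implicit ASCII step is involved.
    """
    text = data.decode(source)   # bytes -> text, using the known source encoding
    return text.encode(target)   # text -> bytes in the target encoding


utf8_bytes = b"\xc3\xbc"  # the UTF-8 encoding of U+00FC (u with diaeresis)
latin9_bytes = transcode(utf8_bytes)

# "não-ficção" as UTF-8 bytes, converted the same way
phrase = transcode("n\u00e3o-fic\u00e7\u00e3o".encode("utf-8"))
```

If the input contains a character that iso8859-15 cannot represent, `encode` raises UnicodeEncodeError rather than dropping it, so callers may want to decide on an `errors=` policy for that case.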
|