07-06-2014, 04:59 AM   #1
Rev. Bob
Wizard
Unicode issues

For reference, there's more info about this issue over in the "Modify ePub" thread. Here's the gist of it, though...

I've added a pair of very similar routines to the Modify ePub plugin, both related to removing excess gunk from EPUB books. However, in some circumstances, I'm running into a Unicode translation issue that I've been unable to solve. I've tried explicitly converting both items to Unicode during the comparison, along with a few similar approaches, but nothing seems to work and I've hit a wall. Can anyone lend a hand, if only by simply pointing me in the right direction?

There's a bug report here*, and my first link above is to a post with the latest version of the code. That's not the version I'm currently using, though; I've reverted to this version until I can find a working Unicode fix.

* As successive posts make clear, the bug reporter is not actually trying to "remove kobo drm," but rather to remove bloated code left behind in ex-kepubs after removal of the Kobo DRM by another tool - which, of course, shall remain nameless. The Modify ePub plugin itself does not remove DRM, but the whole point of one of the new routines is to clean up ex-kepubs and get them closer to being true EPUBs, just as the non-beta version of the plugin already does for the META tags left behind by ADEPT DRM.
07-06-2014, 05:09 AM   #2
kovidgoyal
creator of calibre
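When you compare a unicode string to a bytestring in Python 2, Python will try to autoconvert the bytestring into unicode for the comparison, using a "default" encoding that is system dependent (plain ASCII on a stock install). If that conversion fails, the two strings simply compare as unequal.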

You should almost always manually decode strings that come from text files before doing anything with them. There are routines to help you do that in the calibre.ebooks.chardet module.

Or see the implementation of the decode() method http://manual.calibre-ebook.com/poli...ntainer.decode
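
As a rough illustration of the decode-first advice above, here is a minimal sketch in Python 3 syntax (the thread itself concerns Python 2, where the implicit conversion happens). The helper name `to_text` and the fallback list of encodings are assumptions made for this example; it does not use or reproduce the calibre.ebooks.chardet routines.

```python
def to_text(data, encodings=("utf-8", "cp1252")):
    """Return text for `data`, decoding bytes explicitly.

    `encodings` is a hypothetical fallback order chosen for this sketch.
    """
    if isinstance(data, str):
        return data  # already text, nothing to decode
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    # Last resort: decode anyway, replacing undecodable bytes.
    return data.decode(encodings[0], errors="replace")


raw = "não-ficção".encode("utf-8")   # bytes, as read from a file
needle = "não-ficção"                # text to compare against

print(raw == needle)           # False: bytes and text never compare equal
print(to_text(raw) == needle)  # True: both sides are text now
```

The point of the sketch is only that both sides of a comparison are converted to text explicitly, with a known encoding, before they are compared.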
07-20-2014, 03:26 PM   #3
DaltonST
Deviser
Re: Unicode Issues, Comparing Strings, ISO Encoding, SQL and Other Maladies

After running into a multitude of Unicode issues, both when comparing strings between utf-8 and iso8859-15 and when building SQL statements from variables whose values originated in metadata.db (which caused runtime failures), I stumbled upon this little gem, which I would like to share with the forum. I have not seen this syntax in any other documentation. See: http://stackoverflow.com/questions/2...15-with-python

>>> a = 'ü'
>>> a.decode('utf8') # terminal is configured to use UTF-8 by default
u'\xfc'
>>> a.decode('utf8').encode('iso8859-15')
'\xfc'

So, the secret to keeping Python 2 from "covertly" decoding a byte string as ASCII before it re-encodes (or tries to re-encode) it as iso8859-15, and hence failing on all the non-ASCII characters such as those in 'não-ficção', is to decode explicitly first:

a.decode('utf8').encode('iso8859-15')
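
For anyone on Python 3, the same decode-then-encode round trip can be sketched as below. The function name `transcode` and its default arguments are assumptions made for this example; in Python 3 byte strings have no `.encode()` method at all, so the explicit decode step is always required.

```python
def transcode(data, source="utf-8", target="iso8859-15"):
    """Decode bytes from `source`, then re-encode them as `target`."""
    text = data.decode(source)   # bytes -> text, with a stated encoding
    return text.encode(target)   # text -> bytes in the new encoding


utf8_bytes = "ü".encode("utf-8")
print(utf8_bytes)             # b'\xc3\xbc'
print(transcode(utf8_bytes))  # b'\xfc'
```

Naming both encodings explicitly means nothing depends on the interpreter's default encoding.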