Problem with Unicode masthead text

pietvo · 12-26-2011, 06:26 PM

I am working on a new recipe for the newspaper La Razón, Bolivia, because the old one no longer works since the website changed. I have the new recipe mostly working; it just needs some aesthetic changes. I will post it when it is finished.

The website of the paper doesn't have a suitable masthead image anymore so I used

Code:

    def get_masthead_title(self):
        return u'La Razón'

However, the masthead image generator encodes the title as UTF-8, which produces strange characters instead of the 'ó'.

The PIL draw text method doesn't expect a UTF-8 encoded byte string, but it does accept a normal Unicode string. So I changed generate_masthead in calibre/ebooks/__init__.py to eliminate the conversion to UTF-8, and that solved the problem. Essentially, the line text = title.encode('utf-8') should be removed and title used instead of text.
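
As a rough illustration of why the encoded title comes out garbled, here is a minimal sketch that uses only the standard library. It assumes the renderer draws one glyph per byte when it is handed a byte string, which matches the output seen in the masthead but is not taken from the PIL source. The function glyphs_drawn_from_bytes is a hypothetical stand-in for that behaviour, and the syntax is Python 3 rather than the Python 2 that calibre used at the time.

```python
def glyphs_drawn_from_bytes(data):
    """Hypothetical stand-in for a renderer that maps each byte to one glyph.

    Decoding as Latin-1 turns every byte into the character with the same
    code point, which is the assumed per-byte behaviour.
    """
    return data.decode('latin-1')


title = 'La Raz\xf3n'            # 'La Razón' written with an escape for ó
encoded = title.encode('utf-8')  # what the old generate_masthead passed on
garbled = glyphs_drawn_from_bytes(encoded)

# ó is one character but two UTF-8 bytes, so the byte string is one longer.
print(len(title), len(encoded))
# ascii() keeps the output printable on any console.
print(ascii(title), ascii(garbled))
```

Under that assumption the two bytes of the encoded ó show up as two separate characters, which is the effect the patch avoids by passing the Unicode string straight through.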

Here is the diff:

Code:

diff -u /Applications/calibre.app/Contents/Resources/Python/site-packages/calibre/ebooks/__init__.py.~1~ /Applications/calibre.app/Contents/Resources/Python/site-packages/calibre/ebooks/__init__.py
--- /Applications/calibre.app/Contents/Resources/Python/site-packages/calibre/ebooks/__init__.py.~1~    2011-12-26 18:40:30.000000000 +0100
+++ /Applications/calibre.app/Contents/Resources/Python/site-packages/calibre/ebooks/__init__.py    2011-12-26 23:03:10.000000000 +0100
@@ -240,11 +240,10 @@
         font = ImageFont.truetype(font_path, 48)
     except:
         font = ImageFont.truetype(default_font, 48)
-    text = title.encode('utf-8')
-    width, height = draw.textsize(text, font=font)
+    width, height = draw.textsize(title, font=font)
     left = max(int((width - width)/2.), 0)
     top = max(int((height - height)/2.), 0)
-    draw.text((left, top), text, fill=(0,0,0), font=font)
+    draw.text((left, top), title, fill=(0,0,0), font=font)
     if output_path is None:
         f = StringIO()
         img.save(f, 'JPEG')

kiklop74 · 12-26-2011, 06:51 PM

You could have let me know the recipe needed an update, since I wrote the original one.

kiklop74 · 12-26-2011, 09:15 PM

Updated recipe will be included in the next release of Calibre

https://bugs.launchpad.net/calibre/+bug/908912

pietvo · 12-27-2011, 04:52 AM

Quote:

Originally Posted by kiklop74

You could have let me know the recipe needed an update, since I wrote the original one.

You are right, but I hadn't noticed your email address. I only noticed it after I finished the recipe. Anyway, this was my first recipe so it was a good exercise for me. Thanks for supplying a new one.

However, this topic is about the problem with Unicode masthead titles. Shall I make a bug report for it?

By the way, I spent a very nice holiday in Argentina last year: 4 days in BA and 4 days in Iguazu, on the road to Bolivia.

kovidgoyal · 12-27-2011, 05:31 AM

IIRC PIL requires UTF-8 encoded text. There's probably some double encoding going on somewhere. Try this instead:

text = title.encode('utf-8') if isinstance(title, unicode) else title

And also use

u'La Raz\xc3\xb3n' instead of u'La Razón'

pietvo · 12-27-2011, 11:10 AM

Quote:

Originally Posted by kovidgoyal

IIRC PIL requires UTF-8 encoded text. There's probably some double encoding going on somewhere.

Actually I tried it out with my patch. And it appears that PIL happily accepted the Unicode text and generated a proper image. Moreover, doing the utf-8 encoding as the current code does gives the wrong result. I think it is Qt that requires utf-8, not PIL.
The PIL documentation contains an example where Unicode text is given to draw:

Code:

 
font = ImageFont.truetype("symbol.ttf", 16, encoding="symb")     
draw.text((0, 0), unichr(0xF000 + 0xAA))

I think the draw.text call should have an additional parameter font=font.

Quote:

Originally Posted by kovidgoyal

There's probably some double encoding going on somewhere. Try this instead:

text = title.encode('utf-8') if isinstance(title, unicode) else title

And also use

u'La Raz\xc3\xb3n' instead of u'La Razón'

That seems very wrong to me (forcing a double encoding yourself).

kovidgoyal · 12-27-2011, 11:34 AM

That code is in there for a reason. And u'La Raz\xc3\xb3n' is not a double encoding, it is an ascii representation to ensure your problem isn't coming from the python interpreter parsing the .py file incorrectly.

pietvo · 12-27-2011, 04:57 PM

I don't know the reason the code is there. Maybe there used to be a reason which is no longer valid.

And u'La Raz\xc3\xb3n' is a double encoding; it is not an ASCII representation of u'La Razón'. That would be u'La Raz\xf3n', which is probably what you meant. What you have written is the UTF-8 encoding of ó, written as ASCII escapes inside a Unicode string. Putting UTF-8 bytes in a Unicode string is usually wrong. If I print that string it outputs La RazÃ³n, which is exactly the text I got in the masthead image, showing that the UTF-8 encoding that the code does should not be done. And the parser didn't parse the file incorrectly, because I had a # -*- coding: utf-8 -*- line and saved the file in UTF-8. To safeguard against source encoding problems \xf3 could indeed be used, but then the title.encode('utf-8') would still cause the wrong rendering.
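
The difference between the two escaped literals can be checked without PIL. The following is a small sketch in Python 3 syntax (the variable names are illustrative only). It compares the code points at the end of both strings and shows what a further UTF-8 encode does to each.

```python
intended = 'La Raz\xf3n'       # ó as a single code point, U+00F3
suggested = 'La Raz\xc3\xb3n'  # the two UTF-8 bytes of ó, each as its own code point

# Code points from the accented position onwards.
intended_tail = [hex(ord(ch)) for ch in intended[6:]]
suggested_tail = [hex(ord(ch)) for ch in suggested[6:]]
print(intended_tail)
print(suggested_tail)

# Encoding each string to UTF-8, as the masthead code did with the title.
once = intended.encode('utf-8')
twice = suggested.encode('utf-8')
print(once)
print(twice)

# The second literal only becomes the intended text after undoing one
# layer: reinterpret its code points as bytes, then decode as UTF-8.
recovered = suggested.encode('latin-1').decode('utf-8')
print(recovered == intended)
```

The second string is one character longer than the first, and encoding it again yields four bytes for the accented letter instead of two.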

Fredrik Lundh, the author of PIL, also says that the text can be a Unicode string if the font you use supports Unicode. Here is an example that you can try to see that it works.

Quote:

# -*- coding: utf-8 -*-
import ImageFont, Image, ImageDraw
s = u'La Razón € ñ'
font = ImageFont.truetype('/System/Library/Fonts/LucidaGrande.ttc', 18, encoding='unic')
print font.getsize(s)
im = Image.new('RGB', (200,200))
draw = ImageDraw.Draw(im)
draw.text((40,40), s, font=font)
im.show()

kovidgoyal · 12-27-2011, 11:40 PM

Fix committed.
