Recipe for www.inter.it - some letters are wrong!

Sciamano · 03-24-2011, 04:57 PM

Hi everyone,
I'm new to the boards and need some help with a recipe.
I'm going to be a commuter soon, so I wanted to create a recipe to download all the news that gets published on my favorite (Italian) soccer team's website.

This is the link to the RSS feed:
http://veleno.inter.it/aas/rss/index_full_it.xml

I've created this very simple custom recipe:

Code:

class AdvancedUserRecipe1300997108(BasicNewsRecipe):
    title          = u'Inter'
    oldest_article = 7
    max_articles_per_feed = 100

    feeds          = [(u'Inter News', u'http://veleno.inter.it/aas/rss/index_full_it.xml')]
    remove_tags    = [dict(name='div', attrs={'class':'piccolowww'})]

It seems to work fine, except for one little thing: where the article starts, and the date (day of the week, date, time) of the article is written, some letters in the ebook are changed.

For example, this is what it should read for today's news:
Giovedì, 24 Marzo 2011 14:44:03

But this is what I find in the resulting eBook:
Giovedě, 24 Marzo 2011 14:44:03

(see? the "ì" has been transformed to "ě")

Not a big deal, I can live with that, but since I'm a perfectionist, I'd like to solve it.
Also, if someone could help me remove the RSS logo images and the "permalink" link after the date, that would be great! I've tried but was not successful.

Thanks!!

oneillpt · 03-24-2011, 06:32 PM

Add a line specifying encoding to your recipe:

Code:

class AdvancedUserRecipe1300997108(BasicNewsRecipe):
    title          = u'Inter'
    encoding  = 'ISO-8859-15'
    oldest_article = 7
    max_articles_per_feed = 100

    feeds          = [(u'Inter News', u'http://veleno.inter.it/aas/rss/index_full_it.xml')]
    remove_tags    = [dict(name='div', attrs={'class':'piccolowww'})]

and this problem should be solved.
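
For anyone wondering why the symptom looks the way it does, here is a minimal sketch. It assumes (this is a guess, not something confirmed in this thread) that the page bytes were being decoded with a Central European codec such as ISO-8859-2. The single byte 0xEC is "ì" in ISO-8859-1 and ISO-8859-15, but "ě" in ISO-8859-2.

```python
# "ì" is stored as the single byte 0xEC in the Western European encodings.
raw = "Giovedì".encode("iso-8859-15")

# Decoded with the encoding named in the recipe, the text round-trips.
as_latin9 = raw.decode("iso-8859-15")

# Assumption for illustration: a Central European codec was guessed instead.
as_latin2 = raw.decode("iso-8859-2")

print(hex(raw[-1]))
print(as_latin9)
print(as_latin2)
```

Setting the encoding attribute in the recipe removes the guesswork, which would explain why the fix works.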

Sciamano · 03-24-2011, 06:42 PM

Thanks!! It worked! Karma+!
Any chance you can help me with removing the rss links and permalink after the date? That would be awesome!

Starson17 · 03-24-2011, 07:46 PM

Quote:

Originally Posted by oneillpt

Add a line specifying encoding to your recipe:

Code:

    encoding  = 'ISO-8859-15'

and this problem should be solved.

FYI, the site seems to specify encoding = 'ISO-8859-1', not 'ISO-8859-15'. The difference is subtle, but they differ in a few spots, particularly as to the Euro symbol. If he sees a missing Euro symbol, that's why.
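
To see exactly where the two differ, a short sketch using Python's built-in codecs (nothing Calibre-specific; the function name is just for illustration) can list every byte value that decodes differently:

```python
def differing_bytes(enc_a="iso-8859-1", enc_b="iso-8859-15"):
    """Return (byte value, char under enc_a, char under enc_b) where they differ."""
    diffs = []
    for value in range(256):
        byte = bytes([value])
        a = byte.decode(enc_a)
        b = byte.decode(enc_b)
        if a != b:
            diffs.append((value, a, b))
    return diffs

for value, a, b in differing_bytes():
    print(f"0x{value:02X}: {a!r} vs {b!r}")
```

Only eight byte values differ, and 0xA4 is the one that gives the Euro sign under ISO-8859-15.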

Sciamano · 03-25-2011, 08:24 AM

Thanks, I'll put ISO-8859-1 then.

oneillpt · 03-25-2011, 09:28 AM

Quote:

Originally Posted by Starson17

FYI, the site seems to specify encoding = 'ISO-8859-1', not 'ISO-8859-15'. The difference is subtle, but they differ in a few spots, particularly as to the Euro symbol. If he sees a missing Euro symbol, that's why.

When I look at the source for http://veleno.inter.it/aas/rss/index_full_it.xml I see:

Code:

<?xml version="1.0" encoding="ISO-8859-15"?>

I followed one of the story links but the source for that page did not show any specified encoding. Using ISO-8859-1 does also seem to solve the character problem for Giovedě -> Giovedì, but unfortunately the only references I could find today to the Euro were to amounts specified as 'xx euro', not using the Euro symbol. Perhaps when the next transfer is reported the Euro symbol will be used and provide a test to decide which encoding should be used.

Sciamano · 03-25-2011, 09:35 AM

I'll specify ISO-8859-15 until the Euro symbol comes up (but I doubt it ever will)

Starson17 · 03-25-2011, 09:48 AM

Quote:

Originally Posted by oneillpt

When I look at the source for http://veleno.inter.it/aas/rss/index_full_it.xml I see:

Code:

<?xml version="1.0" encoding="ISO-8859-15"?>

I looked at the character encoding specified in the HTTP headers for an article I followed. In fact, here it is again for the first article listed in the feed:

Code:

Content-Type: text/html; charset=ISO-8859-1

(I used Firefox and the Live HTTP Headers plugin.) I suspected you'd found a different encoding listed somewhere in the source, so I wasn't trying to disagree with you, just to point out what I'd found and what symptoms the difference might cause. The 8859 family is pretty uniform, but 8859-1 is far more common than 8859-15, so I thought I'd mention what I'd seen.
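
The same check can be done without a browser plugin. This is a rough sketch using only the Python standard library; the two function names are made up for illustration, and charset_from_url needs network access, so it is defined but not called here.

```python
from email.message import Message
from urllib.request import urlopen

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

def charset_from_url(url):
    """Fetch a URL and report the charset the server declares (needs network)."""
    with urlopen(url) as response:
        return response.headers.get_content_charset()

print(charset_from_content_type("text/html; charset=ISO-8859-1"))
```

Note that get_content_charset() returns the name in lower case, and None when the header carries no charset parameter.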

In fact, part of the reason I looked was to see how you might have come up with the answer. I don't often have character encoding problems (mostly I work in English), so I was wondering if you'd found your answer in source or HTTP headers.

oneillpt · 03-25-2011, 09:17 PM

Quote:

Originally Posted by Starson17

I looked at the character encoding specified in the HTTP headers for an article I followed. In fact, here it is again for the first article listed in the feed:

Code:

Content-Type: text/html; charset=ISO-8859-1

...

In fact, part of the reason I looked was to see how you might have come up with the answer. I don't often have character encoding problems (mostly I work in English), so I was wondering if you'd found your answer in source or HTTP headers.

I looked at the source for the initial feed used in the recipe, http://veleno.inter.it/aas/rss/index_full_it.xml, fuller extract follows:

Code:

<?xml version="1.0" encoding="ISO-8859-15"?>
<rss version="2.0" 
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"
  xmlns:atom="http://www.w3.org/2005/Atom"
>

<channel>
                                                                                                                              
        <title>INTER.IT - IT full</title>
        <link>http://www.inter.it/</link>
        <language>it</language>
	<description>Le notizie ufficiali di inter.it</description>
	<copyright>Copyright 2010 Football Club Internazionale Milano Spa</copyright>

        <atom:link href="http://www.inter.it/aas/rss/index_full_it.xml" rel="self" type="application/rss+xml" />
<item>
<dc:date>2011-03-26T00:10:23+01:00</dc:date><title>Inter Channel: "7 su 7" e non solo...</title>
<description><![CDATA[<img src="http://www.inter.it/aas/img/143867.jpg"><br><br><p><strong>APPIANO GENTILE</strong> - Non perdere gli appuntamenti odierni con il canale tematico nerazzurro: si comincia con la <em>Rassegna Stampa</em>, alle ore 13.30, a cura di Nagaja Beccalossi, mentre alle 19.30 l'appuntamento &egrave; con <em>Internews</em>, in studio Alessandro Villa.</p>
<p>Inoltre, alle 17 e in replica alle 23, torna &quot;7 su 7&quot;, la rubrica a cura della redazione che ci riassume i fatti principali dal 19 marzo ad oggi.</p>
<p>&nbsp;</p><br><br>]]></description>
<link>http://www.inter.it/aas/news/reader?N=52072&amp;L=it</link>
<guid>http://www.inter.it/aas/news/reader?N=52072&amp;L=it</guid>
</item>
<item>
<dc:date>2011-03-25T23:03:34+01:00</dc:date><title>Thiago Motta: "Grazie Italia, così sono felice"</title>
...

although the Page Info for this page shown by FireFox using the right-click context menu is UTF-8. When I follow links I do find that the Page Info indicates ISO-8859-1, although the actual source contains no encoding declaration.

I'm more inclined to trust the explicit declaration as ISO-8859-15 in the initial page, and to assume that subsequent pages will have been produced using the same encoding. The encoding reported in the headers will depend on the server configuration, and may or may not be reliable.
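
For comparison with the headers, the declaration in the feed can also be read programmatically. A small sketch, assuming the document starts with an XML declaration in an ASCII-compatible encoding (the pattern and function name are my own, for illustration):

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
XML_DECL = re.compile(rb'^\s*<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')

def declared_xml_encoding(raw):
    """Return the encoding named in the XML declaration, or None if there is none."""
    match = XML_DECL.match(raw)
    return match.group(1).decode("ascii") if match else None

sample = b'<?xml version="1.0" encoding="ISO-8859-15"?>\n<rss version="2.0">'
print(declared_xml_encoding(sample))
```

This only reports what the document claims about itself; as with the headers, the claim may or may not match the bytes that follow.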

oneillpt · 03-25-2011, 09:23 PM

Quote:

Originally Posted by Sciamano

Thanks!! It worked! Karma+!
Any chance you can help me with removing the rss links and permalink after the date? That would be awesome!

This works:

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1300997108(BasicNewsRecipe):
    title          = u'Inter'
    encoding       = 'ISO-8859-15'
    oldest_article = 7
    max_articles_per_feed = 100

    feeds          = [(u'Inter News', u'http://veleno.inter.it/aas/rss/index_full_it.xml')]
    # Keep both rules in one list: a second remove_tags assignment would replace the first.
    remove_tags    = [
        dict(name='div', attrs={'class':'piccolowww'}),
        dict(name='span', attrs={'style':'padding-left:120px;'}),
    ]

but runs the risk of failure if style="padding-left:120px;" is used with a <span> tag elsewhere in some future page, rather than as now just surrounding these unwanted items.
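
One way to reduce that risk is to drop the span only when it also contains the "permalink" link. The sketch below uses the standalone bs4 package rather than running inside Calibre, and the sample markup is hypothetical (not copied from inter.it). Inside a recipe the same loop would go in a preprocess_html(self, soup) override, working on the soup Calibre passes in; older Calibre versions bundled BeautifulSoup 3, where the calls are spelled findAll and extract.

```python
from bs4 import BeautifulSoup

STYLE = "padding-left:120px;"

def strip_permalink_spans(html):
    """Remove the styled spans only if they hold a 'permalink' link."""
    soup = BeautifulSoup(html, "html.parser")
    for span in soup.find_all("span", attrs={"style": STYLE}):
        link_texts = [a.get_text().strip().lower() for a in span.find_all("a")]
        if "permalink" in link_texts:
            span.decompose()
    return str(soup)

# Hypothetical markup, for illustration only.
sample = (
    '<p>Giovedì, 24 Marzo 2011 14:44:03'
    '<span style="padding-left:120px;"><a href="#">permalink</a></span></p>'
    '<span style="padding-left:120px;">keep me</span>'
)
print(strip_permalink_spans(sample))
```

A span with the same style but no permalink link is left alone, which is the case the plain remove_tags rule cannot distinguish.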

Sciamano · 03-27-2011, 02:30 PM

Quote:

Originally Posted by oneillpt

This works:

but runs the risk of failure if style="padding-left:120px;" is used with a <span> tag elsewhere in some future page, rather than as now just surrounding these unwanted items.

Yes, I've tried that too, but then I put it aside because I noticed that it would possibly delete also other parts of the page. Sorry I did not specify this before.
I asked here because I'm a recipe newbie, and thought that there might have been a better way to achieve this, which I couldn't find.
If this is the only solution available, I'll keep those images and permalink, it's not a big deal. It was more a curiosity.
Thanks, though!

Starson17 · 03-27-2011, 05:41 PM

Quote:

Originally Posted by Sciamano

If this is the only solution available, I'll keep those images and permalink, it's not a big deal.

You said it was "after a date," so you can look into BeautifulSoup's findNext or findNextSibling methods (or the next and nextSibling attributes) for locating a tag relative to another tag, such as the date.

Sciamano · 03-28-2011, 05:44 AM

Quote:

Originally Posted by Starson17

You said it was "after a date," so you can look into BeautifulSoup's Next or NextSibling method of locating a tag relative to another tag, such as the date.

I tried taking a look at these, but they look way too complicated for my less-than-basic skills. I think I'll just cope with the images and the permalink.
Thanks anyway.

oneillpt · 03-28-2011, 09:59 PM

Quote:

Originally Posted by oneillpt

...

although the Page Info for this page shown by FireFox using the right-click context menu is UTF-8. When I follow links I do find that the Page Info indicates ISO-8859-1, although the actual source contains no encoding declaration.

I'm more inclined to trust the explicit declaration as ISO-8859-15 in the initial page, and to assume that subsequent pages will have been produced using the same encoding. The encoding reported in the headers will depend on the server configuration, and may or may not be reliable.

A few further comments on the choice of encoding. Although I said above that I'm more inclined to trust the explicit encoding declaration in preference to the HTTP headers, it is not unusual to find that the explicit declaration is also wrong, as in the examples further below. For www.inter.it I would suspect that ISO-8859-15 is more likely than ISO-8859-1, but Windows-1252 might in fact be the true encoding, even though it is neither explicitly declared nor returned by the HTTP headers, and can be neither confirmed nor ruled out on the basis of the web content I have seen so far.

First, why do I suggest ISO-8859-15 is more likely than ISO-8859-1? Quite simply, for countries within the Euro zone, such as Italy, the Euro symbol is likely to occur in news text. The Euro sign and a few characters used in Finnish and French are missing in ISO-8859-1. ISO-8859-15 updates ISO-8859-1 by introducing these characters, replacing a few infrequently used characters in ISO-8859-1. So any Finnish or French text which may require the characters missing from ISO-8859-1, or text using the Euro symbol, is likely to be ISO-8859-15, Windows-1252 or UTF-8, not ISO-8859-1, even if it explicitly claims to be ISO-8859-1 or arrives with HTTP headers claiming ISO-8859-1. In fact, if it contains the Euro symbol as a raw byte (rather than as an HTML entity), it cannot be ISO-8859-1.

Then why Windows-1252? Text encoded as Windows-1252 is often wrongly described as ISO-8859-1. Windows-1252 is a superset of the printable characters of ISO-8859-1 which includes the additional characters introduced in ISO-8859-15, but mapped differently, to byte codes in the 0x80-0x9F range which held no printable characters in ISO-8859-1, rather than replacing infrequently used characters in ISO-8859-1. For example, the Euro symbol is mapped to 0x80 in Windows-1252, and to 0xA4 in ISO-8859-15, where it replaces the infrequently used generic currency sign (¤). A hex editor can be used to check whether the encoding is Windows-1252 or ISO-8859-15 if any of these additional characters is present. For example, if the Euro symbol is found to be represented by 0x80, then the encoding is Windows-1252, if it is represented by 0xA4 it is ISO-8859-15, and if it is not represented by either of these codes it is likely to be UTF-8, where it is represented by three bytes, 0xE2 0x82 0xAC (the code point itself is U+20AC). The four encodings discussed in this paragraph are the most likely candidates for "latin" text, although for example Hungarian, Irish and Welsh may require UTF-8 or a different encoding for a full character set.
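
If no hex editor is at hand, the byte patterns to look for can be printed with Python's built-in codecs. A minimal sketch (the helper name euro_fits is made up for illustration):

```python
EURO = "\u20ac"  # the Euro sign

# How the Euro sign is stored under each candidate encoding.
for name in ("cp1252", "iso-8859-15", "utf-8"):
    print(name, EURO.encode(name).hex())

def euro_fits(encoding):
    """True if the Euro sign can be encoded at all in `encoding`."""
    try:
        EURO.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(euro_fits("iso-8859-1"))
```

Here cp1252 is Python's name for Windows-1252. ISO-8859-1 has no slot for the Euro sign at all, so the last call reports that it does not fit.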

Finally, two recent examples where I have had to use an encoding in the Calibre recipe which is different from the explicit encoding in the HTML input (and in the first case from the encoding returned in the HTTP headers):

(1) http://www.ladepeche.fr

http://www.ladepeche.fr/article/2011...o-du-c-ur.html

The HTML source explicitly claims ISO-8859-1, and the HTTP headers also claim ISO-8859-1. The text however contains "œ" [o and e run together as one character, in case this character does not display correctly in your browser]. This character is not included in ISO-8859-1. The actual encoding is Windows-1252, where the character is encoded as 0x9C. The Calibre recipe needs " encoding = 'Windows-1252' " in order to get the correct character displayed in the e-book.

(2) http://www.independent.ie

http://www.independent.ie/national-n...e-2595418.html

The HTML source explicitly claims "charset=utf-8". The HTTP headers shown in Firefox indicate ISO-8859-1. The Calibre built-in recipe does not specify an encoding, so UTF-8 will be used. In this case, although the Euro symbol appears in the text, encoding is not a problem as it appears as an HTML entity, not as a single byte code. The pound (£) sign however also appears in the text, and as the single byte ISO-8859-1 code. This is not the appropriate UTF-8 encoding, and gives an invalid UTF-8 byte sequence, so Calibre follows the rules for a UTF-8 decoder and replaces the pound sign by the "replacement character" � [white question mark on black diamond background if your browser does not display it correctly] to indicate the invalid UTF-8 byte encountered. " encoding = 'ISO-8859-1' " is needed in the recipe to obtain the correct display in the generated e-book. (In fact, since the Euro symbol appears as an HTML entity, it is possible that the encoding should be either Windows-1252 or ISO-8859-15. If at some point the Euro symbol appears as a single byte code rather than an HTML entity it may become necessary to specify one of these two encodings instead of ISO-8859-1, depending on which single byte encoding is used).
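
Both cases can be reproduced with a handful of bytes. This is only a sketch: the sample strings are my own, not copied from the two newspapers, and it shows plain Python decoding rather than what Calibre does internally.

```python
# Case 1: "œ" stored as the Windows-1252 byte 0x9C (sample word chosen for illustration).
raw_fr = b"c\x9cur"
as_cp1252 = raw_fr.decode("cp1252")      # decodes to the intended ligature
as_latin1 = raw_fr.decode("iso-8859-1")  # 0x9C becomes an invisible C1 control character

# Case 2: "£" stored as the ISO-8859-1 byte 0xA3 in a page that claims UTF-8.
raw_ie = b"\xa3100"
try:
    raw_ie.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False                    # a lone 0xA3 is not valid UTF-8

lenient = raw_ie.decode("utf-8", errors="replace")  # substitutes U+FFFD
as_pound = raw_ie.decode("iso-8859-1")

print(as_cp1252, repr(as_latin1))
print(strict_ok, lenient, as_pound)
```

In both cases the bytes are fine and only the label is wrong, which is why overriding the encoding in the recipe is enough.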

Both the example URLs just given above are still live. When they disappear it will become necessary to browse for other pages at the two newspapers to find similar examples.

oneillpt · 03-31-2011, 03:42 PM

I've posted some further information relevant to the inter.it recipe at https://www.mobileread.com/forums/sho...36#post1471536, in a new thread, as it illustrates a different aspect which has relevance to other possible recipes.
