#1 |
Member
Posts: 23
Karma: 10
Join Date: Mar 2011
Device: Kindle 3
|
Recipe for www.inter.it - some letters are wrong!
Hi everyone,
I'm new to the boards and need some help with a recipe. I'm going to be a commuter soon, so I wanted to create a recipe to download all the news that gets published on my favorite (Italian) soccer team's website. This is the link to the RSS feed: http://veleno.inter.it/aas/rss/index_full_it.xml

I've created this very simple custom recipe:

Code:
    from calibre.web.feeds.news import BasicNewsRecipe

    class AdvancedUserRecipe1300997108(BasicNewsRecipe):
        title = u'Inter'
        oldest_article = 7
        max_articles_per_feed = 100
        feeds = [(u'Inter News', u'http://veleno.inter.it/aas/rss/index_full_it.xml')]
        remove_tags = [dict(name='div', attrs={'class':'piccolowww'})]

For example, this is what it should read for today's news:
Giovedì, 24 Marzo 2011 14:44:03
But this is what I find in the resulting eBook:
Giovedě, 24 Marzo 2011 14:44:03
(see? the "ì" has been transformed to "ě")

Not a big deal, I can live with that, but since I'm a perfectionist, I'd like to solve it. Also, if someone could help me remove the RSS logo images and the "permalink" link after the date, that would be great! I've tried but was not successful. Thanks!! |
#2 | |
Connoisseur
Posts: 63
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Quote:
Code:
    from calibre.web.feeds.news import BasicNewsRecipe

    class AdvancedUserRecipe1300997108(BasicNewsRecipe):
        title = u'Inter'
        # Decode the feed and articles with the encoding the feed declares
        encoding = 'ISO-8859-15'
        oldest_article = 7
        max_articles_per_feed = 100
        feeds = [(u'Inter News', u'http://veleno.inter.it/aas/rss/index_full_it.xml')]
        remove_tags = [dict(name='div', attrs={'class':'piccolowww'})]
|
|
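A possible explanation for the "ì" turning into "ě" (an assumption, not something confirmed in this thread): the byte 0xEC is "ì" in ISO-8859-1 and ISO-8859-15 but "ě" in Central European encodings such as ISO-8859-2, so a wrong guess at the encoding would produce exactly this substitution. A minimal sketch in plain Python, independent of Calibre:

```python
# Encode the sample word; the final "ì" becomes the single byte 0xEC
raw = "Giovedì".encode("iso-8859-15")

# Decoded with the encoding the feed declares, the text round-trips correctly
as_latin9 = raw.decode("iso-8859-15")

# Decoded as ISO-8859-2 (Central European), byte 0xEC becomes "ě" instead
as_latin2 = raw.decode("iso-8859-2")

print(as_latin9)
print(as_latin2)
```

Setting `encoding` in the recipe removes the guesswork by naming the decoder explicitly.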
#3 |
Member
Posts: 23
Karma: 10
Join Date: Mar 2011
Device: Kindle 3
|
Thanks!! It worked! Karma+!
Any chance you can help me with removing the rss links and permalink after the date? That would be awesome! |
#4 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
FYI, the site seems to specify encoding = 'ISO-8859-1', not 'ISO-8859-15'. The difference is subtle, but they differ in a few spots, particularly as to the Euro symbol. If he sees a missing Euro symbol, that's why.
|
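To see exactly where the two encodings differ, here is a small sketch that walks all 256 byte values and reports the ones that decode differently. It uses only Python's built-in codecs; the function name `differing_bytes` is made up for this example.

```python
def differing_bytes(enc_a="iso-8859-1", enc_b="iso-8859-15"):
    """Return (byte value, char in enc_a, char in enc_b) for every differing byte."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        char_a = raw.decode(enc_a)
        char_b = raw.decode(enc_b)
        if char_a != char_b:
            diffs.append((value, char_a, char_b))
    return diffs


for value, char_a, char_b in differing_bytes():
    print(f"0x{value:02X}: {char_a!r} in ISO-8859-1, {char_b!r} in ISO-8859-15")
```

Only a handful of positions differ, including 0xA4, which holds the generic currency sign in ISO-8859-1 and the Euro sign in ISO-8859-15.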
#5 |
Member
Posts: 23
Karma: 10
Join Date: Mar 2011
Device: Kindle 3
|
Thanks, I'll put ISO-8859-1 then.
|
#6 | |
Connoisseur
Posts: 63
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Quote:
Code:
<?xml version="1.0" encoding="ISO-8859-15"?> |
|
#7 |
Member
Posts: 23
Karma: 10
Join Date: Mar 2011
Device: Kindle 3
|
I'll specify ISO-8859-15 until the Euro symbol comes up (but I doubt it ever will)
|
#8 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Code:
    Content-Type: text/html; charset=ISO-8859-1

In fact, part of the reason I looked was to see how you might have come up with the answer. I don't often have character encoding problems (mostly I work in English), so I was wondering if you'd found your answer in source or HTTP headers.

Last edited by Starson17; 03-25-2011 at 09:57 AM. |
|
#9 | |
Connoisseur
Posts: 63
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Quote:
Code:
    <?xml version="1.0" encoding="ISO-8859-15"?>
    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:atom="http://www.w3.org/2005/Atom" >
    <channel>
      <title>INTER.IT - IT full</title>
      <link>http://www.inter.it/</link>
      <language>it</language>
      <description>Le notizie ufficiali di inter.it</description>
      <copyright>Copyright 2010 Football Club Internazionale Milano Spa</copyright>
      <atom:link href="http://www.inter.it/aas/rss/index_full_it.xml" rel="self" type="application/rss+xml" />
      <item>
        <dc:date>2011-03-26T00:10:23+01:00</dc:date><title>Inter Channel: "7 su 7" e non solo...</title>
        <description><![CDATA[<img src="http://www.inter.it/aas/img/143867.jpg"><br><br><p><strong>APPIANO GENTILE</strong> - Non perdere gli appuntamenti odierni con il canale tematico nerazzurro: si comincia con la <em>Rassegna Stampa</em>, alle ore 13.30, a cura di Nagaja Beccalossi, mentre alle 19.30 l'appuntamento è con <em>Internews</em>, in studio Alessandro Villa.</p> <p>Inoltre, alle 17 e in replica alle 23, torna "7 su 7", la rubrica a cura della redazione che ci riassume i fatti principali dal 19 marzo ad oggi.</p> <p> </p><br><br>]]></description>
        <link>http://www.inter.it/aas/news/reader?N=52072&L=it</link>
        <guid>http://www.inter.it/aas/news/reader?N=52072&L=it</guid>
      </item>
      <item>
        <dc:date>2011-03-25T23:03:34+01:00</dc:date><title>Thiago Motta: "Grazie Italia, così sono felice"</title>
        ...

I'm more inclined to trust the explicit declaration as ISO-8859-15 in the initial page, and to assume that subsequent pages will have been produced using the same encoding. The encoding reported in the headers will depend on the server configuration, and may or may not be reliable. |
|
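For comparing the two declarations side by side, here is a rough sketch that pulls the encoding out of an XML declaration and the charset out of a Content-Type header value. The helper names `xml_declared_encoding` and `header_charset` are invented for this example, and the regular expressions are a simplification rather than a full parser; the sample inputs are the two values quoted in this thread.

```python
import re


def xml_declared_encoding(raw: bytes):
    """Return the encoding named in a leading XML declaration, or None."""
    match = re.match(rb'\s*<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    return match.group(1).decode("ascii") if match else None


def header_charset(content_type: str):
    """Return the charset parameter of a Content-Type header value, or None."""
    match = re.search(r'charset=["\']?([A-Za-z0-9._-]+)', content_type, re.IGNORECASE)
    return match.group(1) if match else None


feed_start = b'<?xml version="1.0" encoding="ISO-8859-15"?>\n<rss version="2.0">'
header_value = "text/html; charset=ISO-8859-1"

declared = xml_declared_encoding(feed_start)
from_header = header_charset(header_value)

if declared and from_header and declared.lower() != from_header.lower():
    print(f"Mismatch: document says {declared}, server header says {from_header}")
```

When the two disagree, neither is guaranteed to be right, which is why looking at the actual bytes is the more reliable test.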
#10 | |
Connoisseur
Posts: 63
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Quote:
Code:
    from calibre.web.feeds.news import BasicNewsRecipe

    class AdvancedUserRecipe1300997108(BasicNewsRecipe):
        title = u'Inter'
        encoding = 'ISO-8859-15'
        oldest_article = 7
        max_articles_per_feed = 100
        feeds = [(u'Inter News', u'http://veleno.inter.it/aas/rss/index_full_it.xml')]
        # Keep both rules in one list; a second "remove_tags = ..." line would replace the first
        remove_tags = [
            dict(name='div', attrs={'class':'piccolowww'}),
            dict(name='span', attrs={'style':'padding-left:120px;'}),
        ]
|
|
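One thing to watch when adding removal rules: `remove_tags` is an ordinary class attribute, so if it is assigned twice in the class body, only the last list survives. The sketch below shows this with two stand-in classes (plain Python classes invented for this example, not Calibre's `BasicNewsRecipe`):

```python
class TwoAssignments:
    # The second assignment rebinds the name, so the 'div' rule is lost
    remove_tags = [dict(name='div', attrs={'class': 'piccolowww'})]
    remove_tags = [dict(name='span', attrs={'style': 'padding-left:120px;'})]


class OneList:
    # Both rules live in a single list, so both are kept
    remove_tags = [
        dict(name='div', attrs={'class': 'piccolowww'}),
        dict(name='span', attrs={'style': 'padding-left:120px;'}),
    ]


print(len(TwoAssignments.remove_tags))
print(len(OneList.remove_tags))
```

Matching on an inline `style` string is also fragile; if the site changes its markup even slightly, that rule will stop matching.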
#11 | |
Member
Posts: 23
Karma: 10
Join Date: Mar 2011
Device: Kindle 3
|
Quote:
I asked here because I'm a recipe newbie, and thought that there might have been a better way to achieve this, which I couldn't find. If this is the only solution available, I'll keep those images and permalink, it's not a big deal. It was more a curiosity. Thanks, though! |
|
#12 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
#13 | |
Member
Posts: 23
Karma: 10
Join Date: Mar 2011
Device: Kindle 3
|
Quote:
Thanks anyway. |
|
#14 | |
Connoisseur
Posts: 63
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Quote:
First, why do I suggest ISO-8859-15 is more likely than ISO-8859-1? Quite simply, for countries within the Euro zone, such as Italy, the Euro symbol is likely to occur in news text. The Euro sign and a few characters used in Finnish and French are missing in ISO-8859-1. ISO-8859-15 updates ISO-8859-1 by introducing these characters, replacing a few infrequently used characters in ISO-8859-1. So any Finnish or French text which may require the characters missing from ISO-8859-1, or text using the Euro symbol, is likely to be ISO-8859-15, Windows-1252 or UTF-8, not ISO-8859-1, even if it explicitly claims to be ISO-8859-1 or arrives with HTTP headers claiming ISO-8859-1. In fact, if it uses the Euro symbol, it cannot be ISO-8859-1.

Then why Windows-1252? Text encoded as Windows-1252 is often wrongly described as ISO-8859-1. Windows-1252 is a superset of ISO-8859-1 which includes the additional characters introduced in ISO-8859-15, but mapped differently: to byte codes in the range 0x80-0x9F, which ISO-8859-1 does not use for printable characters, rather than replacing infrequently used characters in ISO-8859-1. For example, the Euro symbol is mapped to 0x80 in Windows-1252, and to 0xA4 in ISO-8859-15, where it causes an infrequently used character (the generic currency sign) to be dropped.

A hex editor can be used to check whether the encoding is Windows-1252 or ISO-8859-15 if any of these additional characters is present. For example, if the Euro symbol is found to be represented by 0x80, then the encoding is Windows-1252; if it is represented by 0xA4, it is ISO-8859-15; and if it is not represented by either of these codes, it is likely to be UTF-8, where the Euro symbol (U+20AC) is the three-byte sequence 0xE2 0x82 0xAC.

The four encodings discussed above are the most likely candidates for "latin" text, although for example Hungarian, Irish and Welsh may require UTF-8 or a different encoding for a full character set.
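The hex-editor check can also be scripted. This is only a sketch under a strong assumption: the text is already known to contain a Euro sign, since a byte such as 0xA4 is a perfectly valid (different) character in ISO-8859-1. The function name `guess_euro_encoding` is invented for this example. Note that in UTF-8 the Euro sign is the three-byte sequence 0xE2 0x82 0xAC.

```python
EURO_UTF8 = "\u20ac".encode("utf-8")  # the three bytes 0xE2 0x82 0xAC


def guess_euro_encoding(raw: bytes) -> str:
    """Guess the encoding of bytes assumed to contain a Euro sign somewhere."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        pass  # not valid UTF-8, so fall through to the single-byte checks
    else:
        # Valid UTF-8: only claim UTF-8 if the Euro sequence is really there
        return "utf-8" if EURO_UTF8 in raw else "undetermined"
    if b"\x80" in raw:
        return "windows-1252"  # Euro at 0x80
    if b"\xa4" in raw:
        return "iso-8859-15"   # Euro at 0xA4
    return "undetermined"


for codec in ("windows-1252", "iso-8859-15", "utf-8"):
    sample = "Prezzo: 10 \u20ac".encode(codec)
    print(codec, "->", guess_euro_encoding(sample))
```

The result is a hint to confirm by eye, not a detector to rely on blindly.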
Finally, two recent examples where I have had to use an encoding in the Calibre recipe which is different from the explicit encoding in the HTML input (and in the first case from the encoding returned in the HTTP headers):

(1) http://www.ladepeche.fr
http://www.ladepeche.fr/article/2011...o-du-c-ur.html
The HTML source explicitly claims ISO-8859-1, and the HTTP headers also claim ISO-8859-1. The text however contains "œ" [o and e run together as one character, in case this character does not display correctly in your browser]. This character is not included in ISO-8859-1. The actual encoding is Windows-1252, where the character is encoded as 0x9C. The Calibre recipe needs " encoding = 'Windows-1252' " in order to get the correct character displayed in the e-book.

(2) http://www.independent.ie
http://www.independent.ie/national-n...e-2595418.html
The HTML source explicitly claims "charset=utf-8". The HTTP headers shown in Firefox indicate ISO-8859-1. The Calibre built-in recipe does not specify an encoding, so UTF-8 will be used. In this case, although the Euro symbol appears in the text, encoding is not a problem as it appears as an HTML entity, not as a single byte code. The pound (£) sign however also appears in the text, as the single-byte ISO-8859-1 code 0xA3. This is not the UTF-8 encoding of the character and gives an invalid UTF-8 byte sequence, so Calibre follows the rules for a UTF-8 decoder and replaces the pound sign by the "replacement character" � [white question mark on black diamond background if your browser does not display it correctly] to indicate the invalid UTF-8 byte encountered. " encoding = 'ISO-8859-1' " is needed in the recipe to obtain the correct display in the generated e-book. (In fact, since the Euro symbol appears as an HTML entity, it is possible that the encoding should be either Windows-1252 or ISO-8859-15. If at some point the Euro symbol appears as a single byte code rather than an HTML entity, it may become necessary to specify one of these two encodings instead of ISO-8859-1, depending on which single byte encoding is used.)

Both the example URLs just given above are still live. When they disappear it will become necessary to browse for other pages at the two newspapers to find similar examples. |
|
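Both effects can be reproduced in plain Python. The byte strings below are hypothetical samples made up for illustration, not bytes copied from the two newspaper pages.

```python
# Case (1): "cœur" stored as Windows-1252, where "œ" is the byte 0x9C
ladepeche_bytes = b"c\x9cur"
as_cp1252 = ladepeche_bytes.decode("windows-1252")  # correct: "œ" appears
as_latin1 = ladepeche_bytes.decode("iso-8859-1")    # wrong: 0x9C becomes a control character

# Case (2): a pound sign stored as the single byte 0xA3 in a page claiming UTF-8
independent_bytes = b"\xa3100"
pound_as_utf8 = independent_bytes.decode("utf-8", errors="replace")  # U+FFFD replaces the bad byte
pound_as_latin1 = independent_bytes.decode("iso-8859-1")             # correct: pound sign appears

print(as_cp1252, repr(as_latin1))
print(pound_as_utf8, pound_as_latin1)
```

In a recipe, the equivalent of choosing the right `decode` call is setting `encoding` to the codec that matches the actual bytes.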
#15 |
Connoisseur
Posts: 63
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
I've posted some further information relevant to the inter.it recipe at https://www.mobileread.com/forums/sho...36#post1471536, in a new thread, as it illustrates a different aspect which has relevance to other possible recipes.
|
Tags |
calcio, inter, recipe, world champions |
|