Updated Irish Times recipe?

leo738 · 03-12-2013, 06:24 AM

Hello All,

The Irish Times website was updated over the last weekend, and since then the recipe seems to be broken. Has anybody come up with an update?

Thanks,

Leo

oneillpt · 03-12-2013, 08:38 AM

Quote:

Originally Posted by leo738

Hello All,

The Irish Times website was updated over the last weekend, and since then the recipe seems to be broken. Has anybody come up with an update?

Thanks,

Leo

The following are the essential changes to get content extracted again:

Code:

encoding  = 'UTF-8'

instead of

Code:

encoding  = 'ISO-8859-15'

Code:

keep_only_tags  = dict(name='article', attrs={'class':'article row'})

instead of any existing keep_only_tags

Code:

remove_tags    = [dict(name='div', attrs={'class':'topics_holder'}),
                  dict(name='div', attrs={'class':'social_article_share'})]

instead of any existing remove_tags.

I'm not posting a complete recipe - mine is rather heavily customised to extract only new articles, but extract all on one chosen day each week.

It looks as if there may be some further changes needed related to the chosen feeds, and I'll add another post here if I find further changes needed, but the changes above should get things going again for now.

oneillpt · 03-12-2013, 11:07 AM

Quote:

Originally Posted by oneillpt

It looks as if there may be some further changes needed related to the chosen feeds.

Some of the old RSS feeds are now redirected to new feeds, and others simply fail. "frontpage", "ireland" and "world" are all redirected to a "news" feed, which for me only extracts when redirected from "frontpage". "finance", "features", "sport" and "opinion" still seem to extract. "letters" redirects to a new "Debate" feed, which no longer contains the letters, and the redirection does not seem to extract (I viewed the feed in a browser).

I no longer see a set of RSS feeds listed as before. These may now be in the process of being phased out in favour of RSS feeds tied to the subscription ePaper - the "Quick User Guide" for the "Newspaper replica view" on the Subscription/Epaper page has an item "Click on [icon] to create an RSS feed to the front page or entire newspaper".

As I already have home delivery of the printed paper, I'm not going to subscribe to the ePaper as well. If I find a stable set of feeds which continue to work in Calibre, I'll post again on this thread. Otherwise it will be a case of availing of the offer of temporary ePaper subscription in place of home delivery when on holiday, which I hope will continue.

leo738 · 03-13-2013, 07:27 AM

Yes, indeed I've tried your suggested fix (full recipe below) but unfortunately it's still unusable. I see from an article on the Irish Times website they are still tweaking it.

However I wonder will a fix be possible?

Leo

Code:

__license__  = 'GPL v3'
__copyright__ = "2008, Derry FitzGerald. 2009 Modified by Ray Kinsella and David O'Callaghan, 2011 Modified by Phil Burns"
'''
irishtimes.com
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe

class IrishTimes(BasicNewsRecipe):
    title          = u'The Irish Times'
    encoding  = 'ISO-8859-15'
    __author__    = "Derry FitzGerald, Ray Kinsella, David O'Callaghan and Phil Burns"
    language = 'en_IE'
    timefmt = ' (%A, %B %d, %Y)'


    oldest_article = 1.0
    max_articles_per_feed  = 100
    no_stylesheets = True
    simultaneous_downloads= 5

    r = re.compile(r'.*(?P<url>http://(www\.irishtimes\.com|rss\.feedsportal\.com/c)/.*\.html?).*')
    remove_tags    = [dict(name='div', attrs={'class':'footer'})]
    extra_css      = 'p, div { margin: 0pt; border: 0pt; text-indent: 0.5em } .headline {font-size: large;} \n .fact { padding-top: 10pt  }'

    feeds          = [
                      ('Frontpage', 'http://www.irishtimes.com/feeds/rss/newspaper/index.rss'),
                      ('Ireland', 'http://www.irishtimes.com/feeds/rss/newspaper/ireland.rss'),
                      ('World', 'http://www.irishtimes.com/feeds/rss/newspaper/world.rss'),
                      ('Finance', 'http://www.irishtimes.com/feeds/rss/newspaper/finance.rss'),
                      ('Features', 'http://www.irishtimes.com/feeds/rss/newspaper/features.rss'),
                      ('Sport', 'http://www.irishtimes.com/feeds/rss/newspaper/sport.rss'),
                      ('Opinion', 'http://www.irishtimes.com/feeds/rss/newspaper/opinion.rss'),
                      ('Letters', 'http://www.irishtimes.com/feeds/rss/newspaper/letters.rss'),
                      ('Magazine', 'http://www.irishtimes.com/feeds/rss/newspaper/magazine.rss'),
                      ('Health', 'http://www.irishtimes.com/feeds/rss/newspaper/health.rss'),
                      ('Education & Parenting', 'http://www.irishtimes.com/feeds/rss/newspaper/education.rss'),
                      ('Motors', 'http://www.irishtimes.com/feeds/rss/newspaper/motors.rss'),
                      ('An Teanga Bheo', 'http://www.irishtimes.com/feeds/rss/newspaper/anteangabheo.rss'),
                      ('Commercial Property', 'http://www.irishtimes.com/feeds/rss/newspaper/commercialproperty.rss'),
                      ('Science Today', 'http://www.irishtimes.com/feeds/rss/newspaper/sciencetoday.rss'),
                      ('Property', 'http://www.irishtimes.com/feeds/rss/newspaper/property.rss'),
                      ('The Tickets', 'http://www.irishtimes.com/feeds/rss/newspaper/theticket.rss'),
                      ('Weekend', 'http://www.irishtimes.com/feeds/rss/newspaper/weekend.rss'),
                      ('News features', 'http://www.irishtimes.com/feeds/rss/newspaper/newsfeatures.rss'),
                      ('Obituaries', 'http://www.irishtimes.com/feeds/rss/newspaper/obituaries.rss'),
                    ]


    def print_version(self, url):
        if url.count('rss.feedsportal.com'):
            #u = url.replace('0Bhtml/story01.htm','_pf0Bhtml/story01.htm')
            u = url.find('irishtimes')
            u = 'http://www.irishtimes.com' + url[u + 12:]
            u = u.replace('0C', '/')
            u = u.replace('A', '')
            u = u.replace('0Bhtml/story01.htm', '_pf.html')
        else:
            u = url.replace('.html','_pf.html')
        return u

    def get_article_url(self, article):
        return article.link

frisket · 03-13-2013, 06:21 PM

Quote:

Originally Posted by oneillpt

The following are the essential changes to get content extracted again:

The only place I can find the recipe is in /opt/calibre/resources/builtin_recipes.zip

Is that really where it's kept? Or should there be a disk file for irish_times?

Quote:

Originally Posted by oneillpt

Code:

encoding  = 'UTF-8'

instead of

Code:

encoding  = 'ISO-8859-15'

Code:

keep_only_tags  = dict(name='article', attrs={'class':'article row'})

instead of any existing keep_only_tags

I didn't find any keep_only_tags.

Quote:

Originally Posted by oneillpt

Code:

remove_tags    = [dict(name='div', attrs={'class':'topics_holder'}),
                  dict(name='div', attrs={'class':'social_article_share'})]

instead of any existing remove_tags.

That was too easy :-)

Thanks for the pointers!
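
For anyone else wondering about builtin_recipes.zip: it is an ordinary zip archive, so the recipe source can be read out of it without unpacking or modifying anything. A minimal sketch, assuming the path quoted above and assuming the recipe's file name contains "irish" (I have not checked the exact member name); `find_recipes` and `read_recipe` are just illustrative helper names:

```python
import os
import zipfile


def find_recipes(zip_path, needle):
    """Return archive member names containing `needle` (case-insensitive)."""
    with zipfile.ZipFile(zip_path) as archive:
        return [name for name in archive.namelist()
                if needle.lower() in name.lower()]


def read_recipe(zip_path, member):
    """Return the text of one member of the archive, decoded as UTF-8."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(member).decode('utf-8')


if __name__ == '__main__':
    # Path as quoted in the post above; it will differ on other installs.
    path = '/opt/calibre/resources/builtin_recipes.zip'
    if os.path.exists(path):
        for name in find_recipes(path, 'irish'):
            print(name)
```

This only reads the archive; the copy you actually change should still be a custom recipe rather than the file inside the zip.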

frisket · 03-13-2013, 06:37 PM

Quote:

Originally Posted by leo738

Yes, indeed I've tried your suggested fix (full recipe below) but unfortunately it's still unusable. I see from an article on the Irish Times website they are still tweaking it.

However I wonder will a fix be possible?

It sounds as if we need to wait until it settles down. I went through the log file of an attempt just now, using the fixes oneillpt posted, and there are dozens of broken links (RSS feeds that no longer exist). It should, with some effort, be possible to identify them by inspection, and find the equivalent (or not) on the new web site.

However, given the Irish news industry's ignorance of the Internet, and linking in particular, I wouldn't hold out too much hope that they will actually expose feeds for much longer, as they don't seem to want people to link to them.

///Peter
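
Rather than reading through the whole log, the dead and redirected feeds can be spotted by requesting each feed URL and comparing the address you asked for with the one you ended up at. A rough sketch using only the standard library (Python 3); `check_feed` is a made-up helper name, and the `fetch` parameter exists only so the function can be tried without touching the network:

```python
import urllib.error
import urllib.request


def check_feed(url, fetch=urllib.request.urlopen, timeout=10):
    """Return (status, final_url, redirected) for one feed URL.

    status is None when the request fails; final_url then holds the error text.
    """
    try:
        with fetch(url, timeout=timeout) as resp:
            final_url = resp.geturl()
            return resp.status, final_url, final_url != url
    except (urllib.error.URLError, OSError) as err:
        return None, str(err), False


# Example use against the feed list from the recipe (needs network access):
# for title, url in feeds:
#     print(title, check_feed(url))
```

A feed that comes back with `redirected` set to True is one of the cases oneillpt described, where several old section feeds now end up at the same new feed.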

leo738 · 03-14-2013, 06:46 AM

Extract from:

http://oldbugs.calibre-ebook.com/wiki/RecipeTips

NOTE: you are strongly advised NOT to edit the built-in recipes directly from the recipes folder!

The second method is the recommended one and here is how you go about it.

In the main window of calibre click the little arrow next to the "Fetch News" button and then click on "Add a custom news source".
A new window opens up and on the bottom left corner click on "Customize builtin recipe".
Now a little window opens up with a drop down box where you can pick the recipe of the news source you wish to customize.
Once you have chosen a particular news source it should appear in the list on the left column of the window.
Select it in the left column and the recipe will appear on the right column of the window.

leo738 · 03-30-2013, 09:18 AM

Hello All,

I had a look at some of the links & it's possible to get the recipe working, but it's not as extensive as the previous version, missing the magazine & lots of other sections. It's a shame but at least it's something. The only sections are now:

News
Business
Debate
Life Style
Culture
Sport

I notice the links contain numbers at the end which may be subject to change, will have to wait & see!

Here's the recipe:

Code:

__license__  = 'GPL v3'
__copyright__ = "2008, Derry FitzGerald. 2009 Modified by Ray Kinsella and David O'Callaghan, 2011 Modified by Phil Burns, 2013 Modified by O. O'H"
'''
irishtimes.com
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe

class IrishTimes(BasicNewsRecipe):
    title          = u'The Irish Times'
    encoding  = 'ISO-8859-15'
    __author__    = "Derry FitzGerald, Ray Kinsella, David O'Callaghan, Phil Burns & O. O'H"
    language = 'en_IE'
    timefmt = ' (%A, %B %d, %Y)'

    oldest_article = 1.0
    max_articles_per_feed  = 100
    no_stylesheets = True
    simultaneous_downloads= 5

    r = re.compile(r'.*(?P<url>http://(www\.irishtimes\.com|rss\.feedsportal\.com/c)/.*\.html?).*')
    keep_only_tags  = dict(name='article', attrs={'class':'article row'})
    remove_tags    = [dict(name='div', attrs={'class':'footer'})]
    extra_css      = 'p, div { margin: 0pt; border: 0pt; text-indent: 0.5em } .headline {font-size: large;} \n .fact { padding-top: 10pt  }'

    feeds          = [
                      ('News', 'http://www.irishtimes.com/cmlink/the-irish-times-news-1.1319192'),
                      ('Business', 'http://www.irishtimes.com/cmlink/the-irish-times-business-1.1319195'),
                      ('Debate', 'http://www.irishtimes.com/cmlink/the-irish-times-news-1.1319192'),
                      ('Life Style', 'http://www.irishtimes.com/cmlink/the-irish-times-life-style-1.1319214'),
                      ('Culture', 'http://www.irishtimes.com/cmlink/the-irish-times-culture-1.1319213'),
                      ('Sport', 'http://www.irishtimes.com/cmlink/the-irish-times-sport-1.1319194'),
                    ]

    def print_version(self, url):
        if url.count('rss.feedsportal.com'):
            #u = url.replace('0Bhtml/story01.htm','_pf0Bhtml/story01.htm')
            u = url.find('irishtimes')
            u = 'http://www.irishtimes.com' + url[u + 12:]
            u = u.replace('0C', '/')
            u = u.replace('A', '')
            u = u.replace('0Bhtml/story01.htm', '_pf.html')
        else:
            u = url.replace('.html','_pf.html')
        return u

    def get_article_url(self, article):
        return article.link
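
In case it is unclear what `print_version` does to each article link, here is a standalone copy of the same logic with sample inputs. Both article URLs are made up for illustration, and the feedsportal one only imitates the escaped form that the code appears to expect; I have not checked either against the live site.

```python
def print_version(url):
    """Standalone copy of the recipe's print_version logic."""
    if 'rss.feedsportal.com' in url:
        start = url.find('irishtimes')
        # Skip 'irishtimes' plus the two characters after it (12 in total).
        u = 'http://www.irishtimes.com' + url[start + 12:]
        u = u.replace('0C', '/')   # '0C' stands in for '/'
        u = u.replace('A', '')     # note: strips every capital 'A' in the URL
        u = u.replace('0Bhtml/story01.htm', '_pf.html')
    else:
        u = url.replace('.html', '_pf.html')
    return u


# Hypothetical sample links, not real articles.
direct = 'http://www.irishtimes.com/newspaper/world/example.html'
proxied = ('http://rss.feedsportal.com/c/1/f/2/s/3/l/'
           '0L0Sirishtimes0N0Cnewspaper0Cworld0Cexample0Bhtml/story01.htm')

print(print_version(direct))
print(print_version(proxied))
```

Both samples come out as the same `_pf.html` address, while a link with no `.html` in it passes through unchanged, so the method only has an effect if the article links still use the old naming.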

smooth · 03-30-2013, 10:03 AM

Thanks very much for the update. Just some feedback for you.

I've tried it with the Kindle 3 and within stories, apostrophes and quote marks get screwed up. They get replaced by a combination of â and then two question marks, each in a box.

The á in Tánaiste also gets screwed up, but á doesn't get printed much, and anyway, it's only the Tánaiste.

leo738 · 04-01-2013, 07:14 AM

Hello,

I stand to be corrected but I think it's something to do with the encoding:

Code:

encoding  = 'ISO-8859-15'

You might try instead:

Code:

encoding  = 'UTF-8'

As per the 2nd post. Hopefully it solves the problem. The recipe following this change would therefore be:

Code:

__license__  = 'GPL v3'
__copyright__ = "2008, Derry FitzGerald. 2009 Modified by Ray Kinsella and David O'Callaghan, 2011 Modified by Phil Burns, 2013 Modified by O. O'H"
'''
irishtimes.com
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe

class IrishTimes(BasicNewsRecipe):
    title          = u'The Irish Times'
    encoding  = 'UTF-8'
    __author__    = "Derry FitzGerald, Ray Kinsella, David O'Callaghan, Phil Burns & O. O'H"
    language = 'en_IE'
    timefmt = ' (%A, %B %d, %Y)'

    oldest_article = 1.0
    max_articles_per_feed  = 100
    no_stylesheets = True
    simultaneous_downloads= 5

    r = re.compile(r'.*(?P<url>http://(www\.irishtimes\.com|rss\.feedsportal\.com/c)/.*\.html?).*')
    keep_only_tags  = dict(name='article', attrs={'class':'article row'})
    remove_tags    = [dict(name='div', attrs={'class':'footer'})]
    extra_css      = 'p, div { margin: 0pt; border: 0pt; text-indent: 0.5em } .headline {font-size: large;} \n .fact { padding-top: 10pt  }'

    feeds          = [
                      ('News', 'http://www.irishtimes.com/cmlink/the-irish-times-news-1.1319192'),
                      ('Business', 'http://www.irishtimes.com/cmlink/the-irish-times-business-1.1319195'),
                      ('Debate', 'http://www.irishtimes.com/cmlink/the-irish-times-news-1.1319192'),
                      ('Life Style', 'http://www.irishtimes.com/cmlink/the-irish-times-life-style-1.1319214'),
                      ('Culture', 'http://www.irishtimes.com/cmlink/the-irish-times-culture-1.1319213'),
                      ('Sport', 'http://www.irishtimes.com/cmlink/the-irish-times-sport-1.1319194'),
                    ]

    def print_version(self, url):
        if url.count('rss.feedsportal.com'):
            #u = url.replace('0Bhtml/story01.htm','_pf0Bhtml/story01.htm')
            u = url.find('irishtimes')
            u = 'http://www.irishtimes.com' + url[u + 12:]
            u = u.replace('0C', '/')
            u = u.replace('A', '')
            u = u.replace('0Bhtml/story01.htm', '_pf.html')
        else:
            u = url.replace('.html','_pf.html')
        return u

    def get_article_url(self, article):
        return article.link
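
As a check on the encoding explanation: the symptoms reported (an â followed by two boxes, and a mangled á) are what you would expect if UTF-8 text is decoded as ISO-8859-15. A small sketch that reproduces the effect outside calibre; `misdecode` is just an illustrative helper, and this says nothing about how calibre handles the setting internally:

```python
def misdecode(text, actual='utf-8', assumed='iso-8859-15'):
    """Encode text as `actual`, then decode the bytes as `assumed`."""
    return text.encode(actual).decode(assumed)


# A typographic apostrophe (U+2019) is three bytes in UTF-8.
apostrophe = misdecode(u'\u2019')
# An a-acute (U+00E1), as in "Tánaiste", is two bytes in UTF-8.
a_acute = misdecode(u'\u00e1')

# repr() is used because two of the resulting characters are unprintable.
print(repr(apostrophe))
print(repr(a_acute))
```

The apostrophe turns into an â plus two control characters that have no glyph, which a reader device would typically draw as boxes, and the á turns into two separate characters. With `encoding = 'UTF-8'` the bytes are decoded as what they actually are, so neither substitution should occur.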

leo738 · 04-01-2013, 08:13 AM

Looks like the photos & headlines could do with being resized, anybody know a solution?

Leo
