Quote:
Originally Posted by Tragos
This recipe is not working anymore as Helsingin Sanomat has changed their website structure. Nowadays the print versions of the pages are created using JavaScript.
Here is a revised version that extracts the main news (Uutiset) section. However, the book (Kirjat) and cinema (Elokuvat) sections, which the original version could still extract, are broken by this revision.
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1298137661(BasicNewsRecipe):
    title = u'Helsingin Sanomat'
    __author__ = 'oneillpt custom'
    language = 'fi'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    conversion_options = {
        'linearize_tables': True
    }

    # Removed in this revision (kept for reference):
    # remove_tags = [
    #     dict(name='a', attrs={'id': 'articleCommentUrl'}),
    #     dict(name='p', attrs={'class': 'newsSummary'}),
    #     dict(name='div', attrs={'class': 'headerTools'})
    # ]

    # Added in this revision: keep only the main content container.
    keep_only_tags = [dict(name='div', attrs={'id': 'main-content'})]

    feeds = [
        (u'Uutiset - HS.fi', u'http://www.hs.fi/uutiset/rss/'),
        # Feeds below are commented out because they do not work with this revision:
        # (u'Politiikka - HS.fi', u'http://www.hs.fi/politiikka/rss/'),
        # (u'Ulkomaat - HS.fi', u'http://www.hs.fi/ulkomaat/rss/'),
        # (u'Kulttuuri - HS.fi', u'http://www.hs.fi/kulttuuri/rss/'),
        # (u'Kirjat - HS.fi', u'http://www.hs.fi/kulttuuri/kirjat/rss/'),
        # (u'Elokuvat - HS.fi', u'http://www.hs.fi/kulttuuri/elokuvat/rss/'),
    ]

    # Removed in this revision (kept for reference):
    # def print_version(self, url):
    #     j = url.rfind("/")
    #     s = url[j:]
    #     i = s.rfind("?ref=rss")
    #     if i > 0:
    #         s = s[:i]
    #     return "http://www.hs.fi/tulosta" + s
The revision removes the remove_tags lines, adds a keep_only_tags line, and removes the print_version definition. I have kept the removed lines as comments and commented out the feeds that no longer work. I'll post a new version if I can get these feeds working with the same recipe that now works for the main news feed.
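For anyone wondering what the removed print_version actually did, here is a minimal standalone sketch of the same string handling, pulled out of the recipe so it can be run without calibre. The function name old_print_url and the example article URL are made up for illustration; only the logic is taken from the commented-out method.

```python
def old_print_url(url):
    """Rebuild the old-style print URL from an article URL.

    Same string logic as the commented-out print_version in the recipe,
    written as a plain function (the name is hypothetical).
    """
    j = url.rfind("/")        # position of the last slash
    s = url[j:]               # last path segment, including the slash
    i = s.rfind("?ref=rss")   # strip the RSS tracking suffix if present
    if i > 0:
        s = s[:i]
    return "http://www.hs.fi/tulosta" + s


# Hypothetical article URL, for illustration only.
example = "http://www.hs.fi/kotimaa/artikkeli/Esimerkki/1234567890?ref=rss"
print(old_print_url(example))
```

So the old recipe simply pointed calibre at a "tulosta" (print) page for each article. As Tragos noted, those print pages are now created using JavaScript, so the revised recipe drops this step and filters the normal article page with keep_only_tags instead.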