#1
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
Recipes for RDS.ca, TSN.ca and TheHockeynews.com
Hey guys and gals,
I thought I would be able to sort it out with all the FAQs and great tutorials around here (I've read a lot before posting), but I just can't build my own recipes once things get a bit complicated. I've been working on these three sites for a couple of days, and I can't find a way to retrieve the news correctly. If someone could help me, I'd be forever obliged. Thanks in advance.

http://www.rds.ca/hockey/fildepresse_rds.xml
http://www.tsn.ca/datafiles/rss/Stories.xml
http://www.thehockeynews.com/rss/all_categories.xml
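For what it's worth, here's the kind of bare-bones skeleton I've been starting from (just the three feeds, no cleanup at all; the class name and title are only placeholders):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class HockeyFeeds(BasicNewsRecipe):  # class name and title are placeholders
    title = u'Hockey Feeds'
    oldest_article = 7
    max_articles_per_feed = 10

    feeds = [
        (u'RDS', u'http://www.rds.ca/hockey/fildepresse_rds.xml'),
        (u'TSN', u'http://www.tsn.ca/datafiles/rss/Stories.xml'),
        (u'THN', u'http://www.thehockeynews.com/rss/all_categories.xml'),
    ]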
#2
Member
Posts: 13
Karma: 10
Join Date: Sep 2010
Device: K3
Not much help, but I use this to get the daily hockey headlines from THN:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1283848394(BasicNewsRecipe):
    title = u'Hockey News'
    oldest_article = 1
    max_articles_per_feed = 100

    feeds = [(u'Hockey News', u'http://www.thehockeynews.com/rss/9-Headlines.xml')]
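If you save it as a plain-text .recipe file, you can also test it from the command line before loading it into the GUI; as far as I know, ebook-convert "Hockey News.recipe" .epub --test fetches only a couple of articles per feed as a quick check.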
#3
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
Thanks somedayson; for whatever reason, the RSS link I tried the other day for THN didn't work, but it's working fine now. I've worked on it a bit and gotten some results, except that I'm having a really hard time removing the tags that come after the article. I tried things like remove_tags_after/before, etc., with no success unfortunately.
Here's the recipe. It's not looking pretty, I know, but it's all I could come up with given my knowledge. You have no idea how long it took me...

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1289990851(BasicNewsRecipe):
    title = u'THE HOCKEY NEWS'
    oldest_article = 7
    max_articles_per_feed = 5
    no_stylesheets = True

    remove_tags = [
        dict(name='div', attrs={'class': 'article_info'}),
        dict(name='div', attrs={'class': 'photo_details'}),
        dict(name='div', attrs={'id': 'comments_container'}),
        dict(name='div', attrs={'id': 'add_comment'}),
        dict(name='div', attrs={'id': 'legal_info'}),
        dict(name='div', attrs={'id': 'breadcrumb'}),
        dict(name='div', attrs={'id': 'site_header'}),
        dict(name='div', attrs={'id': 'site_navigation'}),
        dict(name='div', attrs={'id': 'advertisement'}),
        dict(name='div', attrs={'class': 'tool_menu'}),
    ]

    feeds = [(u'THN', u'http://www.thehockeynews.com/rss/all_categories.xml')]
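For the record, this is the sort of thing I tried with remove_tags_before/after; the 'article_text' class below is just a guess on my part at THN's article container, which is probably why it failed:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class THNTrimAttempt(BasicNewsRecipe):  # a test recipe, name is arbitrary
    title = u'THN trim attempt'
    # Everything before the first match of remove_tags_before, and everything
    # after the first match of remove_tags_after, gets dropped. The div class
    # here is a guess, not THN's real markup.
    remove_tags_before = dict(name='div', attrs={'class': 'article_text'})
    remove_tags_after = dict(name='div', attrs={'class': 'article_text'})
    feeds = [(u'THN', u'http://www.thehockeynews.com/rss/all_categories.xml')]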
#4
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
By the way, here's the error message I get for TSN (I used the basic recipe):
Spoiler:
For RDS, I may have an idea; I'll look into it. Anyway, thanks in advance for the help.
#5
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
OK, so here's a nicer recipe for THN:
Spoiler:
And I got the RDS one too. Spoiler:
TSN remains a mystery...
#6
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
This TSN stuff is going to drive me nuts, lol. I can't figure out how to build the recipe. I should have said in my first post that I have no knowledge of Python at all. I can identify HTML tags and "guess" what they correspond to on a web page, but that's pretty much all.
Starting from this page (http://tsn.ca/nhl/story/?id=nhl), I understand I have to use parse_index in my recipe, but I don't know what to do with it. Python is just too much for me. If someone is kind enough to give me a hint, that would be greatly appreciated. I'm not even asking for the full recipe; I'd like to understand the process, but after reading tutorials and guides over and over, I just can't figure out where to start. It's beyond my comprehension. Thanks.
#7
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
Quite a few builtin recipes use parse_index; here's a list to study, with DrawAndCook first since that's the example below:

Code:
DrawAndCook.recipe, akter.recipe, atlantic.recipe, auto_prove.recipe,
axxon_magazine.recipe, billorielly.recipe, borba.recipe, brand_eins.recipe,
businessworldin.recipe, bwmagazine.recipe, calgary_herald.recipe,
comics_com.recipe, cynewslive.recipe, cyprus_weekly.recipe, dani.recipe,
daum_net.recipe, deredactie.recipe, economist.recipe, economist_free.recipe,
edmonton_journal.recipe, el_cultural.recipe, elpais_impreso.recipe,
elpais_semanal.recipe, eluniversalimpresa.recipe, entrepeneur.recipe,
financial_times_uk.recipe, fokkeensukke.recipe, foreignaffairs.recipe,
fstream.recipe, glas_srpske.recipe, go_comics.recipe, guardian.recipe,
haaretz_en.recipe, harpers_full.recipe, hbr.recipe, hbr_blogs.recipe,
hindu.recipe, houston_chronicle.recipe, ieeespectrum.recipe, inc.recipe,
india_today.recipe, instapaper.recipe, johm.recipe, joop.recipe,
kellog_faculty.recipe, kidney.recipe, lamujerdemivida.recipe,
laprensa_ni.recipe, lemonde_dip.recipe, lenta_ru.recipe,
losservatoreromano_it.recipe, lrb_payed.recipe, macleans.recipe,
malaysian_mirror.recipe, milenio.recipe, ming_pao.recipe, monitor.recipe,
montreal_gazette.recipe, national_post.recipe, ncrnext.recipe, nejm.recipe,
new_york_review_of_books.recipe, new_york_review_of_books_no_sub.recipe,
newsweek.recipe, newsweek_polska.recipe, nin.recipe, nymag.recipe

Here is how DrawAndCook.recipe does it:

Code:
def parse_index(self):
    feeds = []
    for title, url in [
        ("They Draw and Cook", "http://www.theydrawandcook.com/")
    ]:
        articles = self.make_links(url)
        if articles:
            feeds.append((title, articles))
    print 'feeds are: ', feeds  # debug output
    return feeds

def make_links(self, url):
    # Find a title and url for each article on the index page
    soup = self.index_to_soup(url)
    date = ''
    current_articles = []
    recipes = soup.findAll('div', attrs={'class': 'date-outer'})
    for recipe in recipes:
        title = recipe.h3.a.string
        page_url = recipe.h3.a['href']
        current_articles.append({'title': title, 'url': page_url, 'description': '', 'date': date})
    return current_articles

The hard part is the list of articles, and that's done in make_links. You need to find a title and a URL for each article. The date and description can be left blank, or filled in, as you prefer. You can find the URL and title for each article on your page (http://tsn.ca/nhl/story/?id=nhl). Just modify the feed title and the URL of your page in parse_index, then modify make_links so that the findAll finds all your links, and the for loop finds the title and page_url for each. Simple.
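If it helps to see the shape of the data, parse_index just has to return a list like this (the titles and URLs below are made up):

Code:
# One (feed_title, list_of_articles) tuple per feed; each article is a
# dict with exactly these four keys.
feeds = [
    (u'Feed One', [
        {'title': u'First article', 'url': 'http://example.com/one',
         'description': '', 'date': ''},
        {'title': u'Second article', 'url': 'http://example.com/two',
         'description': '', 'date': ''},
    ]),
]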
#8
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
Thanks for the help, Starson17. There's no lack of good will on my side, but Python is mumbo jumbo to me.
I think this is the tricky part for me; I'm not sure what to do.

Code:
def make_links(self, url):
    soup = self.index_to_soup(url)
    date = ''
    current_articles = []
    recipes = soup.findAll('div', attrs={'class': 'date-outer'})
Code:
    for recipe in recipes:
        title = recipe.h3.a.string
        page_url = recipe.h3.a['href']
        current_articles.append({'title': title, 'url': page_url, 'description': '', 'date': date})
    return current_articles

I have to modify the "title" and "page_url" lines, right? But same as above, I'm not sure where to look or what to put there. I tried different things and got error messages each time. By the way, I added "from calibre.ebooks.BeautifulSoup import BeautifulSoup" at the beginning of the recipe; I think I have to call that in order to make it work. I'm a lost cause...
#9
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
Start with parse_index. I looked at your page, and I think I was wrong when I said you want one feed. I'd use one feed per day, then put the articles for that day under that feed. Let's do this: you put together as much of the recipe as you can, and post it. I'll look it over. You should have enough to do just the parse_index part. Post that, with the rest of your recipe. Then I'll help with make_links. Post your best shot on that too. You may want to install Firebug in Firefox if you haven't done it yet. Yes, you needed to import BeautifulSoup.
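If you want to poke at the page outside of calibre first, something like this (standalone BeautifulSoup 3 plus urllib2, which is roughly what index_to_soup does for you inside a recipe) will print every link so you can see what your findAll needs to match:

Code:
# Rough standalone experiment; inside a recipe, self.index_to_soup(url)
# replaces the urllib2/BeautifulSoup lines below.
import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen('http://tsn.ca/nhl/story/?id=nhl').read()
soup = BeautifulSoup(html)
# Print every link's href and text so you can spot the article pattern
for a in soup.findAll('a'):
    print a.get('href'), a.string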
#10
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
Here's a start:
Spoiler:
This will get the feed title and a soup ("feed_part") that has links and titles for all the articles for that feed.
#11
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
I had a few minutes to finish parse_index:
Code:
INDEX = 'http://tsn.ca/nhl/story/?id=nhl'

def parse_index(self):
    feeds = []
    soup = self.index_to_soup(self.INDEX)
    # Each 'feature' div on the TSN page becomes one feed
    feed_parts = soup.findAll('div', attrs={'class': 'feature'})
    for feed_part in feed_parts:
        articles = []
        if not feed_part.h2:
            continue
        feed_title = feed_part.h2.string
        article_parts = feed_part.findAll('a')
        for article_part in article_parts:
            article_title = article_part.string
            article_date = ''
            article_url = 'http://tsn.ca/' + article_part['href']
            articles.append({'title': article_title, 'url': article_url, 'description': '', 'date': article_date})
        if articles:
            feeds.append((feed_title, articles))
    return feeds
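One note in case you're pasting this in: INDEX and parse_index both live inside the recipe class, so the overall skeleton looks roughly like this (the class name and title are up to you):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class TSNRecipe(BasicNewsRecipe):  # name is arbitrary
    title = u'TSN'
    no_stylesheets = True
    INDEX = 'http://tsn.ca/nhl/story/?id=nhl'

    def parse_index(self):
        # body exactly as posted above
        pass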
#12
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
Wow. Not a chance I could have come up with something like that. I removed the junk and it works just fine. Thanks a lot, Starson17.
Spoiler:
#13
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
#14
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
The least I can do...