#1
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
Recipes for RDS.ca, TSN.ca and TheHockeynews.com
Hey guys and gals,
I thought I would be able to sort it out with all the FAQs and great tutorials around here (I've read a lot before posting), but I just can't build my own recipes once things get a bit complicated. I've been working on these three sites for a couple of days, and I can't find a way to retrieve the news correctly. If someone could help me, I'd be forever obliged. Thanks in advance.

http://www.rds.ca/hockey/fildepresse_rds.xml
http://www.tsn.ca/datafiles/rss/Stories.xml
http://www.thehockeynews.com/rss/all_categories.xml
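For what it's worth, here's the kind of bare-bones skeleton I've been starting from (just the three feeds, no cleanup at all; the class name and title are only placeholders):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class HockeyFeeds(BasicNewsRecipe):  # class name and title are placeholders
    title = u'Hockey Feeds'
    oldest_article = 7
    max_articles_per_feed = 10

    feeds = [
        (u'RDS', u'http://www.rds.ca/hockey/fildepresse_rds.xml'),
        (u'TSN', u'http://www.tsn.ca/datafiles/rss/Stories.xml'),
        (u'THN', u'http://www.thehockeynews.com/rss/all_categories.xml'),
    ]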
#2
Member
Posts: 13
Karma: 10
Join Date: Sep 2010
Device: K3
Not much help, but I use this to get the daily hockey headlines from THN:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1283848394(BasicNewsRecipe):
    title = u'Hockey News'
    oldest_article = 1
    max_articles_per_feed = 100

    feeds = [(u'Hockey News', u'http://www.thehockeynews.com/rss/9-Headlines.xml')]
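If you save it as a plain-text .recipe file, you can also test it from the command line before loading it into the GUI; as far as I know, ebook-convert "Hockey News.recipe" .epub --test fetches only a couple of articles per feed as a quick check.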
#3
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
Thanks somedayson; for whatever reason, the RSS link I tried the other day for THN didn't work, but it's working fine now. I've worked on it a bit and gotten some results, except that I'm having a really hard time removing the tags that come after the article. I tried things like remove_tags_after/before, etc., with no success unfortunately.
Here's the recipe. It's not looking pretty, I know, but it's all I could come up with given my knowledge. You have no idea how long it took me...

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1289990851(BasicNewsRecipe):
    title = u'THE HOCKEY NEWS'
    oldest_article = 7
    max_articles_per_feed = 5
    no_stylesheets = True

    remove_tags = [
        dict(name='div', attrs={'class': 'article_info'}),
        dict(name='div', attrs={'class': 'photo_details'}),
        dict(name='div', attrs={'id': 'comments_container'}),
        dict(name='div', attrs={'id': 'add_comment'}),
        dict(name='div', attrs={'id': 'legal_info'}),
        dict(name='div', attrs={'id': 'breadcrumb'}),
        dict(name='div', attrs={'id': 'site_header'}),
        dict(name='div', attrs={'id': 'site_navigation'}),
        dict(name='div', attrs={'id': 'advertisement'}),
        dict(name='div', attrs={'class': 'tool_menu'}),
    ]

    feeds = [(u'THN', u'http://www.thehockeynews.com/rss/all_categories.xml')]
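For the record, this is the sort of thing I tried with remove_tags_before/after; the 'article_text' class below is just a guess on my part at THN's article container, which is probably why it failed:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class THNTrimAttempt(BasicNewsRecipe):  # a test recipe, name is arbitrary
    title = u'THN trim attempt'
    # Everything before the first match of remove_tags_before, and everything
    # after the first match of remove_tags_after, gets dropped. The div class
    # here is a guess, not THN's real markup.
    remove_tags_before = dict(name='div', attrs={'class': 'article_text'})
    remove_tags_after = dict(name='div', attrs={'class': 'article_text'})
    feeds = [(u'THN', u'http://www.thehockeynews.com/rss/all_categories.xml')]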
#4
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
By the way, here's the error message I get for TSN (I used the basic recipe):
Spoiler:
For RDS, I may have an idea; I'll look into it. Anyway, thanks in advance for the help.
#5
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
OK, so here's a nicer recipe for THN:
Spoiler:
And I got the RDS one too. Spoiler:
TSN remains a mystery...
#6
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
This TSN stuff is going to drive me nuts, lol. I can't figure out how to build the recipe. I should have said in my first post that I have no knowledge of Python at all. I can identify HTML tags and "guess" what they correspond to on a web page, but that's pretty much all.
Starting from this page (http://tsn.ca/nhl/story/?id=nhl), I understand I have to use parse_index in my recipe, but I don't know what to do with it. Python is just too much for me. If someone is kind enough to give me a hint, that would be greatly appreciated. I'm not even asking for the full recipe; I'd like to understand the process, but after reading tutorials and guides over and over, I just can't figure out where to start. It's beyond my comprehension. Thanks.
#7
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
Quite a few builtin recipes use parse_index; here's a list to study, with DrawAndCook first since that's the example below:

Code:
DrawAndCook.recipe, akter.recipe, atlantic.recipe, auto_prove.recipe,
axxon_magazine.recipe, billorielly.recipe, borba.recipe, brand_eins.recipe,
businessworldin.recipe, bwmagazine.recipe, calgary_herald.recipe,
comics_com.recipe, cynewslive.recipe, cyprus_weekly.recipe, dani.recipe,
daum_net.recipe, deredactie.recipe, economist.recipe, economist_free.recipe,
edmonton_journal.recipe, el_cultural.recipe, elpais_impreso.recipe,
elpais_semanal.recipe, eluniversalimpresa.recipe, entrepeneur.recipe,
financial_times_uk.recipe, fokkeensukke.recipe, foreignaffairs.recipe,
fstream.recipe, glas_srpske.recipe, go_comics.recipe, guardian.recipe,
haaretz_en.recipe, harpers_full.recipe, hbr.recipe, hbr_blogs.recipe,
hindu.recipe, houston_chronicle.recipe, ieeespectrum.recipe, inc.recipe,
india_today.recipe, instapaper.recipe, johm.recipe, joop.recipe,
kellog_faculty.recipe, kidney.recipe, lamujerdemivida.recipe,
laprensa_ni.recipe, lemonde_dip.recipe, lenta_ru.recipe,
losservatoreromano_it.recipe, lrb_payed.recipe, macleans.recipe,
malaysian_mirror.recipe, milenio.recipe, ming_pao.recipe, monitor.recipe,
montreal_gazette.recipe, national_post.recipe, ncrnext.recipe, nejm.recipe,
new_york_review_of_books.recipe, new_york_review_of_books_no_sub.recipe,
newsweek.recipe, newsweek_polska.recipe, nin.recipe, nymag.recipe

Here is how DrawAndCook.recipe does it:

Code:
def parse_index(self):
    feeds = []
    for title, url in [
        ("They Draw and Cook", "http://www.theydrawandcook.com/")
    ]:
        articles = self.make_links(url)
        if articles:
            feeds.append((title, articles))
    print 'feeds are: ', feeds  # debug output
    return feeds

def make_links(self, url):
    # Find a title and url for each article on the index page
    soup = self.index_to_soup(url)
    date = ''
    current_articles = []
    recipes = soup.findAll('div', attrs={'class': 'date-outer'})
    for recipe in recipes:
        title = recipe.h3.a.string
        page_url = recipe.h3.a['href']
        current_articles.append({'title': title, 'url': page_url, 'description': '', 'date': date})
    return current_articles

The hard part is the list of articles, and that's done in make_links. You need to find a title and a URL for each article. The date and description can be left blank, or filled in, as you prefer. You can find the URL and title for each article on your page (http://tsn.ca/nhl/story/?id=nhl). Just modify the feed title and the URL of your page in parse_index, then modify make_links so that the findAll finds all your links, and the for loop finds the title and page_url for each. Simple.
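If it helps to see the shape of the data, parse_index just has to return a list like this (the titles and URLs below are made up):

Code:
# One (feed_title, list_of_articles) tuple per feed; each article is a
# dict with exactly these four keys.
feeds = [
    (u'Feed One', [
        {'title': u'First article', 'url': 'http://example.com/one',
         'description': '', 'date': ''},
        {'title': u'Second article', 'url': 'http://example.com/two',
         'description': '', 'date': ''},
    ]),
]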
#8
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
Thanks for the help, Starson17. There's no lack of good will on my side, but Python is mumbo jumbo to me.
I think this is the tricky part for me; I'm not sure what to do.

Code:
def make_links(self, url):
    soup = self.index_to_soup(url)
    date = ''
    current_articles = []
    recipes = soup.findAll('div', attrs={'class': 'date-outer'})
Code:
    for recipe in recipes:
        title = recipe.h3.a.string
        page_url = recipe.h3.a['href']
        current_articles.append({'title': title, 'url': page_url, 'description': '', 'date': date})
    return current_articles

I have to modify the "title" and "page_url" lines, right? But same as above, I'm not sure where to look or what to put there. I tried different things and got error messages each time. By the way, I added "from calibre.ebooks.BeautifulSoup import BeautifulSoup" at the beginning of the recipe; I think I have to call that in order to make it work. I'm a lost cause...
#9
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
Start with parse_index. I looked at your page, and I think I was wrong when I said you want one feed. I'd use one feed per day, then put the articles for that day under that feed. Let's do this: you put together as much of the recipe as you can, and post it. I'll look it over. You should have enough to do just the parse_index part. Post that, with the rest of your recipe. Then I'll help with make_links. Post your best shot on that too. You may want to install Firebug in Firefox if you haven't done it yet. Yes, you needed to import BeautifulSoup.
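If you want to poke at the page outside of calibre first, something like this (standalone BeautifulSoup 3 plus urllib2, which is roughly what index_to_soup does for you inside a recipe) will print every link so you can see what your findAll needs to match:

Code:
# Rough standalone experiment; inside a recipe, self.index_to_soup(url)
# replaces the urllib2/BeautifulSoup lines below.
import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen('http://tsn.ca/nhl/story/?id=nhl').read()
soup = BeautifulSoup(html)
# Print every link's href and text so you can spot the article pattern
for a in soup.findAll('a'):
    print a.get('href'), a.string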
#10
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
Here's a start:
Spoiler:
This will get the feed title and a soup ("feed_part") that has links and titles for all the articles for that feed.
#11
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
I had a few minutes to finish parse_index:
Code:
INDEX = 'http://tsn.ca/nhl/story/?id=nhl'

def parse_index(self):
    feeds = []
    soup = self.index_to_soup(self.INDEX)
    # Each 'feature' div on the TSN page becomes one feed
    feed_parts = soup.findAll('div', attrs={'class': 'feature'})
    for feed_part in feed_parts:
        articles = []
        if not feed_part.h2:
            continue
        feed_title = feed_part.h2.string
        article_parts = feed_part.findAll('a')
        for article_part in article_parts:
            article_title = article_part.string
            article_date = ''
            article_url = 'http://tsn.ca/' + article_part['href']
            articles.append({'title': article_title, 'url': article_url, 'description': '', 'date': article_date})
        if articles:
            feeds.append((feed_title, articles))
    return feeds
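One note in case you're pasting this in: INDEX and parse_index both live inside the recipe class, so the overall skeleton looks roughly like this (the class name and title are up to you):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class TSNRecipe(BasicNewsRecipe):  # name is arbitrary
    title = u'TSN'
    no_stylesheets = True
    INDEX = 'http://tsn.ca/nhl/story/?id=nhl'

    def parse_index(self):
        # body exactly as posted above
        pass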
#12
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
Wow. Not a chance I could have come up with something like that. I removed the junk and it works just fine. Thanks a lot, Starson17.
Spoiler:
#13
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
#14
Member
Posts: 11
Karma: 10
Join Date: Nov 2010
Location: France
Device: PRS-600
The least I can do...