Quote:
Originally Posted by TonytheBookworm
Starson17,
I went back to the GoComics recipe and tried to follow what you were doing, using what you stated about printing the title, url, and so forth. The code I have currently gets the soup, as indicated in the output.txt file, but then it craps out saying the index is out of range. I thought that was why you put the number of pages to get in a range field. I set mine to 7, as you can see in my code, but again I get index out of range... I feel like the Little Engine that Could, or better yet the ant at the rubber tree plant. I got high hopes, haha.
This is like playing Battleship: I keep firing and firing and getting close, but never a direct hit.
You don't need the range - that was for my special case, where I could calculate the urls. Here you need to scrape them.
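For reference, here's roughly what the calculated-url case looks like - the '?page=N' pattern below is made up for illustration, it's not the actual GoComics code:
Code:
    # Only works when the site uses predictable urls - the
    # '?page=N' pattern here is illustrative only.
    def make_links(self, url):
        current_articles = []
        for page in range(1, 8):   # pages 1 through 7
            page_url = url + '?page=' + str(page)
            current_articles.append({'title': 'Page ' + str(page),
                                     'url': page_url,
                                     'description': '', 'date': ''})
        return current_articles
Your blog has no such pattern, so you scrape the links off the index page instead.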
Look at this:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class FIELDSTREAM(BasicNewsRecipe):
    title = 'Field and Stream'
    __author__ = 'Starson17'
    description = 'Hunting and Fishing and Gun Talk'
    language = 'en'
    no_stylesheets = True
    publisher = 'Starson17'
    category = 'food recipes'
    use_embedded_content = False
    oldest_article = 24
    remove_javascript = True
    remove_empty_feeds = True
    #cover_url = 'http://www.bsb.lib.tx.us/images/comics.com.gif'
    # recursions = 0
    max_articles_per_feed = 10
    INDEX = 'http://www.fieldandstream.com'

    def parse_index(self):
        feeds = []
        for title, url in [
            (u"Wild Chef", u"http://www.fieldandstream.com/blogs/wild-chef"),
            ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds

    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        # each article title on the blog index page sits in an <h2> tag
        for item in soup.findAll('h2'):
            print 'item is: ', item
            link = item.find('a')
            print 'the link is: ', link
            if link:
                # the hrefs are relative, so prepend the site root
                url = self.INDEX + link['href']
                title = self.tag_to_string(link)
                print 'the title is: ', title
                print 'the url is: ', url
                current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
        return current_articles
It does all the url scraping for the feed. (You can add more feeds if you want - just add more (title, url) tuples to the list in parse_index.) It's up to you to strip the junk with keep_only_tags or remove_tags.
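The usual pattern looks like this - the class names below are placeholders, so inspect the Wild Chef page source for the real ones:
Code:
    # Goes inside the recipe class. 'article-body', 'comments' and
    # 'sidebar' are placeholder class names - check the actual page
    # markup before using.
    keep_only_tags = [dict(name='div', attrs={'class':'article-body'})]
    remove_tags = [dict(name='div', attrs={'class':'comments'}),
                   dict(name='div', attrs={'class':'sidebar'})]
keep_only_tags throws away everything outside the matched tags, and remove_tags then deletes the matched tags from what's left.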