Quote:
Originally Posted by Starson17
Let's start at the top and look at the broad structure of what you're trying to do.
Here, look at this:
Code:
def parse_index(self):
    feeds = []
    for title, url in [
        (u"Wild Chef", u"http://www.fieldandstream.com/blogs/wild-chef"),
    ]:
        articles = self.make_links(url)
        if articles:
            feeds.append((title, articles))
    return feeds

def make_links(self, url):
    title = 'Temp'
    current_articles = []
    soup = self.index_to_soup(url)
    print 'The soup is: ', soup
    # -- parsing section starts here (adapt to the page's actual markup) --
    # Walk each <a> tag (i.e., each article) on this page. You may need to
    # narrow this to the <li> tags that hold the article links.
    for a in soup.findAll('a', href=True):
        page_url = a['href']
        title = self.tag_to_string(a)
        # Print statements to track what you're doing and make sure it works:
        print 'the url is: ', page_url
        print 'the title is: ', title
        # Description and date can stay blank for testing.
        current_articles.append({'title': title, 'url': page_url,
                                 'description': '', 'date': ''})
    # -- parsing section ends here --
    return current_articles
Edit:
Start with the above. It gives you the basic structure, since your code didn't appear to reach the page you needed to parse. The code above should get you there (check the printed soup in your output file to confirm). Once the soup is being printed, we can work on the parsing. You should be able to adapt your own parsing code (as you posted) to replace the parsing section above.
Note that you can leave description and date blank for testing. You only need to parse a title (and you can even set that to a constant) and just parse out the article URL.
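If you want to prototype the title/URL extraction outside of Calibre first, you can do it with the standard library alone. This is only a sketch of the same idea, not the recipe API: `ArticleLinkParser` is a hypothetical helper name, it runs under Python 3's `html.parser`, and a real recipe would use `index_to_soup` and BeautifulSoup instead:

```python
from html.parser import HTMLParser

class ArticleLinkParser(HTMLParser):
    """Collect article dicts from <a href=...> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished {'title', 'url', 'description', 'date'} dicts
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = dict(attrs)
            if 'href' in attrs:
                self._href = attrs['href']
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append({'title': ''.join(self._text).strip(),
                               'url': self._href,
                               'description': '', 'date': ''})
            self._href = None

html = '<ul><li><a href="/blogs/wild-chef/post-1">First post</a></li></ul>'
parser = ArticleLinkParser()
parser.feed(html)
print(parser.links)
```

On the sample HTML this prints a single dict with title `First post` and url `/blogs/wild-chef/post-1`, which is exactly the shape `current_articles.append(...)` expects, so once your selectors work here they should drop into `make_links` with little change.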