08-30-2010, 10:09 AM   #2563
Starson17
Quote:
Originally Posted by Starson17
Let's start at the top and look at the broad structure of what you're trying to do.
Here, look at this:
Code:
    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Wild Chef", u"http://www.fieldandstream.com/blogs/wild-chef"),
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        # pseudocode starts here
        # Placeholder loop: it grabs every <a> tag that has an href. Replace it
        # with your own parsing so that it matches only the article links.
        for link in soup.findAll('a', href=True):
            # parse out the title, url, description and date using BS
            title = self.tag_to_string(link)
            page_url = link['href']
            # print statements to track what you're doing
            # and make sure it's working
            print 'the title is: ', title
            print 'the url is: ', page_url
            # now append the parsed stuff
            current_articles.append({'title': title, 'url': page_url, 'description':'', 'date':''})
        # pseudocode ends here
        return current_articles

Edit:
Start with the above. It gives you the basic structure. Your code didn't appear to reach the page you needed to parse, and the code above should get you there (check the printed soup in your output file to confirm). Once the soup is being printed, we can work on the pseudocode. You should be able to adapt the parsing code you posted to replace the pseudocode above.
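If you want to try the parsing step outside calibre first, something like the sketch below may help. It is only an illustration under a few assumptions: each article is a plain <a href> link, and Python 3's standard-library HTMLParser stands in for the BeautifulSoup calls you would make on the soup from index_to_soup. The names LinkCollector and collect_articles, the sample markup and the example.com URL are all made up for the example; the real page will need its own selection logic.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None


def collect_articles(html, base_url):
    """Build the kind of article list that make_links is expected to return."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    current_articles = []
    for title, href in parser.links:
        page_url = urljoin(base_url, href)  # resolve relative links
        print('the title is:', title)
        print('the url is:', page_url)
        current_articles.append(
            {'title': title or 'Temp', 'url': page_url,
             'description': '', 'date': ''})
    return current_articles


# Made-up sample markup and base URL, only to exercise the function.
SAMPLE = '<ul><li><a href="/blogs/wild-chef/post-1">First post</a></li></ul>'
articles = collect_articles(SAMPLE, 'http://example.com/blogs/wild-chef')
```

The print calls mirror the tracking statements suggested in the pseudocode comments, so you can see each title and URL as it is extracted.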

Note that you can leave description and date blank for testing. The only thing you really need to parse out is the article URL. Each article needs a title as well, but you can set that to a constant for now.
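One way to confirm that a test run produced the shape used in the code above is a small checker along these lines. It is not part of calibre: check_feeds is a hypothetical helper, and the sample entry uses a placeholder URL. It only assumes that the result is a list of (feed title, article list) pairs and checks that every article has a non-empty title and URL, while description and date may stay blank.

```python
def check_feeds(feeds):
    """Return a list of problems found in a parse_index-style result.

    Hypothetical helper for testing; an empty list means nothing was flagged.
    """
    problems = []
    for feed_title, articles in feeds:
        if not articles:
            problems.append('feed %r has no articles' % feed_title)
        for i, article in enumerate(articles):
            for key in ('title', 'url'):
                if not article.get(key):
                    problems.append(
                        'feed %r, article %d: missing %s' % (feed_title, i, key))
    return problems


# Constant title, one URL, blank description and date: enough for a first test.
feeds = [('Wild Chef', [
    {'title': 'Temp', 'url': 'http://example.com/article-1',
     'description': '', 'date': ''},
])]
problems = check_feeds(feeds)
print(problems or 'feeds look well-formed')
```

Running it on the list returned by parse_index before handing it to calibre should make a missing URL or an empty feed easier to spot than reading the raw output.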

Last edited by Starson17; 08-30-2010 at 10:57 AM.