Old 08-29-2010, 08:59 PM   #2557
Starson17
Wizard
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm View Post
So I understand that that is looking for all h1 tags with class="sectionTitle",
but in my case I only have an href inside the h2 tags. Sorry for all the questions, just trying to learn.
Yes, and the h2 is inside a <div class="item-list"> element, etc. You would use Beautiful Soup to do this.

Let me refer to my GoComics recipe, as I'm more familiar with it.
Code:
    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"2 Cows and a Chicken", u"http://www.gocomics.com/2cowsandachicken"),
                            (u"The Argyle Sweater", u"http://www.gocomics.com/theargylesweater"),
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds

Above are pairs of (title for a feed, URL to scrape for articles). You would put yours in as:
Code:
                            (u"Wild Chef", u"http://www.fieldandstream.com/blogs/wild-chef"),
Now normally, make_links(url) would scrape the page for articles when you pass it the url. In my case I didn't need to scrape, because I could construct the article urls (one per comic) and build the titles directly, but you can write your make_links(url) to scrape the URL.

You'd start by getting a soup for the url:
soup = self.index_to_soup(url)
then start scraping out the article urls and titles, etc. As you said, you have an "href inside the h2 tags": the article title is really the string (NavigableString) inside the <a> tag, the url is the href attribute of the <a> tag (with a base URL stuck in front), and the summary is there too.
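As a rough sketch of that step (the markup here is assumed, modeled on the <div class="item-list"> / h2 / <a> layout described above, and the base URL is hypothetical), pulling the title and url out of one h2 with Beautiful Soup might look like:
Code:
```python
from bs4 import BeautifulSoup

# Assumed markup, modeled on the div/h2/a structure described above
html = '''
<div class="item-list">
  <h2><a href="/blogs/wild-chef/some-post">A Wild Chef Post</a></h2>
</div>
'''

base_url = 'http://www.fieldandstream.com'  # hypothetical base URL

soup = BeautifulSoup(html, 'html.parser')
h2 = soup.find('div', attrs={'class': 'item-list'}).find('h2')
a = h2.find('a')

title = a.get_text(strip=True)  # the NavigableString inside the <a> tag
url = base_url + a['href']      # the href attribute, base URL stuck in front
```
In a recipe, the soup would come from self.index_to_soup(url) instead of a literal string.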

All of those are easily obtained using Beautiful Soup from the soup of the url given above. Scrape the url, build your article list for that feed, then it gets returned to parse_index and the next feed gets processed, etc.
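Putting those pieces together, a scraping make_links could look something like the sketch below. It's written here as a plain function taking an already-made soup so it stands on its own; in a real recipe it would be a method that calls self.index_to_soup(url) itself. The markup and base URL are assumptions, but the return format (a list of dicts with 'title', 'url', 'date', 'description') is what parse_index expects for each feed:
Code:
```python
from bs4 import BeautifulSoup

def make_links(soup, base_url):
    # Build the list of article dicts that parse_index expects;
    # each dict needs at least 'title' and 'url'.
    articles = []
    container = soup.find('div', attrs={'class': 'item-list'})
    if container is None:
        return articles
    for h2 in container.find_all('h2'):
        a = h2.find('a')
        if a is None:
            continue
        articles.append({
            'title': a.get_text(strip=True),
            'url': base_url + a['href'],
            'date': '',
            'description': '',
        })
    return articles

# Hypothetical sample markup, just to exercise the function
sample = '''
<div class="item-list">
  <h2><a href="/blogs/wild-chef/post-1">Post One</a></h2>
  <h2><a href="/blogs/wild-chef/post-2">Post Two</a></h2>
</div>
'''
arts = make_links(BeautifulSoup(sample, 'html.parser'),
                  'http://www.fieldandstream.com')
```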

I'm glad to see you working on a recipe (calibre-type) for recipes (food-type) - they're my favorite.

Last edited by Starson17; 08-29-2010 at 09:02 PM.