Quote:
Originally Posted by TonytheBookworm
so I understand that that is looking for all h1 tags with a class=sectionTitle
but in my case I only have a href inside the h2 tags.  sorry for all the questions just trying to learn 
Yes, and the h2 is inside a <div class="item-list"> element, etc. You would use Beautiful Soup to do this.
Let me refer to my GoComics recipe, as I'm more familiar with it.
Above are the pairs of a title for a feed and a URL to scrape for articles. You would stick this in:
Code:
(u"Wild Chef", u"http://www.fieldandstream.com/blogs/wild-chef"),
Normally, make_links(url) would scrape for the articles when you pass it the URL. In my case I didn't need to scrape; I could just figure out the article URLs (each comic) and build the titles. But you can write your make_links(url) to scrape the URL.
You'd start with getting a soup for the URL:
Code:
soup = self.index_to_soup(url)
Then start scraping out the article URLs, titles, etc. As you said, you have an "href inside the h2 tags". The article title is really the string (NavigableString) inside the <a> tag, the URL is the href attribute of the <a> tag (with a base URL stuck in front), and the summary is there too.
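To make that concrete, here is a rough sketch of the scraping step, not a tested recipe. The helper name articles_from_soup is made up for illustration, and I'm assuming each article link is an <a> inside an <h2> inside the <div class="item-list">. I don't know exactly where the summary sits on that page, so the description is left empty here. The dict keys are the ones calibre expects for articles returned from parse_index.

```python
from urllib.parse import urljoin


def articles_from_soup(soup, base_url):
    """Collect article dicts from <a> tags inside <h2> tags.

    `soup` is what self.index_to_soup(url) returns; `base_url` is stuck
    in front of relative hrefs. The function name is hypothetical.
    """
    articles = []
    seen = set()
    # Assumption: the article headings live in <div class="item-list">.
    for div in soup.findAll('div', attrs={'class': 'item-list'}):
        for h2 in div.findAll('h2'):
            link = h2.find('a')
            if link is None or not link.get('href'):
                continue  # an <h2> without a usable link is not an article
            if link.string is None:
                continue  # no single text string to use as the title
            url = urljoin(base_url, link['href'])
            if url in seen:
                continue  # nested item-list divs would repeat the link
            seen.add(url)
            articles.append({
                'title': str(link.string).strip(),
                'url': url,
                'date': '',
                'description': '',  # summary location on the page not assumed
                'content': '',
            })
    return articles
```

Inside the recipe, make_links(url) could then be little more than getting the soup with self.index_to_soup(url) and returning articles_from_soup(soup, url). Newer Beautiful Soup versions spell findAll as find_all, but the old name still works.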
All of those are easily obtained using Beautiful Soup from the soup of the url given above. Scrape the url, build your article list for that feed, then it gets returned to parse_index and the next feed gets processed, etc.
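The feed-by-feed loop described here can be sketched roughly like this. It's written as a plain function so it runs on its own; build_feeds is a made-up name. In an actual recipe this would be the body of parse_index, with self.make_links doing the scraping.

```python
def build_feeds(feed_pairs, make_links):
    """Return [(feed_title, [article, ...]), ...], the shape parse_index returns.

    `feed_pairs` is a list of (title, url) tuples; `make_links` is any
    callable that takes a URL and returns a list of article dicts.
    """
    feeds = []
    for title, url in feed_pairs:
        articles = make_links(url)
        if articles:  # skip feeds whose page yielded no articles
            feeds.append((title, articles))
    return feeds
```

You would call it with the list of pairs, e.g. [(u"Wild Chef", u"http://www.fieldandstream.com/blogs/wild-chef")], and your own make_links.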
I'm glad to see you working on a recipe (calibre-type) of recipes (food-type) - they're my favorite.