Old 08-30-2010, 09:32 AM   #2562
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm
Alright, I looked at some samples and I also saw what you had done. I went with the second method you mentioned, making my own links. Well, it's obviously not working. Here is what I ended up with. If you have the time, could you look at this and kinda shed some more light on it for me? Thanks.
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class FIELDSTREAM(BasicNewsRecipe):

    title      = 'FIELD AND STREAM BLOGS'
    __author__ = 'Tony Stegall'
    description = 'Hunting and Fishing and Gun Talk'
    INDEX = 'http://www.fieldandstream.com/blogs'
    language = 'en'
    no_stylesheets = True
    def parse_index(self):
        soup = self.index_to_soup(url)
        feeds =[]
        #array to hold the feeds
        for mainsec in soup.findAll('div',  attrs={'class':'item-list'}):
            #above findall instances where the div tag has the attribute of item-list
            section_title ='Wild Chef'
            #hard code the section title to be appended to the feed
            articles = []
            #array to hold the article content
            #-----------------------------------------------------------------------
            #trying to find all the h2 tags and parse the <a> for the title
            #not really understanding how this is done though
#-----------------------------------------------------------------------
            h = feedhead.find(['h2'])
            #find the h2 tag that has the title embedded inside it with an anchor tag
            
            a = mainsec.find('a', href=True)
            title = self.tag_to_string(a)
            myurl = a['href']
            if myurl.startswith('/'):
               myurl = 'http://www.fieldandstream.com' + url
               
            #--end of parse for title-----------------------------------------------------
#-----------------------------------------------------------------------------------------------------------
            #face the same problem with the p tags.  I have a <p> tag then a <em> then in some cases another <p>
            #I want to get the content of the <p> within the <p> but not sure how :( 
            #example: 
            #   <p>
            #      <p> some blah blah blah </p>
            #   so basically all i want is all the text within the <div class=teaser> but not sure how :(
            for teaser in mainsec.findall('div',  attrs={'class':'teaser'}):
                p = post.find('p')
                desc = None
                if p is not None:
                    desc = self.tag_to_string(p)
            
                articles.append({'title':title, 'url':myurl, 'description':desc,
                    'date':''}) 
            #--------------------end of description parse from teaser-----------------------------------------------
            
             
            feeds.append((section_title, articles))  
            #put all articles for the section inside the feeds 
            return feeds
Let's start at the top and look at the broad structure of what you're trying to do. I suggested you have parse_index work with data pairs, each composed of a title for a feed and a url for that feed, and then write a separate function that takes the URL and parses that page. I suggested you start that function with:
soup = self.index_to_soup(url)
where "url" is the url passed to that function. In your code, you've taken the line that should have been inside the called function and made it the first line of parse_index itself, but at that point "url" isn't defined, so you never get a soup to work with.

To write recipes effectively, you need to use print statements to see what's happening. Put
print 'the soup is: ', soup
after that line to see what the soup is, and you'll see that url is not defined and there is no soup.
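For example, dropped into the make_links sketch above, the print sits right after the call that builds the soup:
Code:
    def make_links(self, url):
        soup = self.index_to_soup(url)
        # see exactly what came back before trying to parse it
        print 'the soup is: ', soup
        current_articles = []
        # ... parsing goes here ...
        return current_articles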
soup = self.index_to_soup("http://www.fieldandstream.com/blogs/wild-chef")
However, doing it this way will only give you one feed - the one for Wild Chef. Doing it the way GoComics does will let you set up multiple feeds.
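In the sketch above, multiple feeds just means more pairs in feed_pairs, since each pair becomes its own section. The second entry below is only there to show the pattern, not a real address, so check the site for the blogs you actually want:
Code:
    feed_pairs = [
        ('Wild Chef',    'http://www.fieldandstream.com/blogs/wild-chef'),
        ('Another Blog', 'http://www.fieldandstream.com/blogs/another-blog'),
    ]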