Old 08-30-2010, 04:48 PM   #2567
TonytheBookworm
Starson17,
I went back to the GoComics recipe and tried to follow what you were doing, using what you said about printing the title, URL and so forth. My current code gets the soup, as the output.txt file shows, but then it dies with an "index out of range" error. I thought that was why you put the number of pages to get in a range. I set mine to 7, as you can see in my code, but I still get "index out of range". I feel like the Little Engine That Could, or better yet the ant at the rubber tree plant. I've got high hopes, haha.

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class FIELDSTREAM(BasicNewsRecipe):

    title      = 'FIELD AND STREAM BLOGS'
    __author__ = 'Tony Stegall'
    description = 'Hunting and Fishing and Gun Talk'
    INDEX = 'http://www.fieldandstream.com/blogs'
    language = 'en'
    #------------------------------------------------------
    #variables
    num_pages_to_get = 7
    #-------------------------------------------------------
    
    no_stylesheets = True

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Wild Chef", u"http://www.fieldandstream.com/blogs/wild-chef"),
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        page_soup = self.index_to_soup(url)
        print 'The soup is: ', page_soup
       
        pages = range(1, self.num_pages_to_get+1)   # put this in to start with the first page and then go up to 7 increment by 1
        for page in pages: 
            if page_soup: 
                try:
                    strip_title = page_soup.h2.a.string  # try to pull the text out of the h2 tag's link
                except:
                    strip_title = 'Error - no page_soup.h2.a.string'  # fall back to a placeholder if it isn't found
                try:
                    date_title = page_soup.find('ul', attrs={'class': 'first even'}).li.string  # get the date from the li tag text
                except:
                    date_title = 'Error - no page_soup.h2.li.string'
                title = strip_title + ' - ' + date_title  # piece the title together here
                try:
                    url = page_soup.h2.a['href']  # try to get the url from the h2 tag's <a>
                    break
                except:
                    continue
                continue
               
                print 'the title is: ', title
                print 'the page_url is: ', page_url
           
                current_articles.append({'title': title, 'url': page_url, 'description':'', 'date':''}) # append all this
        
        
        
        return current_articles


This is like playing Battleship: I keep firing and getting close, but never a direct hit.
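
A note on the loop in make_links, offered as a guess rather than a diagnosis: the break fires as soon as a URL is found, and the continue lines skip everything after them, so the append call is never reached and the method returns an empty list. The append also reads page_url, while the value was stored in url. Whether that empty list is what produces the index error is an assumption. Below is a minimal standalone sketch of what the loop appears to be aiming for. It assumes each post has already been reduced to a dict with hypothetical 'title', 'date' and 'href' keys (stand-ins for the soup lookups), it does not use calibre or BeautifulSoup, and the sample data and example.com URLs are made up.

```python
def make_links(entries, limit):
    """Build calibre-style article dicts from at most `limit` entries.

    `entries` is a hypothetical stand-in for the per-post data scraped from
    the soup: each item is a dict that may hold 'title', 'date' and 'href'.
    """
    articles = []
    for entry in entries[:limit]:
        page_url = entry.get('href')
        if not page_url:
            # No link means nothing to download, so skip this entry only.
            continue
        # Missing pieces get a placeholder instead of stopping the loop.
        strip_title = entry.get('title') or 'Error - no title'
        date_title = entry.get('date') or 'Error - no date'
        title = strip_title + ' - ' + date_title
        # The append is reached for every entry that has a link; there is
        # no break or continue sitting in front of it.
        articles.append({'title': title, 'url': page_url,
                         'description': '', 'date': ''})
    return articles


# Made-up sample data for illustration only.
sample = [
    {'title': 'Post A', 'date': 'Aug 30', 'href': 'http://example.com/a'},
    {'title': 'Post B', 'date': None, 'href': 'http://example.com/b'},
    {'title': 'Post C', 'date': 'Aug 28'},  # no link, gets skipped
]
articles = make_links(sample, 7)
print(len(articles))
```

In the real recipe the same shape would apply, with the dict lookups replaced by the soup lookups and the same variable name used for the URL in both the assignment and the append.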