08-30-2010, 07:31 PM   #2570
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm
Starson17,
I went back to the GoComics recipe and tried to follow what you were doing, using what you said about printing the title, URL, and so forth. The code I have currently gets the soup, as shown in the output.txt file, but then it craps out saying the index is out of range. I thought that was why you put the number of pages to get in a range field. I set mine to 7, as you can see in my code, but again I get index out of range... I feel like the Little Engine that Could, or better yet the ant at the rubber tree plant. I've got high hopes, haha.
This is like playing Battleship: I'm firing and firing and getting close, but never a direct hit.
You don't need the range - that was used for my special case, where the page URLs followed a predictable pattern I could compute in a loop. A hypothetical sketch (the URL pattern below is made up, just to show the idea):
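Code:
# Hypothetical sketch -- this URL pattern is invented, purely to illustrate.
# When page URLs follow a known pattern, you can just build them in a loop:
for page in range(1, 8):  # pages 1 through 7
    url = 'http://www.example.com/blog/page/%d' % page

Your site's article URLs aren't predictable like that, so you need to scrape them instead.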
Look at this:
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class FIELDSTREAM(BasicNewsRecipe):
    title                 = 'Field and Stream'
    __author__            = 'Starson17'
    description           = 'Hunting and Fishing and Gun Talk'
    language              = 'en'
    publisher             = 'Starson17'
    category              = 'food recipes'
    use_embedded_content  = False
    no_stylesheets        = True
    oldest_article        = 24
    remove_javascript     = True
    remove_empty_feeds    = True
    #cover_url            = 'http://www.bsb.lib.tx.us/images/comics.com.gif'
    #recursions           = 0
    max_articles_per_feed = 10
    INDEX = 'http://www.fieldandstream.com'

    def parse_index(self):
        # Build the feed list: one (title, article-list) tuple per feed
        feeds = []
        for title, url in [
                           (u"Wild Chef", u"http://www.fieldandstream.com/blogs/wild-chef"),
                          ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds

    def make_links(self, url):
        # Scrape the blog index page for links to individual articles
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        for item in soup.findAll('h2'):  # each headline on this page sits in an <h2>
            print 'item is: ', item
            link = item.find('a')
            print 'the link is: ', link
            if link:
                url   = self.INDEX + link['href']  # hrefs are relative, so prepend the site root
                title = self.tag_to_string(link)
                print 'the title is: ', title
                print 'the url is: ', url
                current_articles.append({'title': title, 'url': url, 'description': '', 'date': ''})
        return current_articles


It does all the URL scraping for the feed. (You can add more feeds if you want.) It's up to you to strip out the junk with keep_only_tags or remove_tags - see the sketch below.
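keep_only_tags and remove_tags are lists of tag specs that calibre matches against each downloaded article page: everything outside keep_only_tags is dropped first, then anything matching remove_tags is stripped from what's left. The class names below are placeholders - view the page source of a Wild Chef article to find the real ones:

Code:
    # A minimal sketch -- these class names are hypothetical; inspect the
    # actual article HTML for the real container and junk classes.
    keep_only_tags = [
        dict(name='div', attrs={'class': 'article-body'}),  # placeholder
        ]
    remove_tags = [
        dict(name='div', attrs={'class': 'comments'}),      # placeholder
        dict(name='div', attrs={'class': 'sidebar'}),       # placeholder
        ]

These go at class level in the recipe, alongside max_articles_per_feed and friends.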

Last edited by Starson17; 08-31-2010 at 02:17 PM.