Old 09-14-2010, 07:53 PM   #2725
bhandarisaurabh

Quote:
Originally Posted by TonytheBookworm
Alright, here is the thing: that site has tons of articles. The code below should work now. It walks through the issue links from the top down, and I have max_articles_per_feed set to 50, so you will get at most 50 articles before it stops. If you want 3000, put in 3000 and hope for the best.

There may well be a more efficient way of doing this; I personally do not know it. Someone with more knowledge than I have might also know how to group the articles by their actual dates (a rough sketch of one possible approach is included after the code below). I tested the current code on my end and received 50 unique articles, starting from the issue dated 9-15-2010.

I have pretty much done everything I know how to do on this recipe at this point and consider it "working but hobbling along". If anyone else cares to take a stab at it and gets it working 100 percent, please share so I can learn from it.

Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re


class DownToEarthArchive(BasicNewsRecipe):
    title       = 'Down To Earth Archive'
    __author__  = 'Tonythebookworm'
    description = 'Archived issues of Down To Earth magazine'
    language    = 'en'
    publisher   = 'Tonythebookworm'
    category    = 'news'
    use_embedded_content = False
    no_stylesheets       = True
    oldest_article       = 365
    remove_javascript    = True
    remove_empty_feeds   = True
    masthead_url         = 'http://downtoearth.org.in/themes/DTE/images/DownToEarth-Logo.gif'

    max_articles_per_feed = 50  # only the first 50 articles; raise this (e.g. to 3000) for more
    INDEX = 'http://downtoearth.org.in'

    # The print statements are left in for debugging; feel free to remove them.
    # Only the 2010 archive is parsed here. Other years can be added to the
    # list in parse_index() and should work the same way.

    def parse_index(self):
        feeds = []
        for title, url in [
                (u'2010 Archives', u'http://downtoearth.org.in/archives/2010'),
                ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds

    def make_links(self, url):
        current_articles = []
        soup = self.index_to_soup(url)
        # Each issue listed on the archive page sits inside a 'views-field-nothing-2' div.
        for item in soup.findAll('div', attrs={'class': 'views-field-nothing-2'}):
            link = item.find('a')
            if link is None:
                continue
            linkhref = link['href']
            date = linkhref.split('/')[3]  # the date portion of the issue URL, as before
            print 'DATE IS :', date
            print 'the link is: ', link

            # Fetch the issue page and collect every article (/node/...) link on it,
            # skipping the 'Next Issue' / 'Previous Issue' navigation links.
            issue_soup = self.index_to_soup(self.INDEX + linkhref)
            for content in issue_soup.findAll('div', attrs={'id': 'PageContent'}):
                for node in content.findAll('a', href=re.compile('/node')):
                    if re.search('Next Issue', str(node)) or re.search('Previous Issue', str(node)):
                        continue
                    url   = node['href']
                    title = self.tag_to_string(node)
                    print 'the title is: ', title
                    print 'the url is: ', url
                    current_articles.append({'title': date + '--' + title,
                                             'url': url,
                                             'description': '',
                                             'date': ''})
        return current_articles

    def print_version(self, url):
        # Article links look like /node/<id>; the printer-friendly page lives at /print/<id>.
        split1 = url.split('/')
        print 'THE SPLIT IS: ', split1
        print_url = 'http://downtoearth.org.in/print/' + split1[2]
        print 'THIS URL WILL PRINT: ', print_url
        return print_url
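
On grouping by dates: here is a rough, untested sketch of how the same scraping steps could be rearranged so that every issue date becomes its own feed instead of everything landing in one "2010 Archives" list. The class name and the div/href selectors are simply carried over from the recipe above; treat it as a starting point, not a verified recipe.

Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re


class DownToEarthByDate(BasicNewsRecipe):
    # Hypothetical variant of the recipe above: one feed per issue date.
    title = 'Down To Earth Archive (grouped by date)'
    language = 'en'
    no_stylesheets = True
    remove_javascript = True
    remove_empty_feeds = True
    max_articles_per_feed = 50
    INDEX = 'http://downtoearth.org.in'

    def parse_index(self):
        feeds = []
        soup = self.index_to_soup(self.INDEX + '/archives/2010')
        for item in soup.findAll('div', attrs={'class': 'views-field-nothing-2'}):
            link = item.find('a')
            if link is None:
                continue
            date = link['href'].split('/')[3]  # same date-from-URL assumption as above
            articles = []
            issue_soup = self.index_to_soup(self.INDEX + link['href'])
            for content in issue_soup.findAll('div', attrs={'id': 'PageContent'}):
                for node in content.findAll('a', href=re.compile('/node')):
                    text = self.tag_to_string(node)
                    if 'Next Issue' in text or 'Previous Issue' in text:
                        continue
                    articles.append({'title': text, 'url': node['href'],
                                     'description': '', 'date': date})
            if articles:
                feeds.append((date, articles))  # the issue date becomes the feed name
        return feeds

print_version() would stay exactly as in the recipe above, since the article links are still the relative /node/<id> URLs.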
Thanks, it worked like a charm. It just fetched two extra articles from the previous issue; the rest was fine.
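
In case it helps, one rough guess at those two extra articles: the issue pages appear to cross-link a couple of nodes from the neighbouring issue, so the same article can be picked up more than once. A small, untested helper like the one below (purely a suggestion, it is not part of the recipe above) would at least keep each URL only once; make_links() would then return dedupe_articles(current_articles) instead of the raw list.

Code:
def dedupe_articles(articles):
    # Keep only the first occurrence of each URL, so an article that is also
    # listed on the neighbouring issue's page is not appended twice.
    seen = set()
    unique = []
    for art in articles:
        if art['url'] in seen:
            continue
        seen.add(art['url'])
        unique.append(art)
    return unique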