Quote:
Originally Posted by TonytheBookworm
Alright, here is the thing. That site has TONS of articles. The code below should work now; it walks the article links from the top down. I have max_articles_per_feed set to 50, so you will get 50 articles at most and then it will stop. If you want 3000, put in 3000 and hope for the best.
There might very well be a more effective way of doing this; I personally do not know it. Also, someone with more knowledge than I have might know how to group the articles by their actual dates. I tested the current code on my end and received 50 unique articles, the earliest one being from the 9-15 2010 issue.
I have pretty much done all I know how to do on this recipe at this point and consider it "working but hopping along". If anyone else cares to take a stab at it and gets it working 100 percent, please share so I can learn from it.
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re

class FIELDSTREAM(BasicNewsRecipe):
    title                 = 'Down To Earth Archive'
    __author__            = 'Tonythebookworm'
    description           = ''
    language              = 'en'
    publisher             = 'Tonythebookworm'
    category              = ''
    use_embedded_content  = False
    no_stylesheets        = True
    oldest_article        = 365
    remove_javascript     = True
    remove_empty_feeds    = True
    masthead_url          = 'http://downtoearth.org.in/themes/DTE/images/DownToEarth-Logo.gif'
    max_articles_per_feed = 50  # only gets the first 50 articles
    INDEX                 = 'http://downtoearth.org.in'

    # I HAVE LEFT THE PRINT STATEMENTS IN HERE FOR DEBUGGING PURPOSES.
    # Feel free to remove them.
    # This will only parse the 2010 archives. The other ones can be added and SHOULD work.
    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"2010 Archives", u"http://downtoearth.org.in/archives/2010"),
                          ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds

    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        #print 'The soup is: ', soup
        for item in soup.findAll('div', attrs={'class': 'views-field-nothing-2'}):
            #print 'item is: ', item
            link = item.find('a')
            if link:
                # the issue date is pulled from the link's path (4th component of the href)
                date = link['href'].split("/")[3]
                print 'DATE IS :', date
                print 'the link is: ', link
                url = self.INDEX + link['href']
                soup = self.index_to_soup(url)
                #print 'NEW SOUP IS: ', soup
                for items in soup.findAll('div', attrs={'id': 'PageContent'}):
                    for nodes in items.findAll('a', href=re.compile('/node')):
                        if nodes is not None and not re.search('Next Issue', str(nodes)) and not re.search('Previous Issue', str(nodes)):
                            print 'LINK2 EX!!! and here is that link: ', nodes['href']
                            url = nodes['href']
                            title = self.tag_to_string(nodes)
                            print 'the title is: ', title
                            print 'the url is: ', url
                            current_articles.append({'title': date + '--' + title, 'url': url, 'description': '', 'date': ''})  # append all this
        return current_articles

    def print_version(self, url):
        split1 = url.split("/")
        print 'THE SPLIT IS: ', split1
        print_url = 'http://downtoearth.org.in/print' + '/' + split1[2]
        print 'THIS URL WILL PRINT: ', print_url  # test print to see what url is returned
        return print_url
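The recipe above only walks the 2010 archive. Per the comment in the code, other years can be added by extending the list in parse_index. Here is a minimal, untested sketch; the 2008 and 2009 archive URLs are assumptions that simply follow the same /archives/<year> pattern as the verified 2010 one:
Code:
    def parse_index(self):
        feeds = []
        # Only the 2010 URL was verified in the post above; the other years are
        # assumed to follow the same /archives/<year> pattern.
        for title, url in [
                            (u"2010 Archives", u"http://downtoearth.org.in/archives/2010"),
                            (u"2009 Archives", u"http://downtoearth.org.in/archives/2009"),
                            (u"2008 Archives", u"http://downtoearth.org.in/archives/2008"),
                          ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds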
Thanks, it worked like a charm. It just fetched 2 extra articles from the past issue; the rest was fine.
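If those two extras are the previous issue's article links showing up again on a later issue page, one possible (untested) guard is to deduplicate by URL before appending, e.g. with a seen set inside make_links:
Code:
    def make_links(self, url):
        current_articles = []
        seen = set()  # node URLs already collected, to drop repeats from other issue pages
        soup = self.index_to_soup(url)
        for item in soup.findAll('div', attrs={'class': 'views-field-nothing-2'}):
            link = item.find('a')
            if link:
                date = link['href'].split("/")[3]
                issue_soup = self.index_to_soup(self.INDEX + link['href'])
                for items in issue_soup.findAll('div', attrs={'id': 'PageContent'}):
                    for nodes in items.findAll('a', href=re.compile('/node')):
                        if re.search('Next Issue', str(nodes)) or re.search('Previous Issue', str(nodes)):
                            continue
                        node_url = nodes['href']
                        if node_url in seen:
                            continue  # already picked up from an earlier issue page
                        seen.add(node_url)
                        title = self.tag_to_string(nodes)
                        current_articles.append({'title': date + '--' + title, 'url': node_url, 'description': '', 'date': ''})
        return current_articles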