Code:
import urllib2
from BeautifulSoup import BeautifulSoup
from calibre.web.feeds.news import BasicNewsRecipe
class Counterpunch(BasicNewsRecipe):
    '''
    Parses counterpunch.com for articles
    '''
    def parse_index(self):
        feeds = []
        title, url = 'Counterpunch', 'http://www.counterpunch.com'
        articles = self.parse_page(url)
        if articles:
            feeds.append((title, articles))
        return feeds
    def parse_page(self, url):
        fd = urllib2.urlopen(url)
        soup = BeautifulSoup(fd, fromEncoding='iso-8859-1')
        articles = []
        current_date = ''
        #Gets all dates and entries in page order, e.g. date, list of articles for date, next date, next list of articles
        #first expression gets entries, second gets dates
        dates_and_articles = soup.findAll(lambda tag: (tag.name == 'p' and
            tag.attrs == [(u'class', u'style2')] and
            len(tag) == 4 and
            'Website of the' not in tag.decode('utf-8')) or
            (tag.name == 'font' and
            tag.attrs == [(u'color', u'#990000'), (u'size', u'-1')]))
        for tag in dates_and_articles:
            #if 'Today\'s\n Stories' in tag.contents:
            if tag.name == 'p':
                #logic to deal with different ways names are printed (color difference I believe)
                if tag.find('span', {'class': 'style1'}):
                    author = tag.contents[0].contents[0] + ': '
                    url = 'http://www.counterpunch.com/' + tag.contents[3].attrs[0][1]
                else:
                    author = tag.contents[0] + ': '
                    url = 'http://www.counterpunch.com/' + tag.contents[3].attrs[0][1]
                title = author + str(tag.contents[3].contents[0])
                articles.append({'title': title, 'url': url, 'description':'', 'date': current_date})
            #if new date, update current_date
            elif tag.name == 'font':
                current_date = tag.contents[0]
                #print('the date is {0}').format(current_date)
        #cut just one day's articles for clearer, quicker debugging
        articles = [a for a in articles if a['date'] == 'October 11, 2010']
        return articles
#for debugging on the cmd
#c = Counterpunch()
#print c.parse_index()
This is the first recipe I have written.
It is for a site that has no RSS feed. The articles are listed in a table at the side of the page, separated by date headings.
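A minimal sketch of that date-then-articles walk, independent of any HTML parsing. It assumes the page has already been reduced to a flat list of (kind, text) pairs in page order; the function name `group_by_date`, the 'date'/'article' labels, and the sample titles are made up for illustration.

```python
def group_by_date(items):
    """Attach the most recent date heading to each article that follows it.

    `items` is a flat list of (kind, text) pairs in page order, where kind is
    'date' or 'article' (illustrative labels, not taken from the site).
    """
    articles = []
    current_date = ''
    for kind, text in items:
        if kind == 'date':
            # A new heading applies to every article until the next heading.
            current_date = text
        elif kind == 'article':
            articles.append({'title': text, 'date': current_date})
    return articles


# Placeholder data standing in for the parsed sidebar table.
sample = [
    ('date', 'October 11, 2010'),
    ('article', 'Author A: First piece'),
    ('article', 'Author B: Second piece'),
    ('date', 'October 8, 2010'),
    ('article', 'Author C: Older piece'),
]
grouped = group_by_date(sample)
```

This is the same bookkeeping the recipe does with current_date; keeping it separate from the tag matching makes it easier to check on its own.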
I mocked it up as a .py file first and got it to a workable state where it prints a list of feeds on the command line.
I then made the few small changes needed to turn it into a recipe and tested it with 'ebook-convert counterpunch.recipe test --test -vv', but I get the traceback below:
Code:
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
1% Fetching feeds...
Traceback (most recent call last):
  File "/tmp/init.py", line 48, in <module>
  File "/home/kovid/build/calibre/src/calibre/ebooks/conversion/cli.py", line 254, in main
  File "/home/kovid/build/calibre/src/calibre/ebooks/conversion/plumber.py", line 836, in run
  File "/home/kovid/build/calibre/src/calibre/customize/conversion.py", line 216, in __call__
  File "/home/kovid/build/calibre/src/calibre/web/feeds/input.py", line 105, in convert
  File "/home/kovid/build/calibre/src/calibre/web/feeds/news.py", line 712, in download
  File "/home/kovid/build/calibre/src/calibre/web/feeds/news.py", line 837, in build_index
  File "/tmp/calibre_0.7.26_tmp_Ep1Dpi/calibre_0.7.26_IUpdj4_recipes/recipe0.py", line 15, in parse_index
    articles = self.parse_page(url)
  File "/tmp/calibre_0.7.26_tmp_Ep1Dpi/calibre_0.7.26_IUpdj4_recipes/recipe0.py", line 28, in parse_page
    dates_and_articles = soup.findAll(lambda tag: (tag.name == 'p' and
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 768, in findAll
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 332, in _findAll
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 890, in search
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 849, in searchTag
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 907, in _matches
  File "/tmp/calibre_0.7.26_tmp_Ep1Dpi/calibre_0.7.26_IUpdj4_recipes/recipe0.py", line 31, in <lambda>
    'Website of the' not in tag.decode('utf-8')) or
TypeError: 'NoneType' object is not callable
I assume it has something to do with the decode method. I have played with this for hours, and some of my changes produce a different traceback, but I still get no feeds from the recipe. The same code, called directly on the command line, gives me the feeds I need with no problem.
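One possible explanation, offered as an assumption rather than a confirmed diagnosis: BeautifulSoup 3 treats an unknown attribute on a tag as a search for a child tag of that name. If the BeautifulSoup version that calibre uses has no decode method on tags, tag.decode would evaluate to None, and calling it would raise exactly this TypeError. The sketch below imitates that fallback with a hypothetical FakeTag class; it is not the real BeautifulSoup code, and it is written for Python 3 so it runs without any third-party package.

```python
class FakeTag(object):
    """Hypothetical stand-in for a parsed tag; not the real BeautifulSoup class."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def find(self, name):
        # Return the first child tag with this name, or None if there is none.
        for child in self.children:
            if child.name == name:
                return child
        return None

    def __getattr__(self, attr):
        # Only reached when normal attribute lookup fails: treat the unknown
        # attribute as a child-tag lookup, so a missing method becomes None.
        if attr.startswith('__'):
            raise AttributeError(attr)
        return self.find(attr)


p = FakeTag('p', [FakeTag('span')])
error_message = None
try:
    p.decode('utf-8')  # FakeTag defines no decode, so p.decode is None
except TypeError as exc:
    error_message = str(exc)
```

If this is what happens inside calibre, converting the tag with str(tag) or unicode(tag) instead of tag.decode('utf-8') might sidestep the error, though that is untested here.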
Can anyone get it to run to grab the feeds for calibre?
Thanks