Old 11-05-2010, 01:29 AM   #1
ode
Member
Posts: 12
Karma: 10
Join Date: Oct 2010
Location: UK
Device: Kindle 3 WiFi, Kindle Paperwhite 2013
Recipe works when mocked up as a Python file, fails when converted to a recipe

Code:
import urllib2
from BeautifulSoup import BeautifulSoup
from calibre.web.feeds.news import BasicNewsRecipe

class Counterpunch(BasicNewsRecipe):
    '''
    Parses counterpunch.com for articles
    '''  
    def parse_index(self):
        feeds = []
        title, url = 'Counterpunch', 'http://www.counterpunch.com'
        articles = self.parse_page(url)
        if articles:
            feeds.append((title, articles))
        return feeds
			
    def parse_page(self, url):
        fd = urllib2.urlopen(url)
        soup = BeautifulSoup(fd, fromEncoding='iso-8859-1') 
        articles = []
        current_date = ''
        # Collect date headings and article entries in page order:
        # date, that date's articles, next date, its articles, and so on.
        # The first half of the lambda matches entries, the second matches dates.
        dates_and_articles = soup.findAll(lambda tag: (tag.name == 'p' and
                                          tag.attrs == [(u'class', u'style2')] and
                                          len(tag) == 4 and
                                          'Website of the' not in tag.decode('utf-8')) or
                                          (tag.name == 'font' and
                                          tag.attrs == [(u'color', u'#990000'), (u'size', u'-1')]))
        for tag in dates_and_articles:
            #if 'Today\'s\n Stories' in tag.contents:
            if tag.name == 'p':
                # the author name is marked up in two different ways (a color difference, I believe)
                if tag.find('span', {'class': 'style1'}):
                    author = tag.contents[0].contents[0] + ': '
                else:
                    author = tag.contents[0] + ': '
                # the article link is the fourth child of the <p> in both branches
                url = 'http://www.counterpunch.com/' + tag.contents[3].attrs[0][1]
                title = author + str(tag.contents[3].contents[0])
                articles.append({'title': title, 'url': url, 'description':'', 'date': current_date})
            #if new date, update current_date
            elif tag.name == 'font':
                current_date = tag.contents[0]
                #print('the date is {0}').format(current_date)
        # keep just one day's articles for clearer, quicker debugging
        articles = [a for a in articles if a['date'] == 'October 11, 2010']
        return articles
            
#for debugging on the cmd             
#c = Counterpunch()
#print c.parse_index()

This is the first recipe I have written.
It is for a site that has no RSS feed. The articles are listed in a table at the side of the page, separated by date headings.
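To make that layout concrete, here is a minimal sketch of the date-heading logic on its own, with no parsing involved. The `group_by_date` name and the (kind, value) pairs are made up for illustration; the real recipe walks BeautifulSoup tags instead.

```python
def group_by_date(rows):
    """Attach the most recent date heading to each article row.

    `rows` is a flat list of (kind, value) pairs in page order, where
    kind is 'date' or 'article'. This input shape is an assumption made
    for the sketch, not something the site or calibre provides.
    """
    articles = []
    current_date = ''
    for kind, value in rows:
        if kind == 'date':
            # a new heading applies to every article until the next heading
            current_date = value
        elif kind == 'article':
            articles.append({'title': value, 'date': current_date})
    return articles


# Made-up rows in the same order the page lists them.
rows = [
    ('date', 'October 11, 2010'),
    ('article', 'First author: first title'),
    ('article', 'Second author: second title'),
    ('date', 'October 8, 2010'),
    ('article', 'Third author: third title'),
]
articles = group_by_date(rows)
```

Each article ends up carrying the date of the heading above it, which is what parse_page does with current_date.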
I mocked it up as a plain .py file first and got it to a workable state where it prints a list of feeds on the command line.
I then made the few small changes needed to turn it into a recipe and tested it with 'ebook-convert counterpunch.recipe test --test -vv', but I get the traceback below:


Code:
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
1% Fetching feeds...
Traceback (most recent call last):
  File "/tmp/init.py", line 48, in <module>
  File "/home/kovid/build/calibre/src/calibre/ebooks/conversion/cli.py", line 254, in main
  File "/home/kovid/build/calibre/src/calibre/ebooks/conversion/plumber.py", line 836, in run
  File "/home/kovid/build/calibre/src/calibre/customize/conversion.py", line 216, in __call__
  File "/home/kovid/build/calibre/src/calibre/web/feeds/input.py", line 105, in convert
  File "/home/kovid/build/calibre/src/calibre/web/feeds/news.py", line 712, in download
  File "/home/kovid/build/calibre/src/calibre/web/feeds/news.py", line 837, in build_index
  File "/tmp/calibre_0.7.26_tmp_Ep1Dpi/calibre_0.7.26_IUpdj4_recipes/recipe0.py", line 15, in parse_index
    articles = self.parse_page(url)
  File "/tmp/calibre_0.7.26_tmp_Ep1Dpi/calibre_0.7.26_IUpdj4_recipes/recipe0.py", line 28, in parse_page
    dates_and_articles = soup.findAll(lambda tag: (tag.name == 'p' and
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 768, in findAll
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 332, in _findAll
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 890, in search
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 849, in searchTag
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 907, in _matches
  File "/tmp/calibre_0.7.26_tmp_Ep1Dpi/calibre_0.7.26_IUpdj4_recipes/recipe0.py", line 31, in <lambda>
    'Website of the' not in tag.decode('utf-8')) or
TypeError: 'NoneType' object is not callable

I assume it has something to do with the decode method. I have played with this for hours, and some of my changes alter the traceback, but the recipe still produces no feeds, even though the same code called directly from the command line gives me the feeds I need with no problem.
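
One guess at the mechanism, which I have not confirmed against the BeautifulSoup source: if the copy of BeautifulSoup that calibre loads has no `decode` method on its tags and instead treats unknown attributes as child-tag lookups, then `tag.decode` would evaluate to `None`, and calling it would raise this kind of TypeError. The sketch below reproduces that behaviour with a hypothetical `FakeTag` class; it is a stand-in for illustration, not BeautifulSoup code.

```python
class FakeTag(object):
    """Hypothetical stand-in for a parser tag; not BeautifulSoup itself."""

    def __init__(self, children=None):
        self.children = children or {}

    def find(self, name):
        # Return the named child, or None when there is no such child.
        return self.children.get(name)

    def __getattr__(self, name):
        # Python only calls this for attributes that do not otherwise exist.
        # Here tag.foo is treated as shorthand for tag.find('foo').
        if name.startswith('__'):
            raise AttributeError(name)
        return self.find(name)


tag = FakeTag()
try:
    # There is no real decode() method, so this becomes find('decode'),
    # which returns None, and None('utf-8') is not a valid call.
    tag.decode('utf-8')
    error = None
except TypeError as exc:
    error = str(exc)
```

If that is what is happening, it would explain why the code runs under one BeautifulSoup install and fails under another, but I may be wrong about the cause.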

Can anyone get it to run to grab the feeds for calibre?

Thanks