#1 |
|
Member
Posts: 12
Karma: 10
Join Date: Oct 2010
Location: UK
Device: Kindle 3 WiFi, Kindle Paperwhite 2013
|
Code:
import urllib2
from BeautifulSoup import BeautifulSoup
from calibre.web.feeds.news import BasicNewsRecipe

class Counterpunch(BasicNewsRecipe):
    '''
    Parses counterpunch.com for articles
    '''
    def parse_index(self):
        feeds = []
        title, url = 'Counterpunch', 'http://www.counterpunch.com'
        articles = self.parse_page(url)
        if articles:
            feeds.append((title, articles))
        return feeds

    def parse_page(self, url):
        fd = urllib2.urlopen(url)
        soup = BeautifulSoup(fd, fromEncoding='iso-8859-1')
        articles = []
        current_date = ''
        # Gets all dates and entries in page order, e.g. date, list of articles for that date, next date, next list of articles
        # The first expression gets entries, the second gets dates
        dates_and_articles = soup.findAll(lambda tag: (tag.name == 'p' and
                                          tag.attrs == [(u'class', u'style2')] and
                                          len(tag) == 4 and
                                          'Website of the' not in tag.decode('utf-8')) or
                                          (tag.name == 'font' and
                                          tag.attrs == [(u'color', u'#990000'), (u'size', u'-1')]))
        for tag in dates_and_articles:
            #if 'Today\'s\n Stories' in tag.contents:
            if tag.name == 'p':
                # Logic to deal with the different ways names are printed (a color difference, I believe)
                if tag.find('span', {'class': 'style1'}):
                    author = tag.contents[0].contents[0] + ': '
                    url = 'http://www.counterpunch.com/' + tag.contents[3].attrs[0][1]
                else:
                    author = tag.contents[0] + ': '
                    url = 'http://www.counterpunch.com/' + tag.contents[3].attrs[0][1]
                title = author + str(tag.contents[3].contents[0])
                articles.append({'title': title, 'url': url, 'description':'', 'date': current_date})
            # If this is a new date, update current_date
            elif tag.name == 'font':
                current_date = tag.contents[0]
                #print('the date is {0}').format(current_date)
        # Keep just one day's articles for clearer, quicker debugging
        articles = [a for a in articles if a['date'] == 'October 11, 2010']
        return articles

# For debugging on the command line
#c = Counterpunch()
#print c.parse_index()
This is the first recipe I have written. It is for a site that has no RSS feed. The articles are in a table at the side of the page, separated by date headings. I mocked it up as a .py file first and got it to a workable state where it prints a list of feeds on the command line. I then made the few small changes needed to turn it into a recipe and tested it with 'ebook-convert counterpunch.recipe test --test -vv', but I get the traceback below:
Code:
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
1% Fetching feeds...
Traceback (most recent call last):
File "/tmp/init.py", line 48, in <module>
File "/home/kovid/build/calibre/src/calibre/ebooks/conversion/cli.py", line 254, in main
File "/home/kovid/build/calibre/src/calibre/ebooks/conversion/plumber.py", line 836, in run
File "/home/kovid/build/calibre/src/calibre/customize/conversion.py", line 216, in __call__
File "/home/kovid/build/calibre/src/calibre/web/feeds/input.py", line 105, in convert
File "/home/kovid/build/calibre/src/calibre/web/feeds/news.py", line 712, in download
File "/home/kovid/build/calibre/src/calibre/web/feeds/news.py", line 837, in build_index
File "/tmp/calibre_0.7.26_tmp_Ep1Dpi/calibre_0.7.26_IUpdj4_recipes/recipe0.py", line 15, in parse_index
articles = self.parse_page(url)
File "/tmp/calibre_0.7.26_tmp_Ep1Dpi/calibre_0.7.26_IUpdj4_recipes/recipe0.py", line 28, in parse_page
dates_and_articles = soup.findAll(lambda tag: (tag.name == 'p' and
File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 768, in findAll
File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 332, in _findAll
File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 890, in search
File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 849, in searchTag
File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 907, in _matches
File "/tmp/calibre_0.7.26_tmp_Ep1Dpi/calibre_0.7.26_IUpdj4_recipes/recipe0.py", line 31, in <lambda>
'Website of the' not in tag.decode('utf-8')) or
TypeError: 'NoneType' object is not callable
Can anyone get it to run to grab the feeds for calibre? Thanks
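One plausible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: in BeautifulSoup 3, an attribute on a tag that is not a real attribute falls back to a child-tag lookup. On a release whose Tag class has no decode method, tag.decode would then be treated like tag.find('decode') and return None, and calling None raises exactly this TypeError. The sketch below reproduces that mechanism in Python 3 with a hypothetical FakeTag class; it is a stand-in for illustration, not BeautifulSoup's real code.

```python
class FakeTag:
    """Hypothetical stand-in for a BeautifulSoup 3 style tag (not the real class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def find(self, name):
        # Return the first child tag with the given name, or None if there is none.
        for child in self.children:
            if child.name == name:
                return child
        return None

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        # Unknown attributes are treated as child-tag lookups, so tag.b means tag.find('b').
        if attr.startswith('__'):
            raise AttributeError(attr)
        return self.find(attr)


p = FakeTag('p', [FakeTag('b')])
child = p.b            # a <b> child exists, so the lookup returns it
missing = p.decode     # no <decode> child and no decode method, so this is None

try:
    p.decode('utf-8')  # same shape as the failing call inside the lambda
except TypeError as exc:
    message = str(exc)

print(child.name)
print(missing)
print(message)
```

If that is what is happening here, a check that does not rely on decode, such as 'Website of the' not in str(tag), may avoid the error. This is untested against calibre 0.7.26.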
#2 |
|
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
I tested briefly on another machine, and your feed parsed correctly. The articles weren't being downloaded, and I didn't debug why, but you were parsing the articles and building the feed from your source page just fine.
The recipe didn't finish, and I'm not sure whether all your articles were parsed correctly, but most were. I started to play with it: I added a postprocess_html for debugging, cleaned up some comments, and added some print statements. The recipe then finished (with empty articles), but that's as far as I went. I know it's not much, but I thought you might want to know you weren't ignored.
#3 |
|
Junior Member
Posts: 4
Karma: 10
Join Date: Dec 2010
Device: Kindle 3
|
Counterpunch is a good web publication and as a calibre user I would appreciate it if its recipe gets debugged and put into the software distribution.
#4 |
|
Member
Posts: 19
Karma: 10
Join Date: Jul 2010
Device: Calibre
|
It's been a year and a half since the original post. Does anyone know about any developments? I really would like to get a hold of a working recipe for CounterPunch. Thanks.
#5 |
|
Member
Posts: 12
Karma: 10
Join Date: Oct 2010
Location: UK
Device: Kindle 3 WiFi, Kindle Paperwhite 2013
|
I rewrote it and got it working.
I have contributed it to Calibre. It will be included from the version released today (0.8.12). If you don't want to update, you can use the file attached to this post. Enjoy!
#6 |
|
Member
Posts: 19
Karma: 10
Join Date: Jul 2010
Device: Calibre
|
Thank you so much. So far so good! I love it!
#7 |
|
Member
Posts: 19
Karma: 10
Join Date: Jul 2010
Device: Calibre
|
There seems to be a limit of 10 entries per day, yet some days the site has fewer than 10 and some days more. So how does that work? Is there a way to make sure that no entries are repeated and that all entries eventually get downloaded? I'm new to this, so I am not sure how it works. Thanks.
#8 |
|
Member
Posts: 12
Karma: 10
Join Date: Oct 2010
Location: UK
Device: Kindle 3 WiFi, Kindle Paperwhite 2013
|
Counterpunch have redesigned their site and now have an RSS feed, making things easier for the recipe.
I have rewritten the recipe and submitted it to Calibre. It will be in the next version, which should be released next Friday (9 Sept). In the meantime you can use the version attached to this post.
@aritza The new recipe has a limit of 7 days/100 posts, but since it now works from RSS it is really limited by the number of posts in the feed (25 at this time).
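How those limits interact can be illustrated with a small Python 3 sketch. It is a simplified model, not calibre's actual code: select_articles, its parameters, the reference date and the sample feed are all hypothetical, and only the 7 day, 100 post and 25 entry figures come from the post above. In a recipe, limits like these are normally set through the oldest_article and max_articles_per_feed attributes of BasicNewsRecipe.

```python
from datetime import datetime, timedelta


def select_articles(entries, now, oldest_days=7, max_articles=100):
    """Keep entries no older than oldest_days, newest first, capped at max_articles."""
    cutoff = now - timedelta(days=oldest_days)
    recent = [e for e in entries if e['published'] >= cutoff]
    recent.sort(key=lambda e: e['published'], reverse=True)
    return recent[:max_articles]


# Made-up reference time and a made-up feed of 25 entries, one every 6 hours,
# so the oldest entry is 6 days old and everything falls inside the 7 day window.
now = datetime(2011, 9, 2, 12, 0)
feed = [
    {'title': 'Post %d' % i, 'published': now - timedelta(hours=6 * i)}
    for i in range(25)
]

selected = select_articles(feed, now)        # limits of 7 days / 100 posts
capped = select_articles(feed, now, 7, 10)   # a tighter cap of 10 posts

print(len(selected))
print(len(capped))
```

With the looser limits every entry in the feed is kept, so the feed length is what actually bounds the download. With a cap of 10 only the ten newest entries survive, and anything that has dropped out of the feed by the next run is not picked up.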