MobileRead Forums - View Single Post - Custom recipes (archive, read-only)

jbambridge · 08-06-2009, 05:47 AM

Problem parsing Guardian RSS feed:

I have tried to update the Guardian recipe to cope with recent changes to the website. I am almost there, but the occasional article causes the following error in ebook-convert:

Quote:

Parsing feed_0/article_7/index.html ...
Traceback (most recent call last):
  File "cli.py", line 254, in <module>
  File "cli.py", line 246, in main
  File "calibre\ebooks\conversion\plumber.pyo", line 657, in run
  File "calibre\ebooks\conversion\plumber.pyo", line 761, in create_oebbook
  File "calibre\ebooks\oeb\reader.pyo", line 72, in __call__
  File "calibre\ebooks\oeb\reader.pyo", line 588, in _all_from_opf
  File "calibre\ebooks\oeb\reader.pyo", line 243, in _manifest_from_opf
  File "calibre\ebooks\oeb\reader.pyo", line 176, in _manifest_add_missing
  File "calibre\ebooks\oeb\base.pyo", line 988, in fget
  File "calibre\ebooks\oeb\base.pyo", line 917, in _parse_xhtml
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'
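
For reference, the last line of the traceback is the generic Python error raised when `+=` is applied to a value that is `None` with a string on the right-hand side. The sketch below reproduces only that error type; it is not calibre's code (the traceback does not show the source of `_parse_xhtml`), and the helper name `append_text` is made up for illustration.

```python
# Minimal reproduction of the error type only; this is NOT calibre's code.
# append_text is a hypothetical helper used purely for illustration.
def append_text(buffer, text):
    """Append text to buffer with +=, the operator named in the traceback."""
    buffer += text
    return buffer


try:
    # Passing None where a string is expected triggers the same TypeError.
    append_text(None, 'some text')
except TypeError as err:
    message = str(err)
    print(message)
```

In other words, somewhere in `_parse_xhtml` a variable that was expected to hold text was `None` for this particular article. Which variable, and why, cannot be read off the traceback alone.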

The modified recipe is as follows:

Quote:

#!/usr/bin/env python
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
__docformat__ = 'restructuredtext en'

'''
www.guardian.co.uk
'''

from calibre.web.feeds.news import BasicNewsRecipe

class Guardian(BasicNewsRecipe):

    title = u'My Guardian'
    language = _('English')
    encoding = 'utf-8'
    oldest_article = 7
    max_articles_per_feed = 20
    remove_javascript = True
    simultaneous_downloads = 1
    use_embedded_content = False
    recursions = 0
    filter_regexps = [r'\.g\.doubleclick\.net']

    timefmt = ' [%a, %d %b %Y]'

    keep_only_tags = [dict(id=['article-wrapper', 'main-article-info'])]

    no_stylesheets = True
    extra_css = 'h2 {font-size: medium;} \n h1 {text-align: left;}'

    feeds = [
        ('Front Page', 'http://feeds.guardian.co.uk/theguardian/rss'),
        # ('UK', 'http://feeds.guardian.co.uk/theguardian/uk/rss'),
        # ('Business', 'http://www.guardian.co.uk/business/rss'),
        # ('Politics', 'http://feeds.guardian.co.uk/theguardian/politics/rss'),
        # ('Culture', 'http://feeds.guardian.co.uk/theguardian/culture/rss'),
        # ('Money', 'http://feeds.guardian.co.uk/theguardian/money/rss'),
        # ('Life & Style', 'http://feeds.guardian.co.uk/theguardian/lifeandstyle/rss'),
        # ('Travel', 'http://feeds.guardian.co.uk/theguardian/travel/rss'),
        # ('Environment', 'http://feeds.guardian.co.uk/theguardian/environment/rss')
    ]

    def print_version(self, url):
        return url + '/print'
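
As a side check, the `print_version` logic can be tried on its own, without calibre installed. The sketch below copies the same string concatenation into a plain function; the article URL and the `?source=rss` suffix are hypothetical, and the post does not establish whether Guardian feed links actually carry such suffixes.

```python
# Standalone copy of the recipe's print_version logic (no calibre needed).
def print_version(url):
    return url + '/print'


# Hypothetical article URL, made up for illustration.
sample = 'http://www.guardian.co.uk/some-section/some-article'

plain_result = print_version(sample)
print(plain_result)

# Anything already at the end of the URL is kept, so '/print' lands after it.
# The query string here is an assumption, not something taken from the feed.
suffixed_result = print_version(sample + '?source=rss')
print(suffixed_result)
```

If the feed did hand back links with a query string or a trailing slash, the resulting print URL would be malformed, so it may be worth printing a few real URLs from the feed to rule that out.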

Any ideas what the error means?

John