Hi all!
First of all, sorry for my poor English.
I would like to improve one of my Calibre recipes so that it adds some extra information and images to the generated book.
I have one big problem: sometimes Calibre fails on just one or two articles and prints this:
Code:
HTML 5 parsing failed, falling back to older parsers
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/ebooks/oeb/parse_utils.py", line 277, in parse_html
    data = html5_parse(data)
  File "/usr/lib/calibre/calibre/ebooks/oeb/parse_utils.py", line 98, in html5_parse
    data = html5lib.parse(clean_xml_chars(data), treebuilder='lxml').getroot()
  File "/usr/lib/calibre/html5lib/html5parser.py", line 27, in parse
    return p.parse(doc, encoding=encoding)
  File "/usr/lib/calibre/html5lib/html5parser.py", line 227, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/usr/lib/calibre/html5lib/html5parser.py", line 96, in _parse
    self.mainLoop()
  File "/usr/lib/calibre/html5lib/html5parser.py", line 162, in mainLoop
    currentNodeName = currentNode.name if currentNode is not None else None
  File "/usr/lib/calibre/html5lib/treebuilders/etree_lxml.py", line 226, in _getName
    return infosetFilter.fromXmlName(self._name)
  File "/usr/lib/calibre/html5lib/ihatexml.py", line 276, in fromXmlName
    name = name.replace(item, self.unescapeChar(item))
  File "/usr/lib/calibre/html5lib/ihatexml.py", line 285, in unescapeChar
    return chr(int(charcode[1:], 16))
ValueError: chr() arg not in range(256)
But it happens randomly!
If I run the recipe again, the error shows up on a different article. Strange, isn't it?
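Reading the traceback, my guess is that the last call, chr(int(charcode[1:], 16)), is what breaks: under Python 2, chr() only accepts values from 0 to 255, so an escaped tag name that contains a character above that range would raise exactly this ValueError. Below is a minimal sketch of that idea, not the real html5lib code. The names ESCAPE_RE, py2_chr and unescape_name are my own, and the "U plus five hex digits" escape pattern is an assumption on my part. It does not explain why the failing article changes between runs.

```python
import re

# Assumed escape format: "U" followed by five uppercase hex digits.
ESCAPE_RE = re.compile(r"U[\dA-F]{5}")


def py2_chr(code):
    """Emulate Python 2's chr(), which only accepts 0..255."""
    if not 0 <= code < 256:
        raise ValueError("chr() arg not in range(256)")
    return chr(code)


def unescape_name(name, to_char=py2_chr):
    """Replace every escaped sequence in a tag name with its character."""
    for item in set(ESCAPE_RE.findall(name)):
        # Same shape as the failing line in the traceback:
        # drop the leading "U", parse the rest as hexadecimal.
        name = name.replace(item, to_char(int(item[1:], 16)))
    return name


if __name__ == "__main__":
    # An escaped colon (0x3A) is below 256 and decodes fine.
    print(unescape_name("aU0003Ab"))
    # An escaped curly quote (0x201C) is above 255 and fails.
    try:
        unescape_name("spanU0201C")
    except ValueError as err:
        print("ValueError:", err)
```

If this guess is right, passing a unicode-aware function as to_char (for example Python 3's chr) decodes the same name without an error.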
My other goal is to add images to the book, but they don't appear.
Thank you so much for your support!
Best regards
---
Additional info:
- RSS feed:
http://www.ilfattoquotidiano.it/cate...-palazzo/feed/
- code:
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class IlFattoQuotidianoDiISP(BasicNewsRecipe):
    title = u'Il fatto quotidiano ISP'
    __author__ = 'isspro'
    language = 'it'
    encoding = 'utf8'

    oldest_article = 2
    max_articles_per_feed = 5

    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True
    # auto_cleanup was set twice (True, then False); the last assignment
    # wins, so only the effective value is kept. It has to stay off for
    # keep_only_tags / remove_tags below to do the cleanup.
    auto_cleanup = False

    # Keep only the article body and the meta bar.
    keep_only_tags = [
        dict(name='div', attrs={'class': 'post-content-container'}),
        dict(name='div', attrs={'id': 'meta-bar'}),
    ]

    # Drop the comments section and the tag list.
    remove_tags = [
        dict(name='div', attrs={'id': 'commenti'}),
        dict(name='div', attrs={'class': 'post-tags'}),
    ]

    # ".post-tags" is a class selector; without the dot it matched nothing.
    extra_css = '''
        h1 {font-size:x-large;}
        h2 {font-size:medium;}
        .post-tags {font-size:xx-small;}
        img {display:block;}
    '''

    feeds = [(u'Politica & Palazzo', u'http://www.ilfattoquotidiano.it/category/politica-palazzo/feed/')]