MobileRead Forums - View Single Post

issproevolution · 01-22-2014, 09:17 AM

Hi all!
First of all, sorry for my poor English.
I would like to improve one of Calibre's recipes to add some information and images to the books.
I have one big problem: sometimes the conversion fails for just one or two articles, and Calibre prints this:

Code:

HTML 5 parsing failed, falling back to older parsers
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/ebooks/oeb/parse_utils.py", line 277, in parse_html
    data = html5_parse(data)
  File "/usr/lib/calibre/calibre/ebooks/oeb/parse_utils.py", line 98, in html5_parse
    data = html5lib.parse(clean_xml_chars(data), treebuilder='lxml').getroot()
  File "/usr/lib/calibre/html5lib/html5parser.py", line 27, in parse
    return p.parse(doc, encoding=encoding)
  File "/usr/lib/calibre/html5lib/html5parser.py", line 227, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/usr/lib/calibre/html5lib/html5parser.py", line 96, in _parse
    self.mainLoop()
  File "/usr/lib/calibre/html5lib/html5parser.py", line 162, in mainLoop
    currentNodeName = currentNode.name if currentNode is not None else None
  File "/usr/lib/calibre/html5lib/treebuilders/etree_lxml.py", line 226, in _getName
    return infosetFilter.fromXmlName(self._name)
  File "/usr/lib/calibre/html5lib/ihatexml.py", line 276, in fromXmlName
    name = name.replace(item, self.unescapeChar(item))
  File "/usr/lib/calibre/html5lib/ihatexml.py", line 285, in unescapeChar
    return chr(int(charcode[1:], 16))
ValueError: chr() arg not in range(256)

But it happens randomly!
If I run the recipe again, it happens on a different article. Strange, isn't it?
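
To narrow it down, the last frame of the traceback can be imitated outside Calibre. The sketch below is only my reading of it, not html5lib's actual code: `unescape_char_narrow` and `unescape_char_wide` are hypothetical names, the `U` + hex escape form is guessed from `charcode[1:]` in the traceback, and the 256 limit emulates the byte-only `chr()` of Python 2 that the error message points to. If this reading is right, only pages whose markup contains an escaped character above code point 255 would trigger the error.

```python
def unescape_char_narrow(charcode):
    """Emulate a byte-only chr(): accept code points 0..255 only."""
    codepoint = int(charcode[1:], 16)  # drop the leading 'U', parse the hex digits
    if codepoint not in range(256):
        raise ValueError("chr() arg not in range(256)")
    return chr(codepoint)


def unescape_char_wide(charcode):
    """Same decoding, but any Unicode code point is allowed.

    Python 3's chr() covers the full range; in Python 2 the
    equivalent call would be unichr().
    """
    return chr(int(charcode[1:], 16))


# A plain ASCII escape decodes with either variant.
print(unescape_char_narrow("U0003A"))

# A code point above 255 only works with the wide variant.
print(repr(unescape_char_wide("U02019")))
try:
    unescape_char_narrow("U02019")
except ValueError as exc:
    print("fails like the traceback:", exc)
```

The sample escapes `U0003A` and `U02019` are made up for illustration; I have not checked which character in the feed actually causes the failure.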

My other goal is to add images to the book, but they don't appear.

Thank you so much for your support!!
Best regards

---

added info:
- rss: http://www.ilfattoquotidiano.it/cate...-palazzo/feed/
- code:

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class IlFattoQuotidianoDiISP(BasicNewsRecipe):
    title = u'Il fatto quotidiano ISP'
    oldest_article = 2
    max_articles_per_feed = 5
    language = 'it'
    __author__ = 'isspro'
    encoding = 'utf8'

    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True
    # auto_cleanup was assigned twice (True, then False); the last
    # assignment wins, so only the effective value is kept here.
    auto_cleanup = False

    # Keep only the article body and the meta bar.
    keep_only_tags = [
        dict(name='div', attrs={'class': 'post-content-container'}),
        dict(name='div', attrs={'id': 'meta-bar'}),
    ]

    # Drop the comments section and the tag list.
    remove_tags = [
        dict(name='div', attrs={'id': 'commenti'}),
        dict(name='div', attrs={'class': 'post-tags'}),
    ]

    # post-tags is a class, so the selector needs a leading dot.
    extra_css = '''
        h1 {font-size:x-large;}
        h2 {font-size:medium;}
        .post-tags {font-size:xx-small;}
        img {display:block;}
    '''

    feeds = [(u'Politica & Palazzo', u'http://www.ilfattoquotidiano.it/category/politica-palazzo/feed/')]