Old 09-23-2011, 02:16 AM   #8
a.peter
Quote:
Originally Posted by macpablus
The problem is that my complete recipe has other feeds (i.e, the content of the whole newspaper, with many different sections and articles), so the option keep_only_tags will affect each of the articles.
I know my recipe isn't complete. It was only meant to show you that calibre expects each URL to point to an HTML page. You passed the address of a GIF image to calibre, which was interpreted as an HTML page and produced the character garbage you've seen.

The good news is that the keep_only_tags member is a list of dictionaries, so you can add any other entries you need to parse other pages. If I take a look at an article, e.g. http://www.pagina12.com.ar/diario/el...011-09-22.html, I see that the actual article is embedded in a <div class="nota top12"> tag.

A modified keep_only_tags may be:

Code:
keep_only_tags = [dict(name='div', attrs={'id':'rudy_paz'}), dict(name='div', attrs={'class':'nota top12'})]
With this code, calibre will keep
  • every <div> with id="rudy_paz", and
  • every <div> with class="nota top12"

It doesn't matter that they don't appear on the same page. A tag is kept if it matches any one of the entries, so if you pass one page with the comic strip and a list of pages with articles, the recipe will work on both of them.
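To make the "any entry may match" behaviour concrete, here is a small stand-alone sketch. It is only a simplified model and not calibre's actual code (calibre hands the matching to its HTML parser); the helper is_kept and the "sidebar" class are made up for illustration.

Code:
```python
# Simplified model of how a keep_only_tags list is applied.
# NOT calibre's implementation: is_kept is a hypothetical helper.
keep_only_tags = [
    dict(name='div', attrs={'id': 'rudy_paz'}),
    dict(name='div', attrs={'class': 'nota top12'}),
]

def is_kept(tag_name, tag_attrs, specs):
    """Return True if the tag satisfies at least one entry in specs."""
    for spec in specs:
        if spec['name'] != tag_name:
            continue
        wanted = spec.get('attrs', {})
        # Every attribute named in the entry must have the expected value.
        if all(tag_attrs.get(key) == value for key, value in wanted.items()):
            return True
    return False

# A tag from the comic page, one from an article page, and an unrelated one.
print(is_kept('div', {'id': 'rudy_paz'}, keep_only_tags))       # True
print(is_kept('div', {'class': 'nota top12'}, keep_only_tags))  # True
print(is_kept('div', {'class': 'sidebar'}, keep_only_tags))     # False
```
Each page is checked against the whole list, so an entry that matches nothing on a given page simply has no effect there.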

By the way: for convenience, you may replace the value of an attrs entry in keep_only_tags with a compiled regular expression, e.g. attrs={'class':re.compile('top.*')}

But don't forget to add a
Code:
import re
at the top of the recipe.
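If you want to check what a pattern will catch before putting it into the recipe, you can try it in plain Python. This is only a sketch: class_matches is a made-up helper, and it assumes the pattern is searched for anywhere inside the class value (as BeautifulSoup does), not anchored at the start.

Code:
```python
import re

pattern = re.compile('top.*')

def class_matches(class_value, wanted):
    """Hypothetical helper: compare a class value with a string or a regex."""
    if hasattr(wanted, 'search'):
        # Compiled regular expression: assumed to match anywhere in the value.
        return wanted.search(class_value) is not None
    # Plain string: the whole attribute value must be equal.
    return class_value == wanted

print(class_matches('nota top12', 'nota top12'))  # True
print(class_matches('nota top12', pattern))       # True
print(class_matches('nota', pattern))             # False
print(class_matches('desktop', pattern))          # True
```
Note the last line: a loose pattern like 'top.*' also catches unrelated classes that happen to contain "top", so make the expression as specific as you need.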