MobileRead Forums - View Single Post - Adding a comic strip to a newspaper's recipe

a.peter · 09-23-2011, 02:16 AM

Quote:

Originally Posted by macpablus

The problem is that my complete recipe has other feeds (i.e., the content of the whole newspaper, with many different sections and articles), so the option keep_only_tags will affect each of the articles.

I know that my recipe isn't complete. It was only meant to show you that calibre expects the URL to point to an HTML page. You passed the address of a GIF image to calibre, which was interpreted as an HTML page and produced the character garbage you've seen.

The good news is that the keep_only_tags member is a list of dictionaries, so you can add further entries to match other pages. If I take a look at an article, e.g. http://www.pagina12.com.ar/diario/el...011-09-22.html, I see that the actual article is embedded in a <div class="nota top12"> tag.

A modified keep_only_tags might be:

Code:

keep_only_tags = [dict(name='div', attrs={'id':'rudy_paz'}), dict(name='div', attrs={'class':'nota top12'})]

With this code, calibre will keep

all <div> tags with id="rudy_paz" and
all <div> tags with class="nota top12".

It doesn't matter that they don't appear on the same page: a tag is kept if it matches any entry in the list. So if you pass one page with the comic strip and a list of pages with articles, it will work on both.
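To make the "any entry matches" behaviour concrete, here is a small stand-alone sketch. It is only an illustration and not calibre's actual implementation: the helpers matches() and keep(), the (name, attrs) tuples standing in for parsed tags, and the extra ids/classes 'menu' and 'publicidad' are all made up for the example.

```python
# Illustration only: a simplified stand-in for how a list of tag specs
# selects elements. This is NOT calibre's implementation.

def matches(tag, spec):
    """Return True if a (name, attrs) tag satisfies one spec dictionary."""
    name, attrs = tag
    if spec.get('name') not in (None, name):
        return False
    # Every attribute listed in the spec must be present with that exact value.
    return all(attrs.get(k) == v for k, v in spec.get('attrs', {}).items())

def keep(tags, specs):
    """Keep every tag that satisfies at least one spec in the list."""
    return [t for t in tags if any(matches(t, s) for s in specs)]

keep_only_tags = [
    dict(name='div', attrs={'id': 'rudy_paz'}),
    dict(name='div', attrs={'class': 'nota top12'}),
]

# Hypothetical pages: one with the comic strip, one with an article.
comic_page = [('div', {'id': 'rudy_paz'}), ('div', {'id': 'menu'})]
article_page = [('div', {'class': 'nota top12'}), ('div', {'class': 'publicidad'})]

kept_comic = keep(comic_page, keep_only_tags)
kept_article = keep(article_page, keep_only_tags)
print(kept_comic)
print(kept_article)
```

Each page keeps only the <div> that matches one of the two entries, so the same list can serve the comic page and the article pages.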

By the way: for convenience, you may replace the value in an attrs entry of keep_only_tags with a compiled regular expression, e.g. attrs={'class':re.compile('top.*')}

But don't forget to add a

Code:

import re

at the top of the recipe.
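If you want to check a pattern before putting it into the recipe, you can try it in plain Python. The sketch below assumes the pattern is searched anywhere inside the attribute value (an assumption about the parser, so test it against your own pages); attr_matches() and the sample class values are hypothetical and not part of calibre.

```python
import re

pattern = re.compile('top.*')

def attr_matches(value, expected):
    """Compare an attribute value with a plain string or a compiled regex."""
    if hasattr(expected, 'search'):
        # Assumption: the regex may match anywhere in the value (search, not match).
        return expected.search(value) is not None
    return value == expected

# Hypothetical class values as they might appear on article pages.
classes = ['nota top12', 'nota top8', 'publicidad']
hits = [c for c in classes if attr_matches(c, pattern)]
print(hits)
```

Note that a loose pattern like 'top.*' also matches any other class containing "top", so a tighter expression may be safer.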