Hi Guys!
I'd like to ask for help, as I have spent hours trying different approaches to get the content I need, without success.
In the end, the best approach was the auto_cleanup option, which detects exactly what I want, except that it removes the photo from each news article.
The RSS feed I'd like to parse is:
http://www.diariodeburgos.es/rss/DBPortada.xml
I'm using the following code:
Code:
import time
from calibre.ptempfile import PersistentTemporaryFile
from calibre.web.feeds.news import BasicNewsRecipe


class DiarioDeBurgos(BasicNewsRecipe):
    title = u'Diario de Burgos'
    oldest_article = 1
    max_articles_per_feed = 10
    ignore_duplicate_articles = {'url'}
    use_embedded_content = False
    no_stylesheets = True
    auto_cleanup = True
    feeds = [
        (u'Portada', u'http://www.diariodeburgos.es/rss/DBPortada.xml'),
    ]

    def get_cover_url(self):
        return 'http://i.promecal.es/Portadas/DB-G.jpg'
I tried to use the 'auto_cleanup_keep' option, but it doesn't seem to work for me. I'd like to keep the div with the id `divImgNoticia0`; the tag looks like this:
<div id="divImgNoticia0" class="GaleriaNoticiaFoto" ...
I tried the following, but no luck:
auto_cleanup_keep = '//div[@id="divImgNoticia0"]'
I'd really appreciate it if someone could help me identify what I'm doing wrong. The auto_cleanup_keep option looks easy to use, but somehow it is not working.
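To rule out a mistake in the XPath expression itself, here is a minimal sketch that runs the same expression against a small fragment using Python's standard xml.etree.ElementTree. The fragment is made up from the tags quoted in this post, not copied from the real page, and the helper name find_photo_divs is only illustrative. ElementTree needs the relative form `.//div[...]` instead of `//div[...]`, and calibre evaluates auto_cleanup_keep with its own machinery, so this only checks the expression, not calibre's behaviour.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the tags quoted in this post;
# the real page markup is larger and may differ.
SAMPLE = """<html><body>
  <div class="Titular">Headline</div>
  <div id="divImgNoticia0" class="GaleriaNoticiaFoto"><img src="photo.jpg"/></div>
  <span id="ctl00_cph2Columnas_lblTextoNoticia">Body text</span>
  <div class="Otro">Sidebar</div>
</body></html>"""


def find_photo_divs(markup):
    """Return every div whose id is divImgNoticia0 (illustrative helper)."""
    root = ET.fromstring(markup)
    # ElementTree only accepts relative paths on an element, hence './/'.
    return root.findall(".//div[@id='divImgNoticia0']")


matches = find_photo_divs(SAMPLE)
print(len(matches), matches[0].get("class"))
```

If the expression matches here but the photo is still dropped in the recipe, the expression is probably not the problem.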
The idea is to keep only these tags:
<div class="Titular">
<span id="ctl00_cph2Columnas_lblTextoNoticia">
<div id="divImgNoticia0" class="GaleriaNoticiaFoto" style="cursor:pointer;cursor:hand">
I also tried the 'keep_only_tags' option, but no luck either: in that case the element 'ctl00_cph2Columnas_lblTextoNoticia' is not included.
Many thanks in advance for your help and time.
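For reference, a keep_only_tags list for the three tags above could be sketched roughly like this, assuming the usual dict(name=..., attrs=...) form used in calibre recipes. Only the list is shown so that it runs on its own; in the recipe it would be a class attribute of DiarioDeBurgos, and it is not necessarily the exact form I tried.

```python
# Sketch only: in the recipe this list would sit inside the
# DiarioDeBurgos class body as a class attribute.
keep_only_tags = [
    # Headline block
    dict(name='div', attrs={'class': 'Titular'}),
    # Article text
    dict(name='span', attrs={'id': 'ctl00_cph2Columnas_lblTextoNoticia'}),
    # Article photo
    dict(name='div', attrs={'id': 'divImgNoticia0'}),
]

# Print each entry to check the names and attributes at a glance.
for spec in keep_only_tags:
    print(spec['name'], spec['attrs'])
```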
Regards,
Nano.