MobileRead Forums - View Single Post - New Recipe

nanodreams · 03-12-2015, 04:09 PM

Hi Guys!

I'd like to ask for help: I have spent hours trying different approaches to get the content I need, without success.

In the end, the best approach was the auto_cleanup option, which detects exactly what I want, except that it removes the photo that accompanies each news article.

The RSS feed I'd like to parse is:

http://www.diariodeburgos.es/rss/DBPortada.xml

I'm using the following code:

Code:

        from calibre.web.feeds.news import BasicNewsRecipe


        class DiarioDeBurgos(BasicNewsRecipe):
            title = u'Diario de Burgos'
            oldest_article = 1               # in days
            max_articles_per_feed = 10
            ignore_duplicate_articles = {'url'}
            use_embedded_content = False     # fetch the linked article pages
            no_stylesheets = True
            auto_cleanup = True              # let calibre detect the article body

            feeds = [
                (u'Portada', u'http://www.diariodeburgos.es/rss/DBPortada.xml'),
            ]

            def get_cover_url(self):
                return 'http://i.promecal.es/Portadas/DB-G.jpg'

I tried to use the 'auto_cleanup_keep' option, but it doesn't seem to work for me. I'd like to keep the div with the id `divImgNoticia0`; its tag looks like this:

<div id="divImgNoticia0" class="GaleriaNoticiaFoto" ...

I tried the following, but with no luck:

auto_cleanup_keep = '//div[@id="divImgNoticia0"]'

I'd really appreciate it if someone could help me identify what I'm doing wrong. auto_cleanup_keep looks easy to use, but somehow it is not working.
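
A quick way to rule out the XPath expression itself as the culprit is to run it against a small hand-written fragment outside calibre. The sketch below uses only Python's standard library. The `SAMPLE` markup and the `matching_ids` helper are made up for illustration and are not part of calibre or of the real page, and the check says nothing about how auto_cleanup treats the element afterwards.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article fragment; the real page markup may differ.
SAMPLE = """
<html><body>
  <div class="Titular"><h1>Headline</h1></div>
  <div id="divImgNoticia0" class="GaleriaNoticiaFoto"><img src="photo.jpg"/></div>
  <span id="ctl00_cph2Columnas_lblTextoNoticia">Article text</span>
  <div id="sidebar">Unrelated</div>
</body></html>
"""


def matching_ids(xpath, markup=SAMPLE):
    """Return the id attributes of the elements selected by `xpath`."""
    root = ET.fromstring(markup)
    # ElementTree only accepts relative paths on an element, so the
    # '//div[...]' form used in the recipe is rewritten as './/div[...]'.
    if xpath.startswith('//'):
        xpath = '.' + xpath
    return [el.get('id') for el in root.findall(xpath)]


print(matching_ids('//div[@id="divImgNoticia0"]'))
```

If the expression selects the div here but the photo is still missing from the e-book, the expression is probably not the problem, and the cause may lie in how the real page delivers the image markup or in the cleanup step itself.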

The idea is to keep only these tags:

<div class="Titular">
<span id="ctl00_cph2Columnas_lblTextoNoticia">
<div id="divImgNoticia0" class="GaleriaNoticiaFoto" style="cursor:pointer;cursor:hand">

I also tried the 'keep_only_tags' option, but had no luck either: in that case the element 'ctl00_cph2Columnas_lblTextoNoticia' is not included in the output.
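
For comparison, here is a minimal sketch of a keep_only_tags variant covering the three elements listed above. It assumes auto_cleanup is switched off so that the two mechanisms do not interact, and that the ids and class names are exactly as quoted in this post; it has not been checked against the live site. The `try`/`except` stub exists only so the snippet can be imported outside calibre and is not needed in a real recipe.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the snippet imports outside calibre; drop it in a real recipe.
    class BasicNewsRecipe(object):
        pass


class DiarioDeBurgos(BasicNewsRecipe):
    title = u'Diario de Burgos'
    oldest_article = 1
    max_articles_per_feed = 10
    ignore_duplicate_articles = {'url'}
    use_embedded_content = False
    no_stylesheets = True
    auto_cleanup = False  # assumption: rely on keep_only_tags alone

    # Each dict describes one element to keep, together with its children.
    keep_only_tags = [
        dict(name='div', attrs={'class': 'Titular'}),
        dict(name='span', attrs={'id': 'ctl00_cph2Columnas_lblTextoNoticia'}),
        dict(name='div', attrs={'id': 'divImgNoticia0'}),
    ]

    feeds = [
        (u'Portada', u'http://www.diariodeburgos.es/rss/DBPortada.xml'),
    ]

    def get_cover_url(self):
        return 'http://i.promecal.es/Portadas/DB-G.jpg'
```

Listing the span by id rather than by class is a deliberate choice in this sketch, since the id is the only attribute of that element quoted in the post.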

Many thanks in advance for your help and time.

Regards,
Nano.

Thread: New Recipe - www.diariodeburgos.es
nanodreams · Junior Member · Posts: 1 · Karma: 10 · Join Date: Mar 2015 · Device: Kindle

Last edited by PeterT; 03-12-2015 at 06:09 PM. Reason: Edited to include [code] . [/code] to make the script easier to read