11-19-2011, 03:11 PM | #1 |
Member
Posts: 22
Karma: 12
Join Date: Feb 2009
Location: Zaragoza, Spain
Device: prs-505, iliad
|
New version for "Expansion" recipe (now work correctly) [Spanish]
Hi there:
Recipe for "Expansion.com" isn't mine, but it didn't work and I fix it. The changes are: - Changed "feedsportal" rss feeds to Expansion equivalents - To clear advertising to recover the news, and always recovers all website articles, I send the url with variable "t" reporting "linux" or epoch time for the website believes that just showed above the advertising - Code to present the image of the videos embedded - Get the cover - Delete duplicate articles if they are in more than one feed. - Update "remove_tags" - Add "extra_css" Code:
#!/usr/bin/env python __license__ = 'GPL v3' __copyright__ = '5, January 2011 Gerardo Diez<gerardo.diez.garcia@gmail.com> & desUBIKado' __author__ = 'desUBIKado, based on an earlier version by Gerardo Diez' __version__ = 'v1.01' __date__ = '13, November 2011' ''' http://www.expansion.com/ ''' import time import re from calibre.web.feeds.recipes import BasicNewsRecipe class expansion_spanish(BasicNewsRecipe): __author__ ='Gerardo Diez & desUBIKado' description ='Financial news from Spain' title =u'Expansion' publisher =u'Unidad Editorial Internet, S.L.' category ='news, finances, Spain' oldest_article = 2 simultaneous_downloads = 10 max_articles_per_feed =100 timefmt = '[%a, %d %b, %Y]' encoding ='iso-8859-15' language ='es' use_embedded_content = False remove_javascript = True no_stylesheets = True remove_empty_feeds = True keep_only_tags =dict(name='div', attrs={'class':['noticia primer_elemento']}) remove_tags =[ dict(name='div', attrs={'class':['compartir', 'metadata_desarrollo_noticia', 'relacionadas', 'mas_info','publicidad publicidad_textlink', 'ampliarfoto','tit_relacionadas','interact','paginacion estirar','sumario derecha']}), dict(name='ul', attrs={'class':['bolos_desarrollo_noticia','not_logged']}), dict(name='span', attrs={'class':['comentarios']}), dict(name='p', attrs={'class':['cintillo_comentarios', 'cintillo_comentarios formulario']}), dict(name='div', attrs={'id':['comentarios_lectores_listado','comentar']}) ] feeds =[ (u'Portada', u'http://estaticos.expansion.com/rss/portada.xml'), (u'Portada: Bolsas', u'http://estaticos.expansion.com/rss/mercados.xml'), (u'Divisas', u'http://estaticos.expansion.com/rss/mercadosdivisas.xml'), (u'Euribor', u'http://estaticos.expansion.com/rss/mercadoseuribor.xml'), (u'Materias Primas', u'http://estaticos.expansion.com/rss/mercadosmateriasprimas.xml'), (u'Renta Fija', u'http://estaticos.expansion.com/rss/mercadosrentafija.xml'), (u'Portada: Mi Dinero', u'http://estaticos.expansion.com/rss/midinero.xml'), (u'Hipotecas', u'http://estaticos.expansion.com/rss/midinerohipotecas.xml'), (u'Cr\xe9ditos', u'http://estaticos.expansion.com/rss/midinerocreditos.xml'), (u'Pensiones', u'http://estaticos.expansion.com/rss/midineropensiones.xml'), (u'Fondos de Inversi\xf3n', u'http://estaticos.expansion.com/rss/midinerofondos.xml'), (u'Motor', u'http://estaticos.expansion.com/rss/midineromotor.xml'), (u'Portada: Empresas', u'http://estaticos.expansion.com/rss/empresas.xml'), (u'Banca', u'http://estaticos.expansion.com/rss/empresasbanca.xml'), (u'TMT', u'http://estaticos.expansion.com/rss/empresastmt.xml'), (u'Energ\xeda', u'http://estaticos.expansion.com/rss/empresasenergia.xml'), (u'Inmobiliario y Construcci\xf3n', u'http://estaticos.expansion.com/rss/empresasinmobiliario.xml'), (u'Transporte y Turismo', u'http://estaticos.expansion.com/rss/empresastransporte.xml'), (u'Automoci\xf3n e Industria', u'http://estaticos.expansion.com/rss/empresasauto-industria.xml'), (u'Distribuci\xf3n', u'http://estaticos.expansion.com/rss/empresasdistribucion.xml'), (u'Deporte y Negocio', u' http://estaticos.expansion.com/rss/empresasdeporte.xml'), (u'Mi Negocio', u'http://estaticos.expansion.com/rss/empresasminegocio.xml'), (u'Interiores', u'http://estaticos.expansion.com/rss/empresasinteriores.xml'), (u'Digitech', u'http://estaticos.expansion.com/rss/empresasdigitech.xml'), (u'Portada: Econom\xeda y Pol\xedtica', u'http://estaticos.expansion.com/rss/economiapolitica.xml'), (u'Pol\xedtica', u'http://estaticos.expansion.com/rss/economia.xml'), (u'Portada: Sociedad', u'http://estaticos.expansion.com/rss/entorno.xml'), (u'Portada: Opini\xf3n', u'http://estaticos.expansion.com/rss/opinion.xml'), (u'Llaves y editoriales', u'http://estaticos.expansion.com/rss/opinioneditorialyllaves.xml'), (u'Tribunas', u'http://estaticos.expansion.com/rss/opiniontribunas.xml'), (u'Portada: Jur\xeddico', u'http://estaticos.expansion.com/rss/juridico.xml'), (u'Entrevistas', u'http://estaticos.expansion.com/rss/juridicoentrevistas.xml'), (u'Opini\xf3n', u'http://estaticos.expansion.com/rss/juridicoopinion.xml'), (u'Sentencias', u'http://estaticos.expansion.com/rss/juridicosentencias.xml'), (u'Mujer', u'http://estaticos.expansion.com/rss/mujer-empresa.xml'), (u'Catalu\xf1a', u'http://estaticos.expansion.com/rss/catalunya.xml'), (u'Funci\xf3n p\xfablica', u'http://estaticos.expansion.com/rss/funcion-publica.xml') ] # Obtener la imagen de portada def get_cover_url(self): cover = None st = time.localtime() year = str(st.tm_year) month = "%.2d" % st.tm_mon day = "%.2d" % st.tm_mday #http://img5.kiosko.net/2011/11/14/es/expansion.750.jpg cover='http://img5.kiosko.net/'+ year + '/' + month + '/' + day +'/es/expansion.750.jpg' br = BasicNewsRecipe.get_browser() try: br.open(cover) except: self.log("\nPortada no disponible") cover ='http://www.aproahp.org/enlaces/images/diario_expansion.gif' return cover # Para que no salte la publicidad al recuperar la noticia, y que siempre se recupere # la página web, mando la variable "t" con la hora "linux" o "epoch" actual # haciendole creer al sitio web que justo se acaba de ver la publicidad def print_version(self, url): st = time.time() segundos = str(int(st)) parametros = '.html?t=' + segundos return url.replace('.html', parametros) _processed_links = [] def get_article_url(self, article): # Para obtener la url original del artículo a partir de la de "feedsportal" link = article.get('link', None) if link is None: return article if link.split('/')[-1]=="story01.htm": link=link.split('/')[-2] a=['0B','0C','0D','0E','0F','0G','0N' ,'0L0S','0A'] b=['.' ,'/' ,'?' ,'-' ,'=' ,'&' ,'.com','www.','0'] for i in range(0,len(a)): link=link.replace(a[i],b[i]) link="http://"+link # Eliminar artículos duplicados en otros feeds if not (link in self._processed_links): self._processed_links.append(link) else: link = None return link # Un poco de css para mejorar la presentación de las noticias extra_css = ''' .entradilla {font-family:Arial,Helvetica,sans-serif; font-weight:bold; font-style:italic; font-size:16px;} .fecha_publicacion,.autor {font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:14px;} ''' # Para presentar la imagen de los videos incrustados preprocess_regexps = [ (re.compile(r'var imagen', re.DOTALL|re.IGNORECASE), lambda match: '--></script><img src'), (re.compile(r'.jpg";', re.DOTALL|re.IGNORECASE), lambda match: '.jpg">'), (re.compile(r'var id_reproductor', re.DOTALL|re.IGNORECASE), lambda match: '<script language="Javascript" type="text/javascript"><!--'), ] Last edited by kovidgoyal; 11-19-2011 at 10:05 PM. |
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
New VErsion for "IL GIORNALE" recipe (now work correctly) | gambarini | Recipes | 0 | 11-09-2011 12:05 PM |
Guide for you can see correctly the pdfs "comics, images, documents" in ereaders sony | computersexperto | LRF | 2 | 11-29-2009 06:13 AM |
"Do not buy this book till the Kindle version is priced correctly!" --WTF? | taglines | News | 3 | 11-05-2009 11:32 AM |
"Sort By Author" not sorting correctly within author's collection | Sonist | Amazon Kindle | 1 | 08-05-2009 07:52 PM |
"Menu" and "Mark" keys does not work | murad | Sony Reader | 4 | 07-11-2009 12:35 PM |