04-04-2013, 06:34 AM   #1
josepinto
Diário de Notícias

Portuguese newspaper.

Basic recipe:

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1365070687(BasicNewsRecipe):
    title = u'Di\xe1rio de Not\xedcias'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup = True
    feeds = [(u'Portugal', u'http://feeds.dn.pt/DN-Portugal'),
        (u'Globo', u'http://feeds.dn.pt/DN-Globo'),
        (u'Economia', u'http://feeds.dn.pt/DN-Economia'),
        (u'Ci\xeancia', u'http://feeds.dn.pt/DN-Ciencia'),
        (u'Artes', u'http://feeds.dn.pt/DN-Artes'),
        (u'TV & Media', u'http://feeds.dn.pt/DN-Media'),
        (u'Opini\xe3o', u'http://feeds.dn.pt/DN-Opiniao'),
        (u'Pessoas', u'http://feeds.dn.pt/DN-Pessoas')
        ]

Some of the articles are not extracted and this text shows up instead (translated from Portuguese):
"As an editorial choice, freedom of expression is total, without limitations, in the comment boxes open to the public provided by Diário de Notícias at www.dn.pt. The texts written there may at times contain content liable to offend the moral or ethical code of some readers, so Diário de Notícias does not recommend them to minors or to more sensitive people.
The opinions, information, arguments and language used by commenters in that space do not in any way reflect the editorial line or the journalistic work of Diário de Notícias. Participants are encouraged to respect the User Code of Conduct and the Terms of Use and Privacy Policy, which can be read at this address:
http://www.dn.pt/info/termosdeuso.aspx
Diário de Notícias reserves the right to take legal action or to provide the authorities with information allowing the identification of anyone who uses the comment boxes at www.dn.pt to commit or incite acts considered criminal under Portuguese law, namely insults, defamation, incitement to violence, disrespect for national symbols, promotion of racism, xenophobia and homophobia, or any others."

José Pinto
04-04-2013, 07:50 PM   #2
oneillpt
For some articles, auto_cleanup seems to be replacing the article content with the boilerplate editorial disclaimer. Try the version below, where auto_cleanup is disabled and keep_only_tags and remove_tags select the content instead. (The accented Unicode characters in the title caused problems for me. Put them back if they work for you.)

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1365070687(BasicNewsRecipe):
  title = u'Diario de Noticias'
  oldest_article = 7
  max_articles_per_feed = 100
  # auto_cleanup is disabled; with it, some articles came out as the disclaimer text
  #auto_cleanup = True
  # Keep only the div with id 'cln-esqmid'
  keep_only_tags = [dict(name='div', attrs={'id':'cln-esqmid'}) ]
  # Remove the table with class 'TabFerramentasInf' from what is kept
  remove_tags    = [ dict(name='table', attrs={'class':'TabFerramentasInf'}) ]

  feeds = [(u'Portugal', u'http://feeds.dn.pt/DN-Portugal'), 
    (u'Globo', u'http://feeds.dn.pt/DN-Globo'), 
    (u'Economia', u'http://feeds.dn.pt/DN-Economia'), 
    (u'Ci\xeancia', u'http://feeds.dn.pt/DN-Ciencia'), 
    (u'Artes', u'http://feeds.dn.pt/DN-Artes'), 
    (u'TV & Media', u'http://feeds.dn.pt/DN-Media'), 
    (u'Opini\xe3o', u'http://feeds.dn.pt/DN-Opiniao'), 
    (u'Pessoas', u'http://feeds.dn.pt/DN-Pessoas')
    ]
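
To make the effect of the two filters easier to see, here is a rough sketch that imitates them on a tiny page using only Python's standard library. It is not calibre's implementation. The sample markup, the apply_filters helper and the header/footer divs are made up for illustration; only the id 'cln-esqmid' and the class 'TabFerramentasInf' come from the recipe above, and the real dn.pt page layout may differ.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page (an assumption, not real dn.pt markup).
SAMPLE = """<html><body>
<div id="header">Site navigation</div>
<div id="cln-esqmid">
  <h1>Headline</h1>
  <p>Article text.</p>
  <table class="TabFerramentasInf"><tr><td>Share and print tools</td></tr></table>
</div>
<div id="footer">Comment disclaimer</div>
</body></html>"""


def apply_filters(markup, keep_id, remove_class):
    """Hypothetical helper: keep one div by id, then drop tables by class."""
    root = ET.fromstring(markup)
    # Analogue of keep_only_tags: everything outside this div is discarded.
    kept = root.find(".//div[@id='%s']" % keep_id)
    # Analogue of remove_tags: strip matching tables from the kept subtree.
    for parent in list(kept.iter()):
        for child in list(parent):
            if child.tag == 'table' and child.get('class') == remove_class:
                parent.remove(child)
    return kept


article = apply_filters(SAMPLE, 'cln-esqmid', 'TabFerramentasInf')
# Collapse whitespace so the remaining text is easy to inspect.
text = ' '.join(''.join(article.itertext()).split())
print(text)
```

In this sample the navigation, the disclaimer div and the tools table are all gone, and only the headline and the article paragraph remain. If the article body on the real site sits outside the kept div, it would be dropped in the same way.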
04-05-2013, 06:26 AM   #3
josepinto
Thanks,

All text is extracted now.

Several sections could also be added but I personally do not use them:

Desporto:
http://feeds.dn.pt/DN-Desporto

Cartaz:
http://feeds.dn.pt/DN-Cartaz

Política:
http://feeds.dn.pt/DN-Politica

Gente:
http://feeds.dn.pt/DN-Gente

Galerias:
http://feeds.dn.pt/DN-Galeria
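
For anyone who does want these sections, they could be appended to the recipe's feeds list. A minimal sketch, assuming the feed URLs listed in this thread are still valid; the names BASE_URL, core_feeds and extra_feeds are only illustrative and are not part of calibre:

```python
# Common prefix of the feed URLs quoted in this thread.
BASE_URL = 'http://feeds.dn.pt/DN-'

# The eight sections already in the recipe.
core_feeds = [
    (u'Portugal', BASE_URL + 'Portugal'),
    (u'Globo', BASE_URL + 'Globo'),
    (u'Economia', BASE_URL + 'Economia'),
    (u'Ci\xeancia', BASE_URL + 'Ciencia'),
    (u'Artes', BASE_URL + 'Artes'),
    (u'TV & Media', BASE_URL + 'Media'),
    (u'Opini\xe3o', BASE_URL + 'Opiniao'),
    (u'Pessoas', BASE_URL + 'Pessoas'),
]

# The optional sections listed above.
extra_feeds = [
    (u'Desporto', BASE_URL + 'Desporto'),
    (u'Cartaz', BASE_URL + 'Cartaz'),
    (u'Pol\xedtica', BASE_URL + 'Politica'),
    (u'Gente', BASE_URL + 'Gente'),
    (u'Galerias', BASE_URL + 'Galeria'),
]

# calibre expects a list of (title, url) pairs in the recipe's feeds attribute.
feeds = core_feeds + extra_feeds
```

The same lists can be placed inside the recipe class body, with feeds = core_feeds + extra_feeds as the final assignment.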

Side note: Terms of use of the feeds of this newspaper: http://www.dn.pt/info/termosdeuso.aspx

José Pinto
04-05-2013, 07:09 AM   #4
josepinto
Not all text extracted

Hi again,

In several articles, only the title and the first paragraph of the text, which is in bold, are extracted, but not the rest of the article.

I tried adding use_embedded_content = False to the recipe, but it doesn't change anything.

José Pinto