Quote:
Originally Posted by kiklop74
Then you just keep adding classes like this:
Code:
keep_only_tags = [dict(attrs={'class':['content-noticia-title','artigoHeader','ECOSFERA_MANCHETE','noticia','textoPrincipal','ECOSFERA_texto_01']})]
Thanks again. I fixed a little additional problem with the Desporto section. We are almost there.
The only section that is not displaying perfectly is the Ecosfera. Check this link:
http://ecosfera.publico.pt/noticia.aspx?id=1442165
There are a few elements there (ECOSFERA_polaroid and ECOSFERA_link_rel) that I am trying to remove, but within these parent elements there are child elements that also use ECOSFERA_texto_01. How do I say "keep element X, as long as X is not within Y"?
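To make the condition concrete, here is a minimal sketch in plain Python of "keep X unless it sits inside Y". It uses only the standard library and an invented, simplified piece of markup, not the real page. The helper names (has_class, keep_unless_inside) are made up for the illustration, and this is not a recipe option; in a recipe, logic like this would presumably have to go into a preprocess_html override rather than keep_only_tags, which I have not tested.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample; the real page is messier HTML.
SAMPLE = """
<div>
  <div class="ECOSFERA_texto_01">article body</div>
  <div class="ECOSFERA_polaroid">
    <div class="ECOSFERA_texto_01">photo caption</div>
  </div>
  <div class="ECOSFERA_link_rel">
    <span class="ECOSFERA_texto_01">related link</span>
  </div>
</div>
"""


def has_class(elem, name):
    """True if `name` is one of the element's space-separated classes."""
    return name in elem.get("class", "").split()


def keep_unless_inside(root, keep, excluded):
    """Return elements with class `keep` that have no ancestor
    carrying any of the classes in `excluded`."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    kept = []
    for elem in root.iter():
        if not has_class(elem, keep):
            continue
        # Walk up until we hit an excluded ancestor or run out of ancestors.
        ancestor = parent_of.get(elem)
        while ancestor is not None and not any(
            has_class(ancestor, cls) for cls in excluded
        ):
            ancestor = parent_of.get(ancestor)
        if ancestor is None:
            kept.append(elem)
    return kept


root = ET.fromstring(SAMPLE)
kept = keep_unless_inside(
    root, "ECOSFERA_texto_01", ["ECOSFERA_polaroid", "ECOSFERA_link_rel"]
)
print([e.text for e in kept])
```

On the sample above, only the "article body" element survives; the caption and the related link are skipped because one of their ancestors carries an excluded class.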
Finally, the links in the bottom right corner under "Legislação" should not appear either. They are not in any specifically named div or table, so I do not know how to deal with them. I cannot use remove_tags_after on the preceding element, because that is ECOSFERA_texto_01, and that class is used in more than one place. It would start deleting at the first instance of this class, whereas I need it to delete everything after the last instance.
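As an illustration of the "delete everything after the last instance" behaviour, here is a small standard-library sketch. The markup is invented and simplified, the function name remove_after_last is hypothetical, and it is not something the recipe options provide directly; it only shows the intended logic.

```python
import xml.etree.ElementTree as ET

# Invented sample: two body blocks followed by an unnamed table of links.
SAMPLE = """
<div>
  <div class="ECOSFERA_texto_01">first paragraph</div>
  <div class="ECOSFERA_texto_01">last paragraph</div>
  <table><tr><td><a href="#">unwanted link</a></td></tr></table>
</div>
"""


def remove_after_last(root, cls):
    """Drop everything that follows the last element carrying class `cls`."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    matches = [e for e in root.iter() if cls in e.get("class", "").split()]
    if not matches:
        return
    # iter() walks in document order, so the final match is the last instance.
    node = matches[-1]
    # Climb towards the root, removing later siblings at every level.
    while node in parent_of:
        parent = parent_of[node]
        siblings = list(parent)
        for later in siblings[siblings.index(node) + 1:]:
            parent.remove(later)
        node.tail = None  # also drop loose text that followed this node
        node = parent


root = ET.fromstring(SAMPLE)
remove_after_last(root, "ECOSFERA_texto_01")
print(ET.tostring(root, encoding="unicode"))
```

Both ECOSFERA_texto_01 blocks remain in the sample, and the trailing table is gone, because removal starts from the last match rather than the first.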
The current recipe:
Code:
keep_only_tags = [dict(attrs={'class':['content-noticia-title','artigoHeader','ECOSFERA_MANCHETE','noticia','textoPrincipal','ECOSFERA_texto_01']})]
remove_tags = [dict(attrs={'class':['options','subcoluna','ECOSFERA_link_rel','ECOSFERA_polaroid']})]