06-16-2010, 01:35 PM   #2115
lordvetinari2
Quote:
Originally Posted by kiklop74
Then you just keep adding classes like this:

Code:
    keep_only_tags = [dict(attrs={'class':['content-noticia-title','artigoHeader','ECOSFERA_MANCHETE','noticia','textoPrincipal','ECOSFERA_texto_01']})]
Thanks again. I fixed a small additional problem with the Desporto section. We are almost there.

The only section that is still not displaying correctly is Ecosfera. Check this link: http://ecosfera.publico.pt/noticia.aspx?id=1442165

There are a few elements there (ECOSFERA_polaroid and ECOSFERA_link_rel) that I am trying to remove, but inside these parent elements there are child elements that also use ECOSFERA_texto_01. How do I say "keep element X, as long as X is not within Y"?
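
To make the "keep X, as long as X is not within Y" rule concrete, here is a minimal sketch of the logic using only Python's standard library. It is not calibre's own mechanism: the helper names (`classes`, `keep_unless_inside`) and the sample markup are hypothetical, and only the class names come from the recipe. In an actual recipe the same idea would presumably be applied to the parsed page, for example in a `preprocess_html` override; that wiring is an assumption and is not shown here.

```python
import xml.etree.ElementTree as ET

KEEP = {'ECOSFERA_texto_01'}                         # class to keep (from the recipe)
EXCLUDE = {'ECOSFERA_polaroid', 'ECOSFERA_link_rel'}  # containers to skip (from the recipe)


def classes(el):
    """Return the set of CSS classes on an element (hypothetical helper)."""
    return set(el.get('class', '').split())


def keep_unless_inside(root, keep, exclude):
    """Collect elements carrying a `keep` class, skipping any subtree
    whose root carries an `exclude` class."""
    kept = []

    def walk(el):
        cls = classes(el)
        if cls & exclude:
            return            # do not descend into an excluded container
        if cls & keep:
            kept.append(el)   # keep the whole element, children included
            return
        for child in el:
            walk(child)

    walk(root)
    return kept


# Hypothetical, simplified markup that mimics the nesting described above.
SAMPLE = """<div>
  <div class="ECOSFERA_texto_01">Article body</div>
  <div class="ECOSFERA_polaroid"><span class="ECOSFERA_texto_01">Photo caption</span></div>
  <div class="ECOSFERA_link_rel"><a class="ECOSFERA_texto_01">Related link</a></div>
</div>"""

kept = keep_unless_inside(ET.fromstring(SAMPLE), KEEP, EXCLUDE)
kept_text = [el.text for el in kept]
print(kept_text)
```

The key design choice is that the walk stops at an excluded container before it ever looks at the children, so a child with ECOSFERA_texto_01 inside ECOSFERA_polaroid is never collected.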

Finally, the links in the bottom right corner under "Legislação" should not appear either. They are not inside any specifically named div or table, so I do not know how to target them. I cannot use remove_tags_after on the element that precedes them, because that is ECOSFERA_texto_01, and that class is used in more than one place. It would start deleting at the first instance of the class, when I need it to delete everything after the last instance.
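
The "delete everything after the last instance" behaviour can be sketched as follows, again with the standard library only. This is an illustration of the idea under stated assumptions, not a calibre feature: `remove_after_last`, the `wrapper` class, and the sample links are all hypothetical. The approach is to find the last matching element in document order, then walk up towards the root and drop every later sibling at each level.

```python
import xml.etree.ElementTree as ET


def remove_after_last(root, class_name):
    """Remove everything that follows the last element (in document order)
    carrying `class_name`. Hypothetical helper, modifies `root` in place."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    matches = [el for el in root.iter()
               if class_name in el.get('class', '').split()]
    if not matches:
        return root           # nothing to anchor on, leave the tree alone

    node = matches[-1]
    while node in parent_of:  # climb until we reach the root
        parent = parent_of[node]
        node.tail = None      # drop stray text after this node
        siblings = list(parent)
        for later in siblings[siblings.index(node) + 1:]:
            parent.remove(later)
        node = parent
    return root


# Hypothetical markup: unnamed links trail the last ECOSFERA_texto_01 block.
SAMPLE = """<div>
  <div class="ECOSFERA_texto_01">First paragraph</div>
  <div class="wrapper">
    <div class="ECOSFERA_texto_01">Last paragraph</div>
    <a href="#">Legislation link 1</a>
  </div>
  <a href="#">Legislation link 2</a>
</div>"""

root = remove_after_last(ET.fromstring(SAMPLE), 'ECOSFERA_texto_01')
remaining = [el.text for el in root.iter() if el.text and el.text.strip()]
print(remaining)
```

Because the anchor is the last match rather than the first, the earlier ECOSFERA_texto_01 blocks are left untouched and only the trailing, unnamed links are removed.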

The current recipe:

Code:
    keep_only_tags = [dict(attrs={'class':['content-noticia-title','artigoHeader','ECOSFERA_MANCHETE','noticia','textoPrincipal','ECOSFERA_texto_01']})]
    remove_tags    = [dict(attrs={'class':['options','subcoluna','ECOSFERA_link_rel','ECOSFERA_polaroid']})]