12-22-2011, 02:26 PM   #1
clintiepoo
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
Having problems using h1 tag since website changed

I've been using a recipe to parse jg-tc.com for a while now. Here are the tags I'm keeping/removing:

keep_only_tags = [
    dict(name='h1'),
    dict(name='span', attrs={'class':'updated'}),
    dict(name='span', attrs={'class':'fn'}),
    dict(name='img', attrs={'id':'img-holder'}),
    dict(name='span', attrs={'id':'gallery-cutline'}),
    dict(name='div', attrs={'id':'blox-story-text'})
]
remove_tags = [
    dict(name='a')
]

The problem is that the site recently changed something, apparently related to a Facebook login, so now the titles of all the stories read like:

Login or
Title

Could somebody help me either pull the title a different way, or remove this "Login or" tag or text after the fact? I've tried several things myself without success. Thanks!
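
For the "after the fact" route, here is a minimal sketch of the text cleanup only. It assumes the stray text always shows up as a literal "Login or" in front of the real title, which is a guess from the output above, not something checked against the page markup. The helper name strip_login_prefix is made up for illustration and is not part of calibre.

```python
import re

# Assumption: the unwanted text is a literal "Login or" at the very start
# of the title, followed by whitespace or a line break. The \b keeps a
# real headline such as "Login organizers ..." from being clipped.
_LOGIN_PREFIX = re.compile(r'^\s*Login\s+or\b\s*', re.IGNORECASE)


def strip_login_prefix(title):
    """Drop a leading 'Login or' and collapse leftover whitespace."""
    cleaned = _LOGIN_PREFIX.sub('', title)
    return ' '.join(cleaned.split())


print(strip_login_prefix('Login or\nTitle'))  # prints "Title"
```

In a recipe, something like this could probably be applied to the h1 text from a preprocess_html(self, soup) override, but whether that is the right hook for this particular site is untested.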

Here's a sample webpage:
http://jg-tc.com/news/local/school-a...871e3ce6c.html