MobileRead Forums - View Single Post - Having problems using h1 tag since website changed

clintiepoo · 12-22-2011, 02:26 PM

I've been using a recipe to parse jg-tc.com for a while now. Here are the tags I'm keeping/removing:

keep_only_tags = [
    dict(name='h1'),
    dict(name='span', attrs={'class':'updated'}),
    dict(name='span', attrs={'class':'fn'}),
    dict(name='img', attrs={'id':'img-holder'}),
    dict(name='span', attrs={'id':'gallery-cutline'}),
    dict(name='div', attrs={'id':'blox-story-text'})
]
remove_tags = [
    dict(name='a')
]

The problem is that the site recently changed something, apparently related to a Facebook login, so now the titles of all the stories read like:

Login or
Title

Could somebody help me pull the title differently, or remove this "Login or" tag or text after the fact? I've tried several things myself and I'm struggling. Thanks!

Here's a sample webpage:
http://jg-tc.com/news/local/school-a...871e3ce6c.html
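
For the "remove it after the fact" route, here is a minimal sketch of the text cleanup, assuming the unwanted fragment always appears as the literal words "Login or" at the very start of the title text. The page's actual markup isn't shown here, so that is a guess, and strip_login_prefix and LOGIN_PREFIX are hypothetical names, not part of calibre.

```python
import re

# Assumption: the stray fragment is the literal text "Login or" at the
# start of the title, possibly followed by a line break or spaces.
LOGIN_PREFIX = re.compile(r'^\s*Login or\s*', re.IGNORECASE)

def strip_login_prefix(title):
    """Drop a leading "Login or" fragment and collapse leftover whitespace."""
    cleaned = LOGIN_PREFIX.sub('', title)
    return ' '.join(cleaned.split())
```

In a recipe, something like this could be applied to the h1 text from a hook such as preprocess_html(self, soup), though where the fragment really sits in the page would need to be checked against the live HTML first.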
