I've been using a recipe to parse jg-tc.com for a while now. Here are the tags I'm keeping/removing:
keep_only_tags = [
    dict(name='h1'),
    dict(name='span', attrs={'class':'updated'}),
    dict(name='span', attrs={'class':'fn'}),
    dict(name='img', attrs={'id':'img-holder'}),
    dict(name='span', attrs={'id':'gallery-cutline'}),
    dict(name='div', attrs={'id':'blox-story-text'})
]
remove_tags = [
    dict(name='a')
]
The problem is that the site changed something, apparently related to a Facebook login, so now the titles of all the stories read like:
Login or
Title
Would somebody help me try to pull the title differently, or remove this "Login or" tag or text after the fact? I've tried several things myself and I'm struggling. Thanks!
Here's a sample webpage:
http://jg-tc.com/news/local/school-a...871e3ce6c.html
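
For the "remove it after the fact" option, here is a rough sketch of the kind of cleanup I have in mind. It is plain Python and not tied to the recipe yet: the function name strip_login_prefix is made up for illustration, and I'm assuming the stray text is always exactly "Login or" at the very start of the title, which I haven't confirmed against the page source.

```python
import re

# Hypothetical helper, not part of any recipe API.
# Assumption: the unwanted text is the literal words "Login or" at the start
# of the title, possibly followed by a newline or spaces before the real title.
LOGIN_PREFIX = re.compile(r'^\s*Login\s+or\b\s*', re.IGNORECASE)

def strip_login_prefix(title):
    """Return the title with a leading "Login or" removed, if present."""
    return LOGIN_PREFIX.sub('', title).strip()

if __name__ == '__main__':
    print(strip_login_prefix('Login or\nTitle'))
```

If something like this works, the same pattern could presumably be applied wherever the recipe gets a chance to touch the page text before the h1 is kept, but I don't know which hook is the right place for that.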