Barty,
Thanks for the help. After some work, I've reorganized my code so that it does what I want, as you suggested. The only problem now is that I've lost my images. Each picture is in an img tag nested inside a link, and since I'm batch-removing all links (the dict(name='a') entry in remove_tags), the image is removed along with the link. I'm not sure what to do. Here's what I have.
Code:
remove_tags_before = dict(name='div', attrs={'id':'blox-left-col'})
remove_tags_after = dict(name='div', attrs={'id':'blox-left-col'})
keep_only_tags = [
    # dict(name='div', attrs={'id':'blox-left-col'}),
    # dict(name='span', attrs={'class':'updated'}),
    # dict(name='span', attrs={'class':'fn'}),
    # dict(name='img', attrs={'id':'img-holder'}),
    # dict(name='span', attrs={'id':'gallery-cutline'}),
    # dict(name='div', attrs={'id':'blox-story-text'})
]
remove_tags = [
    dict(name='a'),
    dict(name='ul', attrs={'id':'blox-body-nav'}),
    dict(name='p', attrs={'class':'story-keywords moz-border'}),
    dict(name='div', attrs={'class':'clear'}),
    dict(name='div', attrs={'class':'hide'}),
    dict(name='p', attrs={'id':'story-tools'}),
    dict(name='div', attrs={'id':'latest-by-section'}),
    dict(name='span', attrs={'class':'bookmark hide'}),
    dict(name='span', attrs={'class':'hide source-org vcard'}),
    dict(name='dl', attrs={'id':'story-font-size'}),
    dict(name='div', attrs={'class':'article-share-top'}),
    dict(name='span', attrs={'id':'pictopiaURL'}),
    dict(name='span', attrs={'id':'siteHost'}),
    dict(name='span', attrs={'id':'mycaptureURL'}),
    dict(name='span', attrs={'id':'mycapturePricingSheet'}),
    dict(name='div', attrs={'class':'photo-cutline'}),
    dict(name='div', attrs={'class':'blox-thumb-container'})
]
Here's how the website has this tag set up:
<a href="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/1/93/19381b50-904a-5d15-91d0-9183c727b977/4eee65251d356.image.jpg" rel="facebox">
<img id="img-holder" src="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/1/93/19381b50-904a-5d15-91d0-9183c727b977/4eee6525d4f59.preview-300.jpg" alt=" " width="300px"/>
</a>
I used to just pull the "img-holder" tag out by itself, but I can't do that now. Any ideas?
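One direction that might work, sketched here outside calibre using only Python's standard library: stop listing 'a' in remove_tags and strip the links in a separate step that keeps any img found inside them. The names LinkStripper and strip_links_keep_images are made up for illustration, and wiring this into a recipe (for example through a hook such as preprocess_raw_html) is an assumption I haven't tested.

```python
from html.parser import HTMLParser


class LinkStripper(HTMLParser):
    """Drop every <a> element and its text, but keep any <img> nested inside.

    Hypothetical helper for illustration; not part of calibre.
    """

    def __init__(self):
        # Leave entities untouched so they can be re-emitted verbatim.
        super().__init__(convert_charrefs=False)
        self.out = []        # pieces of the rewritten HTML
        self.link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.link_depth += 1
        elif tag == 'img' or self.link_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> arrive here.
        if tag == 'img' or (self.link_depth == 0 and tag != 'a'):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'a':
            self.link_depth = max(0, self.link_depth - 1)
        elif self.link_depth == 0:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        if self.link_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.link_depth == 0:
            self.out.append('&%s;' % name)

    def handle_charref(self, name):
        if self.link_depth == 0:
            self.out.append('&#%s;' % name)


def strip_links_keep_images(html):
    """Return html with all links removed except the images they wrap."""
    parser = LinkStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


# Shortened stand-in for the markup shown above (URLs abbreviated).
sample = ('<p>Story <a href="big.jpg" rel="facebox">'
          '<img id="img-holder" src="small.jpg" alt=" " width="300px"/>'
          '</a> text <a href="/more">more</a></p>')
print(strip_links_keep_images(sample))
```

If something like this holds up, the img-holder image would survive while plain text links still disappear, which is the behavior I was getting from dict(name='a') minus the lost pictures.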