12-26-2011, 12:27 AM   #4
clintiepoo
Device: kindle
Barty,

Thanks for the help. After some work, I've reorganized my code as you suggested, and it now does what I want. The only problem is that I've lost my images. The picture is in an img tag nested inside a link (an a tag), and since I'm batch-removing all links with remove_tags, the image goes with them. I'm not sure what to do. Here's what I have.

Code:
    remove_tags_before = dict(name='div', attrs={'id':'blox-left-col'})
    remove_tags_after = dict(name='div', attrs={'id':'blox-left-col'})
    keep_only_tags = [ 
#                        dict(name='div', attrs={'id':'blox-left-col'}),
#                        dict(name='span', attrs={'class':'updated'}),
#                        dict(name='span', attrs={'class':'fn'}),
#                        dict(name='img', attrs={'id':'img-holder'}),
#                        dict(name='span', attrs={'id':'gallery-cutline'}),
#                        dict(name='div', attrs={'id':'blox-story-text'})
                     ]
    remove_tags = [
                     dict(name='a'),
                     dict(name='ul', attrs={'id':'blox-body-nav'}),
                     dict(name='p', attrs={'class':'story-keywords moz-border'}),
                     dict(name='div', attrs={'class':'clear'}),
                     dict(name='div', attrs={'class':'hide'}),
                     dict(name='p', attrs={'id':'story-tools'}),
                     dict(name='div', attrs={'id':'latest-by-section'}),
                     dict(name='span', attrs={'class':'bookmark hide'}),
                     dict(name='span', attrs={'class':'hide source-org vcard'}),
                     dict(name='dl', attrs={'id':'story-font-size'}),
                     dict(name='div', attrs={'class':'article-share-top'}),
                     dict(name='span', attrs={'id':'pictopiaURL'}),
                     dict(name='span', attrs={'id':'siteHost'}),
                     dict(name='span', attrs={'id':'mycaptureURL'}),
                     dict(name='span', attrs={'id':'mycapturePricingSheet'}),
                     dict(name='div', attrs={'class':'photo-cutline'}),
                     dict(name='div', attrs={'class':'blox-thumb-container'})
                  ]
Here's how the website has this tag set up:

<a href="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/1/93/19381b50-904a-5d15-91d0-9183c727b977/4eee65251d356.image.jpg" rel="facebox">

<img id="img-holder" src="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/1/93/19381b50-904a-5d15-91d0-9183c727b977/4eee6525d4f59.preview-300.jpg" alt=" " width="300px"/>
</a>

I used to just pull the "img-holder" tag out by itself, but I can't do that now. Any ideas?
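
One thing that might work, as a rough and untested-in-calibre sketch: unwrap the image from its link in the raw HTML before the links get stripped, so that removing every a tag no longer takes the img with it. The names LINKED_IMG and unwrap_linked_images below are made up for illustration, and the regex assumes the link contains nothing but the single img tag and whitespace, as in the snippet above.

```python
import re

# Hypothetical helper pattern: an <a ...> whose only content is one <img> tag
# (plus optional whitespace), followed by the closing </a>.
LINKED_IMG = re.compile(
    r'<a\b[^>]*>\s*(<img\b[^>]*>)\s*</a>',
    re.IGNORECASE | re.DOTALL,
)


def unwrap_linked_images(html):
    """Replace each <a><img .../></a> with just the <img .../> tag."""
    return LINKED_IMG.sub(lambda match: match.group(1), html)


if __name__ == '__main__':
    # Shortened stand-in for the markup from the site (URLs are placeholders).
    sample = (
        '<a href="http://example.com/full.jpg" rel="facebox">\n\n'
        '<img id="img-holder" src="http://example.com/preview.jpg" '
        'alt=" " width="300px"/>\n</a>'
    )
    print(unwrap_linked_images(sample))
```

If I understand the recipe docs correctly, the same pattern and callable could go into the recipe as a preprocess_regexps entry, i.e. a (compiled regex, callable taking a match) pair, which I assume runs on the downloaded HTML before remove_tags is applied. Links that hold anything other than a lone image are left alone and would still be removed by dict(name='a').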