12-28-2011, 03:25 PM   #6
clintiepoo
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
Quote:
Originally Posted by Barty
You could probably do it with preprocess_regexps, but why? Are you sure you want to remove all links? You're going to get missing text, e.g.,

As we (link)argued in this column last month(link), the current situation is...

becomes

As we, the current situation is...

If you want to remove certain links, then target them, for example

remove_tags = [dict(name='a', attrs={'href': re.compile(r'doubleclick\.net', re.I)})]

to remove doubleclick links
I understand your logic, and I've now been able to get rid of all the links individually by using tags within them. But I'm still having trouble getting the image out of this:

Code:
<a href="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/0/c1/0c16b29b-e8fc-55a6-8d20-f9ba420f8230/4ef38d670cd77.image.jpg" rel="facebox">
            
                <img id="img-holder" src="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/0/c1/0c16b29b-e8fc-55a6-8d20-f9ba420f8230/4ef38d679d43b.preview-300.jpg" alt=" " width="300px">
            </a>
The image is buried inside the link. I used to use "img-holder" to isolate it and keep just that tag, but I'm not sure how to do that now. If I try to keep the whole link (i.e., not remove it), the whole story blows up and fails to download. I'm getting close, just not there.
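For reference, here is a rough sketch of what I'm trying to achieve: replace the whole facebox link with only the img tag inside it. It uses the (compiled pattern, function) pair shape that preprocess_regexps expects, but I have not tried it in the actual recipe. The names FACEBOX_LINK, keep_only_image and apply_rules are made up for illustration, apply_rules is only a stand-in so the idea can be tried outside calibre, and the example.com URLs are placeholders for the real ones.

```python
import re

# Hypothetical pattern: an <a ... rel="facebox"> link wrapping a single <img>.
# Assumes the link contains nothing but whitespace and that one image.
FACEBOX_LINK = re.compile(
    r'<a\b[^>]*\brel="facebox"[^>]*>\s*(<img\b[^>]*>)\s*</a>',
    re.IGNORECASE | re.DOTALL,
)


def keep_only_image(match):
    """Return just the <img> tag captured inside the link."""
    return match.group(1)


# Same shape as a recipe's preprocess_regexps: (compiled pattern, function) pairs.
preprocess_regexps = [(FACEBOX_LINK, keep_only_image)]


def apply_rules(html, rules=preprocess_regexps):
    """Stand-in for the recipe: apply each substitution rule in order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# Placeholder markup with the same structure as the snippet above.
sample = '''<a href="http://example.com/full.image.jpg" rel="facebox">

                <img id="img-holder" src="http://example.com/preview-300.jpg" alt=" " width="300px">
            </a>'''

result = apply_rules(sample)
print(result)
```

Links without rel="facebox" are left alone by this pattern, so the other links would still need to be removed separately with remove_tags.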