Quote:
Originally Posted by Starson17
I had a few more minutes, so I thought I'd explain what's happening in the code I posted so you can modify it. BeautifulSoup makes it easy to find your img tag. The problem is what to do with it. Let's say you create a p tag, then put your img tag inside. This actually moves the img tag away from your page, and puts it into your disconnected created-from-scratch p tag. Your problem now is that you've lost track of where the img tag came from. The solution is to figure out where it came from before you remove it.
The code I posted dealt with this issue by replacing the parent_tag (for an img tag embedded in an a tag). The a tag remained in the page soup, and the new p tag, with the img tag inside it, was used to replace the parent a tag.
That was the first case in the if statement. The second case that author dealt with was where the parent tag was a p tag, not an a tag. Now you might ask why he was putting a p tag around the img if the parent tag was already a p tag. That's because the img tag wasn't the only content in the parent p tag. There was also text, and the author was trying to get the img onto a new line, separate from the other text.
To do that, he created a div tag and a p tag. He put the img tag into the new p tag and put that into the new div tag. Now he's got a disconnected div tag with the p tag and img tag inside it. At this point, the img tag (and its image) are no longer on the page. He's going to have to put it back. The text inside the parent tag is still in the page. Next, he replaces the parent tag with the div tag. This deletes the parent tag from the page, but it's not lost. He still has a reference to that parent tag (minus the img tag he previously removed from it.) He sticks the parent tag into the div tag that is now in the page soup with "new_div.insert(1,parent_tag)". The index of 1 means that he puts the parent tag (with the text) in after the img tag (which has index 0.) The result is a new p tag around the img tag, followed by the original text in the original parent tag.
So now you know why I said it's not that easy, and why many recipes don't bother to do this cleanup. All you have to do is figure out how to apply this to your page. 
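The two cases described above could be sketched roughly as follows. This is not the code from the earlier post. It is a minimal illustration written against the standalone bs4 package, which is an assumption: recipes using the older BeautifulSoup 3 API would write Tag(soup, 'p') in place of soup.new_tag('p') and replaceWith in place of replace_with. The function name isolate_images is hypothetical, and the sketch assumes at most one img per parent a tag.

```python
from bs4 import BeautifulSoup


def isolate_images(soup):
    """Put each img into its own p tag, handling two kinds of parent."""
    # find_all returns a list up front, so editing the tree in the loop is safe.
    for img in soup.find_all('img'):
        parent = img.parent
        if parent.name == 'a':
            # Case 1: the img sits inside an a tag.
            new_p = soup.new_tag('p')
            new_p.insert(0, img)          # moves the img out of the page
            parent.replace_with(new_p)    # the a tag was still in the page
        elif parent.name == 'p':
            # Case 2: the img shares a p tag with text.
            new_div = soup.new_tag('div')
            new_p = soup.new_tag('p')
            new_p.insert(0, img)          # img leaves the parent p
            new_div.insert(0, new_p)      # div is still disconnected here
            parent.replace_with(new_div)  # div goes where the parent was
            new_div.insert(1, parent)     # original text follows the img
    return soup
```

In both branches the key point is the same: the thing being replaced (the parent) is still attached to the page at the moment replace_with is called.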
|
I can understand what you're saying, but I can't seem to make it work for my website.

I wouldn't be offended if you just went ahead and made it work.
I'm trying something like this.
Code:
def preprocess_html(self, soup):
    for pix in soup.findAll('img'):
        new_tag = tag(soup, 'p')
        new_tag.insert(0, pix)
        pix.replaceWith(new_tag)
    return soup
In my mind, this should first find all the images, for example:
<img id="img-holder" src="xyz.jpg" alt=" " width="300px">
Next, it inserts a <p> tag around the img, returning:
<p><img id="img-holder" src="xyz.jpg" alt=" " width="300px"></p>
Finally, it takes all of this and puts it in the place where the img was before.
Unfortunately, it doesn't work at all. I understand it's not part of the page anymore, but I figured it would stick it on the bottom or something. When I do this, most of the articles fail to even download.
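For comparison, here is a minimal sketch of the same idea with the two steps swapped, so that the img is still attached to the page when it is replaced. In the order shown in the code above, insert(0, pix) moves the img into the new p tag first, so by the time replaceWith runs, the img's parent is the new p tag itself and not the page. The lowercase tag is also likely to be an undefined name, since the BeautifulSoup 3 class is spelled Tag. The sketch is written against the standalone bs4 package, which is an assumption. In a recipe using the older API, Tag(soup, 'p') and replaceWith would take the place of soup.new_tag('p') and replace_with. The function name wrap_images is hypothetical.

```python
from bs4 import BeautifulSoup


def wrap_images(soup):
    """Wrap every img in a new p tag, in place."""
    for pix in soup.find_all('img'):
        new_tag = soup.new_tag('p')
        # Step 1: put the empty p where the img currently is.
        # This detaches the img, but we still hold a reference to it.
        pix.replace_with(new_tag)
        # Step 2: move the detached img into the p that is now in the page.
        new_tag.insert(0, pix)
    return soup


if __name__ == '__main__':
    html = '<a href="x"><img id="img-holder" src="xyz.jpg" alt=" "/></a>'
    print(wrap_images(BeautifulSoup(html, 'html.parser')))
```

This only changes the order of operations. Whether it is enough for a given site depends on what else surrounds the img.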
A couple more questions:
Is there a way to step through this code and watch variables? I'm a lot more comfortable with VBA, where you can do that easily.
I am still thinking the problem is that my img's parent is the body. I don't know how to fix that.
Here's an example of what's around the img:
Code:
<div id="blox-story-photo-container">
<span id="pictopiaURL" title="http://pictopia.com/perl/ptp/heraldreview"></span>
<span id="siteHost" title="http://www.herald-review.com"></span>
<div id="blox-large-photo-page">
<a name="photos"></a>
<a href="http://bloximages.chicago2.vip.townnews.com/herald-review.com/content/tncms/assets/editorial/5/d5/290/5d529006-4493-11e0-b654-001cc4c002e0-revisions/4d6ddeaceb281.image.jpg" rel="facebox">
<img id="img-holder" src="http://bloximages.chicago2.vip.townnews.com/herald-review.com/content/tncms/assets/editorial/5/d5/290/5d529006-4493-11e0-b654-001cc4c002e0-revisions/4d6ddeacee535.preview-300.jpg" alt=" " width="300px">
</a>
<p class="photo-cutline">
<a id="gallery-buy" href="http://pictopia.com/perl/ptp/heraldreview?photo_name=676ecb86-4493-11e0-8e81-001cc4c002e0&title=Bill Cole, E'Twaun Moore&t_url=http://bloximages.chicago2.vip.townnews.com/herald-review.com/content/tncms/assets/editorial/5/d5/290/5d529006-4493-11e0-b654-001cc4c002e0-revisions/4d6ddeaceb281.image.jpg&fs_url=http://bloximages.chicago2.vip.townnews.com/herald-review.com/content/tncms/assets/editorial/5/d5/290/5d529006-4493-11e0-b654-001cc4c002e0-revisions/4d6ddeacef8bc.hires.jpg&pps=buynow" rel="external"><img src="global/resources/images/buy-photo.gif" alt="buy this photo"></a>
<span id="gallery-cutline">Purdue guard E'Twaun Moore, right, shoots over Illinois forward Bill Cole in the second half of an NCAA college basketball game in West Lafayette, Ind., Tuesday, March 1, 2011. Purdue defeated Illinois 75-67.</span>
<span class="clear"></span>
</p>
</div>
<div class="clear"></div>
</div>