03-02-2011, 04:26 PM   #13
Starson17
Quote:
Originally Posted by Starson17 View Post
I took a look at your site. The parent of your img tag is the body tag. You can use that tag or play with the next sibling and previous sibling tags. You can't just tell BS to put the img inside another tag. You can create a p tag (with "Tag") and you can put your found img tag inside it, but now it's no longer in your page. It's hanging free and that p tag with the img inside it needs to be put into a tag in the tree forming your page. You'll need to use insert or replaceWith from BS to do that. Study the code I gave you, then study the BS docs at the link above to see how it's done. I suspect you'll come to understand why this is a pain to solve, and that's where we started.
I had a few more minutes, so I thought I'd explain what's happening in the code I posted so you can modify it. BeautifulSoup makes it easy to find your img tag. The problem is what to do with it. Let's say you create a p tag, then put your img tag inside. This actually moves the img tag away from your page, and puts it into your disconnected created-from-scratch p tag. Your problem now is that you've lost track of where the img tag came from. The solution is to figure out where it came from before you remove it.
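A rough sketch of that "the img leaves the page" effect, assuming the current bs4 package rather than the BeautifulSoup 3 this was written against (so `soup.new_tag` stands in for creating a tag with "Tag"). The HTML snippet and variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up page: an img wrapped in a link.
html = "<body><a href='big.jpg'><img src='small.jpg'/></a></body>"
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img")
parent_tag = img.parent      # remember where the img came from BEFORE moving it

new_p = soup.new_tag("p")    # created from scratch, not attached to the page
new_p.append(img)            # this moves the img out of the page and into new_p

# The page no longer contains the img; it now lives only inside new_p.
img_still_in_page = soup.find("img") is not None
```

After this runs, `parent_tag` is the only handle left on the spot where the img used to be, which is why it has to be saved first.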

The code I posted dealt with this issue by replacing the parent_tag (for an img tag embedded in an a tag). When the img tag was moved into the new p tag, the a tag stayed behind in the page soup, marking the spot. The new p tag, with the img tag inside it, was then used to replace that parent a tag.
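That first case might look something like this. It's a sketch, not the recipe's actual code: it assumes bs4 (`new_tag`, `replace_with`) instead of the BS3 calls, and the sample HTML is invented:

```python
from bs4 import BeautifulSoup

# Made-up page: an img inside an a tag, followed by a caption.
html = ("<body><a href='big.jpg'><img src='small.jpg'/></a>"
        "<span>caption</span></body>")
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img")
parent_tag = img.parent              # the a tag; note it before moving the img

if parent_tag.name == "a":
    new_p = soup.new_tag("p")        # disconnected p tag
    new_p.append(img)                # img moves out of the a tag into new_p
    parent_tag.replace_with(new_p)   # new_p takes the a tag's place in the page

first_child = soup.body.contents[0]  # now the p tag holding the img
```

The a tag is dropped from the page here; only the p tag with the img inside it remains in its place.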

That was the first case in the if statement. The second case that author dealt with was where the parent tag was a p tag, not an a tag. Now you might ask why he was putting a p tag around the img if the parent tag was already a p tag. That's because the img tag wasn't the only content in the parent p tag. There was also text, and the author was trying to get the img onto a new line, separate from the other text.

To do that, he created a div tag and a p tag. He put the img tag into the new p tag and put that into the new div tag. Now he's got a disconnected div tag with the p tag and img tag inside it. At this point, the img tag (and its image) is no longer on the page, so he's going to have to put it back. The text inside the parent tag is still in the page. Next, he replaces the parent tag with the div tag. This removes the parent tag from the page, but it's not lost: he still has a reference to that parent tag (minus the img tag he previously removed from it). He sticks the parent tag into the div tag, which is now in the page soup, with "new_div.insert(1,parent_tag)". The index of 1 means that he puts the parent tag (with the text) in after the new p tag holding the img (which has index 0). The result is a new p tag around the img tag, followed by the original text in the original parent tag.
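Here's roughly how that second case plays out, step by step. Again a sketch under assumptions: bs4 rather than BS3, and a made-up page with an img sitting in the middle of some text:

```python
from bs4 import BeautifulSoup

# Made-up page: a p tag holding text with an img in the middle of it.
html = "<body><p>Some text <img src='pic.jpg'/> more text</p></body>"
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img")
parent_tag = img.parent                # the original p tag (text + img)

if parent_tag.name == "p":
    new_div = soup.new_tag("div")
    new_p = soup.new_tag("p")
    new_p.append(img)                  # img leaves the page; the text stays put
    new_div.append(new_p)              # disconnected div > p > img
    parent_tag.replace_with(new_div)   # div goes into the page, parent_tag comes out
    new_div.insert(1, parent_tag)      # original p (text only) goes in after new_p

children = [child.name for child in new_div.contents]
```

The div ends up with two p tags: index 0 holds the img and index 1 holds the original text.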

So now you know why I said it's not that easy, and why many recipes don't bother to do this cleanup. All you have to do is figure out how to apply this to your page.

Last edited by Starson17; 03-02-2011 at 04:33 PM.