Quote:
Originally Posted by Starson17
Post your code. It should have worked.
Here are my tags. I'm working on the img and the fn.
Code:
keep_only_tags = [
    dict(name='h1'),
    dict(name='span', attrs={'class':'updated'}),
    dict(name='span', attrs={'class':'fn'}),
    dict(name='img', attrs={'id':'img-holder'}),
    dict(name='span', attrs={'id':'gallery-cutline'}),
    dict(name='div', attrs={'id':'blox-story-text'})
]
These tags are listed in the order they appear on the page, which makes the previous-sibling logic a little more confusing. I was trying to insert the fn span first, then the image, into the span that precedes them. The fn tag works, but the image gets lost.
Code:
def preprocess_html(self, soup):
    # print 'the soup is: ', soup
    for fn_tag in soup.findAll("span", {"class" : "fn"}):
        previousSibling_tag = fn_tag.previousSibling
        if previousSibling_tag.name == 'span':
            new_tag = Tag(soup, 'p')
            new_tag.insert(0, fn_tag)
            previousSibling_tag.insert(1, new_tag)
    for img_tag in soup.findAll('img'):
        previousSibling_tag = img_tag.previousSibling
        if previousSibling_tag.name == 'span':
            new_tag = Tag(soup, 'p')
            new_tag.insert(0, img_tag)
            previousSibling_tag.insert(2, new_tag)
    return soup
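One thing I still need to rule out is whether previousSibling is handing back a whitespace text node instead of the span I expect. I have not confirmed that this is what loses the image. The sketch below is not recipe code: it uses the standard library's xml.dom.minidom as a stand-in for the soup, with made-up markup in the same tag order, and previous_element is a hypothetical helper name. It only shows how a newline between tags becomes the "previous sibling" and how to step back past it.

```python
from xml.dom import minidom

# Made-up markup in the same order as keep_only_tags, including the
# newlines between tags that a real page usually contains.
doc = minidom.parseString(
    '<body>\n'
    '<span class="updated">today</span>\n'
    '<span class="fn">A. Writer</span>\n'
    '<img id="img-holder" src="x.jpg"/>\n'
    '</body>'
)

def previous_element(node):
    """Hypothetical helper: walk back over text nodes to the nearest element."""
    sib = node.previousSibling
    while sib is not None and sib.nodeType != sib.ELEMENT_NODE:
        sib = sib.previousSibling
    return sib

img = doc.getElementsByTagName('img')[0]
raw = img.previousSibling        # the newline text node, not a tag
prev = previous_element(img)     # the nearest preceding element

print(repr(raw.nodeValue))
print(prev.tagName, prev.getAttribute('class'))
```

If the same thing happens in the recipe, the check on .name would be looking at a text node rather than a span, so a similar loop over previousSibling in preprocess_html might be worth trying.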