Quote:
Originally Posted by TonytheBookworm
Alright, time to call the pro in. Starson17, if you have a second, can you look at this and tell me why the image doesn't show up?
|
It's staring you in the face, but you probably haven't run into it before.
Quote:
Not sure what the heck I'm doing wrong, because it looks like it should work.
When I used print statements, it showed my newtag as <p> </p>, but for whatever reason it never inserts the image data into that tag.
Thanks for the help in advance.
Spoiler:
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
from BeautifulSoup import BeautifulSoup, Tag

class RevistaMuyInteresante(BasicNewsRecipe):
    title = 'Revista Muy Interesante'
    __author__ = 'Jefferson Frantz'
    description = 'Revista de divulgacion'
    timefmt = ' [%d %b, %Y]'
    language = 'es_ES'
    #conversion_options = {'linearize_tables' : True}
    keep_only_tags = [dict(name='div', attrs={'class':['article']}),dict(name='td', attrs={'class':['txt_articulo']})]
    remove_tags = [
        dict(name=['object','link','script','ul'])
        ,dict(name='div', attrs={'id':['comment']})
        ,dict(name='td', attrs={'class':['buttonheading']})
        ,dict(name='div', attrs={'class':['tags_articles']})
    ]
    remove_tags_after = dict(name='div', attrs={'class':'tags_articles'})

    def nz_parse_section(self, url):
        soup = self.index_to_soup(url)
        div = soup.find(attrs={'class':'contenido'})
        current_articles = []
        for x in div.findAllNext(attrs={'class':['headline']}):
            a = x.find('a', href=True)
            if a is None:
                continue
            title = self.tag_to_string(a)
            url = a.get('href', False)
            if not url or not title:
                continue
            if url.startswith('/'):
                url = 'http://www.muyinteresante.es'+url
            self.log('\t\tFound article:', title)
            self.log('\t\t\t', url)
            current_articles.append({'title': title, 'url':url,
                                     'description':'', 'date':''})
        return current_articles

    def parse_index(self):
        feeds = []
        for title, url in [
            ('Historia',
             'http://www.muyinteresante.es/historia-articulos'),
        ]:
            articles = self.nz_parse_section(url)
            if articles:
                feeds.append((title, articles))
        return feeds

    def preprocess_html(self, soup):
        for img_tag in soup.findAll('img'):
            parent_tag = img_tag.parent
            data = img_tag
            img_tag.extract()
            newdiv = Tag(soup,'div')
            newtag = Tag(soup,'p')
            newtag.insert(0,data)
            newdiv.insert(0,newtag)
            parent_tag.insert(0,newdiv)
        return soup
I keep getting this crap:
newdiv is: <div><p></p></div>
data is: <img style="float: left;" alt="ivision-marrojo" height="225" width="300" src="/images/stories/historia/ivision-marrojo.jpg" />
newtag is: <p></p>
That tells me it is obviously picking up the image tag and has it stored,
but for whatever reason it refuses to insert it into the newdiv.
|
It's here (the last two characters, "/>"):
Code:
data is: <img style="float: left;" alt="ivision-marrojo" height="225" width="300" src="/images/stories/historia/ivision-marrojo.jpg" />
For some reason, this self-closing tag format makes Beautiful Soup very unhappy. Try this change to "data" in your recipe (it's a bit of a kludge to work into whatever you're doing):
Code:
data = img_tag
new_img_tag = Tag(soup,'img')
new_img_tag['src'] = img_tag['src']
data = new_img_tag
and drop the img_tag.extract() call. I doubt that's the most efficient way to do this, but I'm not really sure what you're doing with the recipe.
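For anyone who wants to try the idea outside calibre, here is a rough sketch of the same rebuild-and-wrap step. It uses the standard library's xml.etree.ElementTree in place of the BeautifulSoup 3 that the recipe imports, so it is not the recipe code itself, and the function name wrap_images and the sample markup are made up for illustration. Each img is rebuilt as a fresh element carrying only src, wrapped in div/p, and inserted at the front of its parent.

```python
import xml.etree.ElementTree as ET


def wrap_images(root):
    """Put a fresh <div><p><img src=.../></p></div> at the front of each img's parent.

    Hypothetical helper: it mirrors the steps in the reply above, but on an
    ElementTree tree rather than a BeautifulSoup 3 soup.
    """
    # ElementTree elements have no .parent attribute, so map child -> parent first.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Take a snapshot of the images so the newly created ones are not revisited.
    for img in list(root.iter('img')):
        parent = parent_of.get(img)
        if parent is None:
            continue  # the root itself is an <img>; there is nothing to insert into

        # Build a brand-new tag and copy only src, like
        # new_img_tag['src'] = img_tag['src'] in the reply.
        new_img = ET.Element('img', {'src': img.get('src', '')})

        new_p = ET.Element('p')
        new_p.append(new_img)
        new_div = ET.Element('div')
        new_div.append(new_p)

        # Same position the recipe uses: index 0 of the original parent.
        # The original <img> is left alone, since the reply says not to extract it.
        parent.insert(0, new_div)
    return root


# Sample markup is an assumption, modelled on the img tag printed in the question.
sample = (
    '<td class="txt_articulo">'
    '<img style="float: left;" alt="ivision-marrojo" height="225" width="300" '
    'src="/images/stories/historia/ivision-marrojo.jpg" />'
    'Texto</td>'
)
root = wrap_images(ET.fromstring(sample))
print(ET.tostring(root, encoding='unicode'))
```

In this sketch nothing is extracted, so the original img stays where it was and the image ends up in the tree twice; remove the original afterwards if that is not what you want.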