09-29-2010, 11:28 AM   #4
Starson17
Quote:
Originally Posted by TonytheBookworm
Alright, time to call in the pro. Starson17, if you have a second, can you look at this and tell me why the image doesn't show up?
It's staring you in the face, but you probably haven't run into it before.

Quote:
Not sure what the heck I'm doing wrong, because it looks like it should work.
When I used print statements, newtag showed up as <p> </p>, but for whatever reason the image data never gets inserted into that tag.
Thanks in advance for the help.
Spoiler:

Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
from BeautifulSoup import BeautifulSoup, Tag

class RevistaMuyInteresante(BasicNewsRecipe):

    title       = 'Revista Muy Interesante'
    __author__  = 'Jefferson Frantz'
    description = 'Revista de divulgacion'
    timefmt = ' [%d %b, %Y]'
    language = 'es_ES'
    #conversion_options = {'linearize_tables' : True}
    keep_only_tags = [dict(name='div', attrs={'class':['article']}),dict(name='td', attrs={'class':['txt_articulo']})]
    remove_tags        = [
                             dict(name=['object','link','script','ul'])
                            ,dict(name='div', attrs={'id':['comment']})
                            ,dict(name='td', attrs={'class':['buttonheading']})
                            ,dict(name='div', attrs={'class':['tags_articles']})
                         ]

    remove_tags_after = dict(name='div', attrs={'class':'tags_articles'})

    def nz_parse_section(self, url):
        soup = self.index_to_soup(url)
        div = soup.find(attrs={'class':'contenido'})

        current_articles = []
        for x in div.findAllNext(attrs={'class':['headline']}):
            a = x.find('a', href=True)
            if a is None:
                continue
            title = self.tag_to_string(a)
            url = a.get('href', False)
            if not url or not title:
                continue
            if url.startswith('/'):
                url = 'http://www.muyinteresante.es'+url
            self.log('\t\tFound article:', title)
            self.log('\t\t\t', url)
            current_articles.append({'title': title, 'url':url,
                'description':'', 'date':''})

        return current_articles


    def parse_index(self):
        feeds = []
        for title, url in [
            ('Historia',
             'http://www.muyinteresante.es/historia-articulos'),
        ]:
            articles = self.nz_parse_section(url)
            if articles:
                feeds.append((title, articles))
        return feeds
    
    def preprocess_html(self, soup):
        for img_tag in soup.findAll('img'):
            parent_tag = img_tag.parent
            data = img_tag
            img_tag.extract()
            newdiv = Tag(soup,'div')
            newtag = Tag(soup,'p')
            newtag.insert(0,data)
            newdiv.insert(0,newtag)
            parent_tag.insert(0,newdiv)
        return soup


I keep getting this crap:
newdiv is: <div><p></p></div>
data is: <img style="float: left;" alt="ivision-marrojo" height="225" width="300" src="/images/stories/historia/ivision-marrojo.jpg" />
newtag is: <p></p>

That tells me it's obviously picking up the image tag and storing it, but for whatever reason it refuses to insert it into newdiv.
The problem is here, in the last two characters of the tag ("/>"):
Code:
data is:  <img style="float: left;" alt="ivision-marrojo" height="225" width="300" src="/images/stories/historia/ivision-marrojo.jpg" />
For some reason, this self-closing tag format makes Beautiful Soup very unhappy. Try this change to "data" in your recipe (it's a bit of a kludge, so adapt it to whatever you're doing):
Code:
            data = img_tag  # existing line; overwritten below
            # Build a fresh <img> tag rather than reusing the parsed one
            new_img_tag = Tag(soup,'img')
            new_img_tag['src'] = img_tag['src']
            data = new_img_tag
and drop the img_tag.extract() call. I doubt that's the most efficient way to do this, but I'm not really sure what you're doing with the recipe.
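One caveat with the snippet above: it copies only src, so the alt, width and height on the original tag are lost. If those matter, a small helper along these lines might do. This is only a sketch: copy_attrs and its default attribute list are hypothetical names of mine, not part of calibre or Beautiful Soup, and it assumes nothing more than .get() and item assignment on the tags, which the recipe already uses. Plain dicts stand in for the tags here so it can be tried outside calibre.

```python
def copy_attrs(src_tag, dst_tag, names=('src', 'alt', 'width', 'height')):
    """Copy the named attributes from src_tag to dst_tag, skipping missing ones.

    Hypothetical helper: works on anything that offers .get() and
    item assignment (plain dicts are used below as stand-ins for tags).
    """
    for name in names:
        value = src_tag.get(name)
        if value is not None:
            dst_tag[name] = value
    return dst_tag


if __name__ == "__main__":
    # Stand-in for the parsed <img> tag from the question
    old_img = {
        'style': 'float: left;',
        'alt': 'ivision-marrojo',
        'height': '225',
        'width': '300',
        'src': '/images/stories/historia/ivision-marrojo.jpg',
    }
    new_img = copy_attrs(old_img, {})
    print(new_img)
```

In the recipe this would replace the single new_img_tag['src'] = img_tag['src'] line with copy_attrs(img_tag, new_img_tag). The style attribute is left out of the defaults on purpose, but that is a judgment call.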