09-29-2010, 01:39 AM   #3
TonytheBookworm
Alright, time to call in the pro. Starson17, if you have a second, can you look at this and tell me why the image doesn't show up? My thinking is:
  1. get image tag
  2. assign its value to a new variable
  3. extract the image tag from the soup
  4. make a new tag that contains a div and a p
  5. put the image data back into the soup under the p tag that was created.
Not sure what the heck I'm doing wrong, because it looks like it should work.
When I used print statements, newtag showed up as <p> </p>, but for whatever reason the image data never gets inserted into that tag.
Thanks in advance for the help.
Spoiler:

Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
from BeautifulSoup import BeautifulSoup, Tag

class RevistaMuyInteresante(BasicNewsRecipe):

    title       = 'Revista Muy Interesante'
    __author__  = 'Jefferson Frantz'
    description = 'Revista de divulgacion'
    timefmt = ' [%d %b, %Y]'
    language = 'es_ES'
    #conversion_options = {'linearize_tables' : True}
    keep_only_tags = [dict(name='div', attrs={'class':['article']}),dict(name='td', attrs={'class':['txt_articulo']})]
    remove_tags        = [
                             dict(name=['object','link','script','ul'])
                            ,dict(name='div', attrs={'id':['comment']})
                            ,dict(name='td', attrs={'class':['buttonheading']})
                            ,dict(name='div', attrs={'class':['tags_articles']})
                         ]

    remove_tags_after = dict(name='div', attrs={'class':'tags_articles'})

    def nz_parse_section(self, url):
        soup = self.index_to_soup(url)
        div = soup.find(attrs={'class':'contenido'})

        current_articles = []
        for x in div.findAllNext(attrs={'class':['headline']}):
            a = x.find('a', href=True)
            if a is None:
                continue
            title = self.tag_to_string(a)
            url = a.get('href', False)
            if not url or not title:
                continue
            if url.startswith('/'):
                url = 'http://www.muyinteresante.es'+url
            self.log('\t\tFound article:', title)
            self.log('\t\t\t', url)
            current_articles.append({'title': title, 'url':url,
                'description':'', 'date':''})

        return current_articles

    def parse_index(self):
        feeds = []
        for title, url in [
            ('Historia',
             'http://www.muyinteresante.es/historia-articulos'),
        ]:
            articles = self.nz_parse_section(url)
            if articles:
                feeds.append((title, articles))
        return feeds
    
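    def preprocess_html(self, soup):
        for img_tag in soup.findAll('img'):
            # Steps 1-3: remember the parent, keep a reference to the image, detach it
            parent_tag = img_tag.parent
            data = img_tag
            img_tag.extract()
            # Step 4: build the new <div><p></p></div> wrapper
            newdiv = Tag(soup,'div')
            newtag = Tag(soup,'p')
            # Step 5: image goes into the <p>, the <p> into the <div>,
            # and the <div> back into the image's original parent
            newtag.insert(0,data)
            newdiv.insert(0,newtag)
            parent_tag.insert(0,newdiv)
        return soup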


I keep getting this crap:
newdiv is: <div><p></p></div>
data is: <img style="float: left;" alt="ivision-marrojo" height="225" width="300" src="/images/stories/historia/ivision-marrojo.jpg" />
newtag is: <p></p>

That tells me it is obviously picking up the image tag and has it stored,
but for whatever reason it refuses to insert it into newdiv.
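
For comparison, here is the same extract-and-wrap idea (steps 1 to 5 above) as a standalone sketch. It is only an illustration under some loud assumptions: it uses the standard library's xml.dom.minidom instead of the BeautifulSoup 3 module the recipe imports, the helper name wrap_images is made up, and it puts the wrapper where the image was rather than at index 0 of the parent. It says nothing about why the recipe version fails inside calibre, but it gives a known-good result to compare the print output against.

```python
from xml.dom import minidom


def wrap_images(markup):
    """Wrap every <img> in <div><p>...</p></div> (hypothetical helper, minidom only)."""
    doc = minidom.parseString(markup)
    # Work on a plain list copy because the tree is modified inside the loop.
    for img in list(doc.getElementsByTagName('img')):
        parent = img.parentNode                # remember where the image lives
        new_div = doc.createElement('div')     # build the <div><p></p></div> wrapper
        new_p = doc.createElement('p')
        new_div.appendChild(new_p)
        parent.replaceChild(new_div, img)      # wrapper takes the image's place, image is detached
        new_p.appendChild(img)                 # detached image goes under the new <p>
    return doc.documentElement.toxml()


print(wrap_images('<td><img src="/images/a.jpg"/>text</td>'))
# prints: <td><div><p><img src="/images/a.jpg"/></p></div>text</td>
```

The point of the sketch is the order of operations: the image is only printed inside the <p> after the final appendChild, so any print placed before that call would still show an empty <p></p>.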

Last edited by TonytheBookworm; 09-29-2010 at 01:45 AM.