09-02-2010, 03:57 PM   #2600
TonytheBookworm
Quote:
Originally Posted by Starson17
Is there any reason you can't do something like this to find both:
Code:
soup.find('div',attrs={'class':['pagination', 'top10pagnation']})
I tried that with no luck.
Here is what I have thus far. For some reason it takes a century to finish, even when I use the test command line. After looking further at the HTML, I discovered that the pagination, even though it sits in different tags, always appears to be nested inside articleFooter. So here is what I came up with. Notice my comments. Definitely in the learning process on this one, haha.

Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'How Stuff Works'
    language = 'en'
    __author__ = 'TonytheBookworm'
    description = 'How stuff works'
    publisher = 'Tony'
    category = 'information'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    #INDEX                 = u'http://www.adventuregamers.com'
    #extra_css = '.headline {font-size: x-large;} \n .fact { padding-top: 10pt }'
    #masthead_url = 'http://gawand.org/wp-content/uploads/2010/06/ajc-logo.gif'
    keep_only_tags    = [
                         dict(name='div', attrs={'class':['articleBody','articleFooter']})
      #                 ,dict(attrs={'id':['cxArticleText','cxArticleBodyText']})
                        ]
    feeds          = [
                      ('AutoStuff', 'http://feeds.feedburner.com/HowstuffworksAutostuffDailyRssFeed'),
                      
                    ]
   
    def append_page(self, soup, appendtag, position):
        pager = soup.find('div', attrs={'class':'articleFooter'}) # articleFooter contains the nextpage navigation
        print 'the pager soup is: ', pager
        if pager:
            nexturl = pager.a['href']
            print 'THE NEXT URL IS: ', nexturl
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'articleBody'}) # find the content body for the nextpage
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        # don't think i need this then again I'm not sure

        #pager = soup.find('div',attrs={'class':'toolbar_fat'})
        #if pager:
        #    pager.extract()
        return soup


By the way, once again, THANK YOU FOR DEVOTING YOUR TIME TO HELPING ME. Very much appreciated!!! That goes for the others as well.

Added:
It looks like it gets stuck in an infinite loop.
Notice how it successfully gets the next URL. Then, when it goes to that page, it finds the URL for the previous page, so it goes back to it. Then it turns around and goes to the next page again, then back, and so on.
Spoiler:

THE NEXT URL IS: http://auto.howstuffworks.com/under-...-insurance.htm
the pager soup is: <div class="articleFooter">
<div class="pagination">
<a href="http://auto.howstuffworks.com/under-the-hood/cost-of-car-ownership/liability-car-insurance1.htm" class="next" omnivars="&amp;c25=Page Zero : Next Button Bottom&amp;v39=Page Zero : Next Button Bottom" omni="How much liability car insurance do I need? : Next : Bottom : Page 0">Next Page</a>
<div class="clearer"></div>
</div>
</div>
THE NEXT URL IS: http://auto.howstuffworks.com/under-...insurance1.htm
the pager soup is: <div class="articleFooter">
<div class="pagination">
<a href="http://auto.howstuffworks.com/under-the-hood/cost-of-car-ownership/liability-car-insurance.htm" class="previous" omnivars="&amp;c25=LMI : Previous Button Bottom&amp;v39=LMI : Previous Button Bottom" omni="How much liability car insurance do I need? : Previous : Bottom : Page 1">Previous Page</a>
<div class="clearer"></div>
</div>
</div>
THE NEXT URL IS: http://auto.howstuffworks.com/under-...-insurance.htm
the pager soup is: <div class="articleFooter">
<div class="pagination">
<a href="http://auto.howstuffworks.com/under-the-hood/cost-of-car-ownership/liability-car-insurance1.htm" class="next" omnivars="&amp;c25=Page Zero : Next Button Bottom&amp;v39=Page Zero : Next Button Bottom" omni="How much liability car insurance do I need? : Next : Bottom : Page 0">Next Page</a>
<div class="clearer"></div>
</div>
</div>
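
Looking at that log, the first link inside articleFooter is "next" on page 0 but "previous" on page 1, so grabbing the first link bounces between the two pages. Here is a rough sketch of one way around that: only follow a link whose class is "next", and remember which URLs were already visited. To keep it runnable outside calibre, it uses Python 3's standard html.parser instead of calibre's BeautifulSoup. The names NextLinkFinder, find_next_url, collect_page_urls and fetch are made up for illustration and are not part of calibre. Inside the recipe, something like pager.find('a', attrs={'class':'next'}) in place of pager.a would probably do the same job, but I have not tested that.

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a class="next"> inside div.articleFooter."""

    def __init__(self):
        super().__init__()
        self.footer_depth = 0   # > 0 while inside <div class="articleFooter">
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'div':
            # Count nested divs so we know when the footer really ends.
            if self.footer_depth or 'articleFooter' in classes:
                self.footer_depth += 1
        elif (tag == 'a' and self.footer_depth
              and self.next_url is None and 'next' in classes):
            # Ignore "previous" links; only a "next" link counts.
            self.next_url = attrs.get('href')

    def handle_endtag(self, tag):
        if tag == 'div' and self.footer_depth:
            self.footer_depth -= 1


def find_next_url(html):
    """Return the pager's 'next page' URL, or None if there is none."""
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.next_url


def collect_page_urls(start_url, fetch):
    """Follow 'next' links from start_url without visiting a URL twice.

    `fetch` is a hypothetical callable that maps a URL to its HTML text.
    """
    seen = []
    url = start_url
    while url and url not in seen:
        seen.append(url)
        url = find_next_url(fetch(url))
    return seen
```

With a check like this, the last page (which only has a "previous" link) returns None and the walk stops there. The visited list is just a safety net in case a site ever links "next" back to an earlier page.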

Last edited by TonytheBookworm; 09-02-2010 at 04:01 PM. Reason: added output from log