06-03-2010, 02:54 PM   #2036
Starson17
Quote:
Originally Posted by kidtwisted
Hey Starson17,
I'm kinda stuck. When I add the append_page code, the test HTML only contains the feed description and date; without it I get the first page, so I'm screwing it up somewhere.
You're right - you screwed it up somewhere.
Don't worry, you're in good company.
Quote:
Here's what I have for tweaktown.com:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1273795663(BasicNewsRecipe):
    title = u'TweakTown Latest Tech'
    description = 'TweakTown Latest Tech'
    __author__ = 'KidTwisted'
    publisher             = 'TweakTown'
    category              = 'PC Articles, Reviews and Guides'
    use_embedded_content   = False
    max_articles_per_feed = 1
    oldest_article = 7
    timefmt  = ' [%Y %b %d ]'
    no_stylesheets = True
    language = 'en'
    #recursion             = 10
    remove_javascript = True
    conversion_options = { 'linearize_tables' : True}
   # reverse_article_order = True
    #INDEX                 = u'http://www.tweaktown.com'

    html2lrf_options = [
                          '--comment', description
                        , '--category', category
                        , '--publisher', publisher
                        ]
    
    html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"' 
	
    keep_only_tags = [dict(name='div', attrs={'id':['article']})]

    feeds =  [ (u'Articles Reviews', u'http://feeds.feedburner.com/TweaktownArticlesReviewsAndGuidesRss20?format=xml') ]

    def get_article_url(self, article):
        return article.get('guid',  None)
    
    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           nexturl = pager.a['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'id':'article'})
           for it in texttag.findAll(style=True):
               del it['style']
           newpos = len(texttag.contents)          
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)
        
    
    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0,mtag)    
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           pager.extract()        
        return soup

Could you or someone in the know take a look at it to see what I'm doing wrong? I commented out "INDEX" because the link for the next page is a complete link. Any help on this would be great.
The error is subtle. You did a good job of converting the sample code, but look at these lines from your code:
Code:
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           nexturl = pager.a['href']
Compare to the sample code:
Code:
        pager = soup.find('div',attrs={'class':'toolbar_fat_next'})
        if pager:
           nexturl = self.INDEX + pager.a['href']
In the sample, the next page link was inside an <a> tag which was, in turn, inside a <div> tag. The sample code searched for the <div> tag, then grabbed the <a> tag's "href" inside it. In your case, the <a> is marked with the class='next' so you didn't search for its parent, you searched directly for the <a> tag. That's fine, but then you copied the code that looked for an <a> tag inside the tag you found, and there wasn't one.
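To see the difference in isolation, here's a rough sketch. It uses Python's standard-library xml.etree.ElementTree as a stand-in for the BeautifulSoup calls in the recipe, so the lookup syntax is different, but the nesting issue is the same. Both markup fragments and the example.com URL are made up for illustration; they are not the real pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the sample site wraps the link in a <div>.
sample_page = ET.fromstring(
    '<body><div class="toolbar_fat_next"><a href="/page2">Next</a></div></body>'
)
# Hypothetical markup: here the class sits on the <a> itself.
tweaktown_page = ET.fromstring(
    '<body><a class="next" href="http://example.com/page2">Next</a></body>'
)

# Sample recipe: find the wrapper, then step down to the <a> inside it.
wrapper = sample_page.find(".//div[@class='toolbar_fat_next']")
sample_href = wrapper.find('a').get('href')

# Your recipe: the search already returns the <a>, so there is no inner <a>.
pager = tweaktown_page.find(".//a[@class='next']")
inner = pager.find('a')          # None: nothing is nested inside the link
direct_href = pager.get('href')  # read the attribute from the tag itself

print(sample_href, inner, direct_href)
```

The second half is the situation in your recipe: once the search returns the link itself, the href has to be read from that tag, not from a child of it.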

You need to change nexturl = pager.a['href'] to:
Code:
nexturl = pager['href']
Hold on .... let me test it .....

Yep, that does it. There's still lots of junk in my output, but it's definitely pulling multiple pages. My recipe may be slightly different from yours, but I think that should get you on your way.
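For what it's worth, the page-following idea behind append_page can be tried outside calibre. This is only an illustration under assumptions: the PAGES dict and fetch() are hypothetical stand-ins for self.index_to_soup, ElementTree replaces BeautifulSoup, and a plain loop replaces the recursion used in the recipe.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-page article, keyed by URL (not real site content).
PAGES = {
    'p1': '<div id="article"><p>one</p><a class="next" href="p2">Next</a></div>',
    'p2': '<div id="article"><p>two</p><a class="next" href="p3">Next</a></div>',
    'p3': '<div id="article"><p>three</p></div>',
}

def fetch(url):
    """Stand-in for self.index_to_soup(url): parse the page at `url`."""
    return ET.fromstring(PAGES[url])

def collect_paragraphs(start_url):
    """Follow class='next' links from start_url and gather every <p> text."""
    texts = []
    visited = set()  # guard against a page that links back to itself
    url = start_url
    while url is not None and url not in visited:
        visited.add(url)
        article = fetch(url)
        texts.extend(p.text for p in article.iter('p'))
        # The href is read from the <a> itself, as in the fix above.
        pager = article.find(".//a[@class='next']")
        url = pager.get('href') if pager is not None else None
    return texts

paragraphs = collect_paragraphs('p1')
print(paragraphs)
```

The loop stops on the last page because no link with class 'next' is found there, which is the same stopping condition the recipe relies on.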

Last edited by Starson17; 06-03-2010 at 03:51 PM.