MobileRead Forums - View Single Post - Custom recipes (archive, read-only)

Starson17 · 06-03-2010, 02:54 PM

Quote:

Originally Posted by kidtwisted

Hey Starson17,
I'm kinda stuck. When I add the append_page code, the test HTML contains only the feed description and date; without it I get the first page, so I'm screwing it up somewhere.

You're right - you screwed it up somewhere.

Don't worry, you're in good company.

Quote:

Here's what I have for tweaktown.com:

Code:

class AdvancedUserRecipe1273795663(BasicNewsRecipe):
    title = u'TweakTown Latest Tech'
    description = 'TweakTown Latest Tech'
    __author__ = 'KidTwisted'
    publisher             = 'TweakTown'
    category              = 'PC Articles, Reviews and Guides'
    use_embedded_content   = False
    max_articles_per_feed = 1
    oldest_article = 7
    timefmt  = ' [%Y %b %d ]'
    no_stylesheets = True
    language = 'en'
    #recursion             = 10
    remove_javascript = True
    conversion_options = { 'linearize_tables' : True}
   # reverse_article_order = True
    #INDEX                 = u'http://www.tweaktown.com'

    html2lrf_options = [
                          '--comment', description
                        , '--category', category
                        , '--publisher', publisher
                        ]
    
    html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"' 
	
    keep_only_tags = [dict(name='div', attrs={'id':['article']})]

    feeds =  [ (u'Articles Reviews', u'http://feeds.feedburner.com/TweaktownArticlesReviewsAndGuidesRss20?format=xml') ]

    def get_article_url(self, article):
        return article.get('guid',  None)
    
    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           nexturl = pager.a['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'id':'article'})
           for it in texttag.findAll(style=True):
               del it['style']
           newpos = len(texttag.contents)          
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)
        
    
    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0,mtag)    
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           pager.extract()        
        return soup

Could you or someone in the know take a look at it to see what I'm doing wrong? I commented out "INDEX" because the link for the next page is a complete URL. Any help on this would be great.

The error is subtle. You did a good job of converting the sample code, but look at these lines from your code:

Code:

        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           nexturl = pager.a['href']

Compare to the sample code:

Code:

        pager = soup.find('div',attrs={'class':'toolbar_fat_next'})
        if pager:
           nexturl = self.INDEX + pager.a['href']

In the sample, the next page link was inside an <a> tag which was, in turn, inside a <div> tag. The sample code searched for the <div> tag, then grabbed the "href" of the <a> tag inside it. In your case, the <a> tag itself carries class='next', so you searched for it directly instead of for its parent. That's fine, but then you copied the code that looks for an <a> tag inside the tag you found, and there isn't one, so pager.a comes back empty and there is no "href" to read.

You need to change nexturl = pager.a['href'] to:

Code:

nexturl = pager['href']

Hold on .... let me test it .....

Yep - that does it. There's still lots of junk in my output, but it's definitely pulling multiple pages. My recipe may be slightly different from yours, but I think that should get you on your way.
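
If it helps to see the two lookups side by side, here is a rough standalone sketch. It is not recipe code: it uses Python's standard-library ElementTree as a stand-in for BeautifulSoup so it runs without calibre, and both markup snippets and the example.com URL are made up for illustration. Only the class names ('toolbar_fat_next' and 'next') come from the code above.

Code:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup in the style of the sample recipe's site:
# the link sits inside a wrapping <div>.
SAMPLE_PAGE = (
    '<body><div class="toolbar_fat_next">'
    '<a href="/page2">Next</a></div></body>'
)

# Hypothetical markup in the style of the TweakTown case:
# the class is on the <a> tag itself, with no wrapper to search for.
TWEAKTOWN_PAGE = (
    '<body><a class="next" href="http://example.com/page2">Next</a></body>'
)


def next_url_from_wrapper(root):
    """Find the wrapping <div>, then read the href of the <a> inside it."""
    pager = root.find(".//div[@class='toolbar_fat_next']")
    if pager is None:
        return None
    link = pager.find("a")  # plays the role of pager.a
    return None if link is None else link.get("href")


def next_url_from_link(root):
    """Find the <a> itself and read its href directly."""
    pager = root.find(".//a[@class='next']")
    # pager.get("href") plays the role of pager['href']
    return None if pager is None else pager.get("href")


sample = ET.fromstring(SAMPLE_PAGE)
tweaktown = ET.fromstring(TWEAKTOWN_PAGE)

wrapper_result = next_url_from_wrapper(sample)
direct_result = next_url_from_link(tweaktown)

# The mistake from the recipe: looking for a child <a> inside the <a>
# that was already found. There is no such child, so nothing comes back.
nested_lookup = tweaktown.find(".//a[@class='next']").find("a")

print(wrapper_result, direct_result, nested_lookup)
```

The point is only the shape of the lookup: when the tag you searched for is already the link, read its "href" directly rather than drilling down one more level.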