Quote:
Originally Posted by kidtwisted
Hey Starson17,
I'm kinda stuck. When I add the append_page code, the test HTML only contains the feed description and date; without it I get the 1st page, so I'm screwing it up somewhere.
You're right - you screwed it up somewhere

Don't worry, you're in good company.
Quote:
Spoiler:
here's what I have for tweaktown.com:
Code:
class AdvancedUserRecipe1273795663(BasicNewsRecipe):
    title = u'TweakTown Latest Tech'
    description = 'TweakTown Latest Tech'
    __author__ = 'KidTwisted'
    publisher = 'TweakTown'
    category = 'PC Articles, Reviews and Guides'
    use_embedded_content = False
    max_articles_per_feed = 1
    oldest_article = 7
    timefmt = ' [%Y %b %d ]'
    no_stylesheets = True
    language = 'en'
    #recursion = 10
    remove_javascript = True
    conversion_options = { 'linearize_tables' : True}
    # reverse_article_order = True
    #INDEX = u'http://www.tweaktown.com'

    html2lrf_options = [
        '--comment', description
        , '--category', category
        , '--publisher', publisher
    ]

    html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"'

    keep_only_tags = [dict(name='div', attrs={'id':['article']})]

    feeds = [ (u'Articles Reviews', u'http://feeds.feedburner.com/TweaktownArticlesReviewsAndGuidesRss20?format=xml') ]

    def get_article_url(self, article):
        return article.get('guid', None)

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
            nexturl = pager.a['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'id':'article'})
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0,mtag)
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
            pager.extract()
        return soup
Could you or someone in the know take a look at it to see what I'm doing wrong? I commented out "INDEX" because the link for the next page is a complete link. Any help on this would be great.
The error is subtle. You did a good job of converting the sample code, but look at these lines from your code:
Code:
pager = soup.find('a',attrs={'class':'next'})
if pager:
    nexturl = pager.a['href']
Compare to the sample code:
Code:
pager = soup.find('div',attrs={'class':'toolbar_fat_next'})
if pager:
    nexturl = self.INDEX + pager.a['href']
In the sample, the next page link was inside an <a> tag which was, in turn, inside a <div> tag. The sample code searched for the <div> tag, then grabbed the <a> tag's "href" inside it. In your case, the <a> is marked with the class='next' so you didn't search for its parent, you searched directly for the <a> tag. That's fine, but then you copied the code that looked for an <a> tag inside the tag you found, and there wasn't one.
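If you want to see the difference outside of calibre, here's a rough sketch. It uses Python's standard-library ElementTree as a stand-in for the soup object (an assumption on my part, just so it runs anywhere), and the two markup snippets and their URLs are made up to mirror the two cases. The idea is the same: ask for an <a> inside a tag that already is the <a>, and you get nothing back.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, not copied from either site.
# Sample-style: the link sits inside a <div> that carries the class.
SAMPLE = '<body><div class="toolbar_fat_next"><a href="/page2">Next</a></div></body>'
# TweakTown-style: the class is on the <a> itself.
TWEAK = '<body><a class="next" href="http://www.tweaktown.com/page2">Next</a></body>'

sample_pager = ET.fromstring(SAMPLE).find(".//div[@class='toolbar_fat_next']")
tweak_pager = ET.fromstring(TWEAK).find(".//a[@class='next']")

# The sample's pager is a <div>, so the link is one level down.
sample_href = sample_pager.find('a').get('href')

# Here the pager *is* the <a>; looking for an <a> inside it finds nothing.
nested = tweak_pager.find('a')

# Reading the attribute straight off the tag that was found works.
tweak_href = tweak_pager.get('href')

print(sample_href, nested, tweak_href)
```

In the recipe, the equivalent of `nested` is `pager.a`, which comes back empty, so asking it for `['href']` fails.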
You need to change nexturl = pager.a['href'] to:
Code:
nexturl = pager['href']
Hold on .... let me test it .....
Yep - That does it. There's still lots of junk in my output, but it's definitely pulling multipages. My recipe may be slightly different from yours, but I think that should get you on your way.
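On the INDEX question: since your next-page link is already a complete URL, leaving INDEX out is fine. If you ever want the same code to cope with both relative and complete links, one option is `urljoin` from the standard library. This is only a sketch in Python 3 syntax; `resolve_next` is a hypothetical helper name, and the article paths are invented for illustration.

```python
from urllib.parse import urljoin

INDEX = 'http://www.tweaktown.com'

def resolve_next(href, base=INDEX):
    """Return an absolute URL whether href is relative or already complete."""
    # urljoin leaves a complete URL untouched and resolves a relative one
    # against the base.
    return urljoin(base, href)

# Hypothetical hrefs for illustration only.
relative = resolve_next('/articles/1234/index2.html')
absolute = resolve_next('http://www.tweaktown.com/articles/1234/index2.html')
print(relative)
print(absolute)
```

Both calls end up at the same address, so the recipe would not need to know which kind of link the site hands out.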