I'm updating the Creative Bloq recipe because I've just found that a few articles continue onto a second page, and the recipe currently doesn't pull those second pages. This is the modified recipe, which doesn't work yet.
__license__ = 'GPL v3'
__copyright__ = '2014, Bonni Salles - post in forum for help'

'''
Creative Bloq (formerly .net magazine)
'''

from calibre.web.feeds.news import BasicNewsRecipe


class CreativeBloq(BasicNewsRecipe):
    title = u'Creative Bloq (formerly .net magazine)'
    __author__ = 'Bonni Salles'
    oldest_article = 7
    publication_type = 'blog'
    max_articles_per_feed = 100
    description = 'Web design and tutorials from Creative Bloq (part of .net magazine and others)'
    publisher = 'Creative Bloq'
    category = 'internet, web design'
    language = 'en'
    encoding = 'utf-8'
    ignore_duplicate_articles = {'title', 'url'}
    remove_empty_feeds = True
    # Note: with auto_cleanup enabled, calibre runs its readability cleanup
    # on the raw HTML before preprocess_html is called, so the pager markup
    # may already be stripped by the time append_page sees the page.
    auto_cleanup = True

    # Presently this downloads the whole blog from the main feed. To limit it
    # to specific sections, comment out the main feed and uncomment the
    # sections you want below.
    feeds = [
        (u'Creative Bloq', u'http://www.creativebloq.com/feed/'),
        # (u'3D', u'http://www.creativebloq.com/feed/3d'),
        # (u'Adobe', u'http://www.creativebloq.com/feed/adobe'),
        # (u'Animation', u'http://www.creativebloq.com/feed/animation'),
        # (u'Apple', u'http://www.creativebloq.com/feed/apple'),
        # (u'Branding', u'http://www.creativebloq.com/feed/branding'),
        # (u'Graphic Design', u'http://www.creativebloq.com/feed/graphic-design'),
        # (u'Illustration', u'http://www.creativebloq.com/feed/illustration'),
        # (u'News', u'http://www.creativebloq.com/feed/news'),
        # (u'Opinion', u'http://www.creativebloq.com/feed/opinion'),
        # (u'Tutorials', u'http://www.creativebloq.com/feed/tutorial'),
        # (u'Typography', u'http://www.creativebloq.com/feed/typography'),
        # (u'Video', u'http://www.creativebloq.com/feed/video'),
        # (u'web design', u'http://www.creativebloq.com/feed/web-design'),
    ]
    def append_page(self, soup, appendtag, position, surl):
        # The pager only appears on articles that run to more than one page.
        pager = soup.find('li', attrs={'class': 'pager-current first'})
        if pager:
            # The pager block (div class="item-list") appears twice on the
            # page, so findAll returns two 'pager-next' items; both point at
            # the same URL, so the first one found is enough.
            nextpages = soup.findAll('li', attrs={'class': 'pager-next'})
            if nextpages:
                # The href lives on the <a> inside the <li>, not on the <li>.
                nextpage = nextpages[0].find('a', href=True)
                if nextpage is not None and nextpage['href'] != surl:
                    nexturl = nextpage['href']
                    if nexturl.startswith('/'):
                        # Handle a relative pager link, just in case.
                        nexturl = 'http://www.creativebloq.com' + nexturl
                    soup2 = self.index_to_soup(nexturl)
                    # Pull the article text from the second page, not the
                    # pager itself. 'node-body' is a guess at the class of
                    # the div holding the article body - adjust it to match
                    # the actual page source.
                    texttag = soup2.find('div', attrs={'class': 'node-body'})
                    for it in texttag.findAll(style=True):
                        del it['style']
                    newpos = len(texttag.contents)
                    self.append_page(soup2, texttag, newpos, nexturl)
                    texttag.extract()
                    pager.extract()
                    appendtag.insert(position, texttag)
    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3, '')
        # Remove the leftover pager marker from the first page.
        pager = soup.find('li', attrs={'class': 'pager-current first'})
        if pager:
            pager.extract()
        return self.adeify_images(soup)
This is the HTML from one of the articles that shows how it links to the second page. One thing to note is that there are two spots that have div class="item-list" as their lead.
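To show what the recipe is matching, here's a quick standalone check of the pager parsing that can run outside calibre. The markup in SAMPLE_HTML is a hypothetical Drupal-style pager built from the class names the recipe searches for (not the actual Creative Bloq source), and the check shows why the href has to be read from the <a> inside the <li>:

from bs4 import BeautifulSoup

# SAMPLE_HTML is hypothetical Drupal-style pager markup based on the class
# names the recipe searches for - it is NOT the actual Creative Bloq source.
SAMPLE_HTML = '''
<div class="item-list">
  <ul class="pager">
    <li class="pager-current first">1</li>
    <li class="pager-next"><a href="/some-article/2">next</a></li>
  </ul>
</div>
'''

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
li = soup.find('li', attrs={'class': 'pager-next'})
print(li.get('href'))  # None - the <li> itself has no href attribute
print(li.a['href'])    # /some-article/2 - the link lives on the <a>

(Inside a recipe calibre supplies its own BeautifulSoup, but bs4 behaves the same way for this check.)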
I think I have the sort of code that's needed, but if anyone has an easier way to pull the second page for the few articles that have one, please let me know. Just to note, plain recursion doesn't work: it follows a lot more links and creates a very big EPUB. I've already tried it.
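One other thought on the "easier way" front: plain recursion pulled far too much, but calibre's match_regexps might rein it in so that only continuation-page links get followed. Something like this sketch, where the URL pattern is only a guess at how the second pages are numbered and would need checking against real article URLs:

    recursions = 1
    # Only follow links that look like a numbered continuation page. This
    # regexp is a guess at the URL pattern and needs checking against real
    # article URLs before relying on it.
    match_regexps = [r'creativebloq\.com/.+/\d+$']

It could still grab too much if other links on the page happen to match the pattern, so the append_page route may well be the safer one.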