multi-page coding for Creative Blog

Camper65 · 03-15-2015, 06:17 PM

I'm trying to update the Creative Blog recipe because I just found out that a few articles continue onto a second page, and the recipe currently doesn't pull those second pages. This is the modified recipe, which doesn't work yet.

Spoiler:

This is the HTML from one of the articles, showing how it links to the second page. One thing to note is that there are two spots that have div class="item-list" as their lead.

Spoiler:

I think I have the type of coding that's needed, but if anyone has an easier way to pull the second page for these few articles, please let me know. Note that I already tried recursion and it doesn't work: it pulls in a lot more links and creates a very big epub.

kovidgoyal · 03-16-2015, 12:43 AM

Don't use append_page(). Set recursions = 1 and use is_link_wanted() instead.

Camper65 · 03-16-2015, 11:14 PM

Kovid, I modified it to use recursion and is_link_wanted, but I can't get is_link_wanted right. Can you help with what I should use to find the URL that corresponds to li class=pager-next, or to use the href that corresponds to the "next" entry, so that I can get the second page (or more) properly? I'm posting the recipe again so you can see the updates.

Spoiler:

Code:

__license__ = 'GPL v3'
__copyright__ = '2014, Bonni Salles - post in forum for help'
'''
Creative Blog (formerly .net magazine)
'''

from calibre.web.feeds.news import BasicNewsRecipe

class creativeblog(BasicNewsRecipe):
    title          = u'Creative Blog (formerly .Net magazine)'
    __author__     = 'Bonni Salles'
    oldest_article = 7
    publication_type = 'blog'
    max_articles_per_feed = 100
    description    = 'Web Design and Tutorials from Creative Blog (part of .Net Magazine and others)'
    publisher      = 'Creative Blog'
    category       = 'internet, web design'
    language       = 'en'
    encoding      = 'utf-8'
    ignore_duplicate_articles = {'title', 'url'}
    remove_empty_feeds    = True
    auto_cleanup = True
    
    recursions = 1
    
    
# Presently this is set to download the whole group of blogs for the feed. If you want
# to limit it to specific sections of the blog, uncomment the ones you want to download.

    feeds          = [
                      (u'Creative Blog', u'http://www.creativebloq.com/feed/'),
#                      (u'3D', u'http://www.creativebloq.com/feed/3d'),
#                      (u'Adobe', u'http://www.creativebloq.com/feed/adobe'),
#                      (u'Animation', u'http://www.creativebloq.com/feed/animation'),
#                      (u'Apple', u'http://www.creativebloq.com/feed/apple'),
#                      (u'Branding', u'http://www.creativebloq.com/feed/branding'),
#                      (u'Graphic Design', u'http://www.creativebloq.com/feed/graphic-design'),
#                      (u'Illustration', u'http://www.creativebloq.com/feed/illustration'),
#                      (u'News', u'http://www.creativebloq.com/feed/news'),
#                      (u'Opinion', u'http://www.creativebloq.com/feed/opinion'),
#                      (u'Tutorials', u'http://www.creativebloq.com/feed/tutorial'),
#                      (u'Typography', u'http://www.creativebloq.com/feed/typography'),
#                      (u'Video', u'http://www.creativebloq.com/feed/video'),
#                      (u'web design', u'http://www.creativebloq.com/feed/web-design'),
                     ]
                     
    def is_link_wanted(self, url, tag):
            ans = re.match(self, url, '<a href="*/page-*">') is not None
    if ans:
            self.log('Following multipage link: %s'%url)
    return ans

kovidgoyal · 03-17-2015, 12:52 AM

The url is simply the contents of the href attribute, so your regex needs to just match that. Or if you want to use the tag, then use something like

if tag.findParent('li', attrs={'class':'pager-next'}) is not None:
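
A minimal sketch of the first suggestion (matching the url itself), assuming that follow-on pages always end in /page-<number>, as in the pager HTML posted in this thread. The helper name is_next_page_url and the exact pattern are illustrative assumptions, not part of calibre, and the pattern has only been checked against the one example link shown here.

```python
import re

# Assumed pattern: next-page links end in "/page-<number>", as in the pager
# HTML from the first post (".../promote-art-online-31514434/page-2").
NEXT_PAGE_RE = re.compile(r'/page-\d+/?$')


def is_next_page_url(url):
    """Return True if url looks like a follow-on page of an article."""
    # re.search (not re.match) so that relative and absolute urls both work.
    return NEXT_PAGE_RE.search(url) is not None


# Inside the recipe class this could be wired up roughly as:
#
#     def is_link_wanted(self, url, tag):
#         ans = is_next_page_url(url)
#         if ans:
#             self.log('Following multipage link: %s' % url)
#         return ans

print(is_next_page_url('/career/promote-art-online-31514434/page-2'))  # True
print(is_next_page_url('http://www.creativebloq.com/feed/'))           # False
```

Note that the pattern is applied to the url string only, not to an `<a href=...>` fragment, which is the point of the reply above.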

Spoiler contents from post #1 (Camper65 · 03-15-2015, 06:17 PM)

Modified recipe:

Code:

__license__ = 'GPL v3'
__copyright__ = '2014, Bonni Salles - post in forum for help'
'''
Creative Blog (formerly .net magazine)
'''

from calibre.web.feeds.news import BasicNewsRecipe

class creativeblog(BasicNewsRecipe):
    title          = u'Creative Blog (formerly .Net magazine)'
    __author__     = 'Bonni Salles'
    oldest_article = 7
    publication_type = 'blog'
    max_articles_per_feed = 100
    description    = 'Web Design and Tutorials from Creative Blog (part of .Net Magazine and others)'
    publisher      = 'Creative Blog'
    category       = 'internet, web design'
    language       = 'en'
    encoding      = 'utf-8'
    ignore_duplicate_articles = {'title', 'url'}
    remove_empty_feeds    = True
    auto_cleanup = True

# presently this is set to download the whole group of blogs for the feed.  If you want 
# to limit it to the specific sections of the blog that you want to download.

    feeds          = [
                      (u'Creative Blog', u'http://www.creativebloq.com/feed/'),
#                      (u'3D', u'http://www.creativebloq.com/feed/3d'),
#                      (u'Adobe', u'http://www.creativebloq.com/feed/adobe'),
#                      (u'Animation', u'http://www.creativebloq.com/feed/animation'),
#                      (u'Apple', u'http://www.creativebloq.com/feed/apple'),
#                      (u'Branding', u'http://www.creativebloq.com/feed/branding'),
#                      (u'Graphic Design', u'http://www.creativebloq.com/feed/graphic-design'),
#                      (u'Illustration', u'http://www.creativebloq.com/feed/illustration'),
#                      (u'News', u'http://www.creativebloq.com/feed/news'),
#                      (u'Opinion', u'http://www.creativebloq.com/feed/opinion'),
#                      (u'Tutorials', u'http://www.creativebloq.com/feed/tutorial'),
#                      (u'Typography', u'http://www.creativebloq.com/feed/typography'),
#                      (u'Video', u'http://www.creativebloq.com/feed/video'),
#                      (u'web design', u'http://www.creativebloq.com/feed/web-design'),
                     ]

    def append_page(self, soup, appendtag, position, surl):
        pager = soup.find('li', attrs={'class':'pager-current first'})
        if pager:
            nextpages = soup.findAll('li', attrs={'class':'pager-next'})
            nextpage = nextpages[1]
            if nextpage and (nextpage['href'] != surl):
                nexturl = nextpage['href']
                soup2 = self.index_to_soup(nexturl)
                texttag = soup2.find('li', attrs={'class':'pager-next'})
                for it in texttag.findAll(style=True):
                    del it['style']
                newpos = len(texttag.contents)
                self.append_page(soup2,texttag,newpos,nexturl)
                texttag.extract()
                pager.extract()
                appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3, '')
        pager = soup.find('li', attrs={'class':'pager-current first'})
        if pager:
            pager.extract()
        return self.adeify_images(soup)

HTML from one of the articles:

Code:

<div class="item-list"><ul class="pager" data-pagenum="1">
<li class="pager-current first">1</li>
<li class="pager-item"><a href="/career/promote-art-online-31514434/page-2" rel="next" title="Go to page 2">2</a></li>
<li class="pager-next"><a href="/career/promote-art-online-31514434/page-2" title="Go to next page" rel="next">next ›</a></li>
<li class="pager-last last"><a href="/career/promote-art-online-31514434/page-2" title="Go to last page" rel="prev">last »</a></li>
</ul></div>

Last edited by Camper65; 03-15-2015 at 06:22 PM. Reason: needed to add about recursion
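
To make the pager structure concrete, here is a small standalone sketch that pulls the "next" href out of the pager HTML quoted in the first post, using only Python's standard html.parser. The class name NextPageFinder is made up for illustration; inside a calibre recipe the tag passed to is_link_wanted would be inspected instead, so treat this only as a way to check which link sits inside li class="pager-next".

```python
from html.parser import HTMLParser


class NextPageFinder(HTMLParser):
    """Collect hrefs of <a> tags nested inside <li class="pager-next">."""

    def __init__(self):
        super().__init__()
        self.in_pager_next = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li':
            # An <li> may carry several classes, e.g. "pager-last last".
            classes = (attrs.get('class') or '').split()
            self.in_pager_next = 'pager-next' in classes
        elif tag == 'a' and self.in_pager_next and attrs.get('href'):
            self.hrefs.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'li':
            self.in_pager_next = False


# Pager markup copied from the article HTML posted in this thread.
PAGER_HTML = '''
<div class="item-list"><ul class="pager" data-pagenum="1">
<li class="pager-current first">1</li>
<li class="pager-item"><a href="/career/promote-art-online-31514434/page-2" rel="next" title="Go to page 2">2</a></li>
<li class="pager-next"><a href="/career/promote-art-online-31514434/page-2" title="Go to next page" rel="next">next ›</a></li>
<li class="pager-last last"><a href="/career/promote-art-online-31514434/page-2" title="Go to last page" rel="prev">last »</a></li>
</ul></div>
'''

finder = NextPageFinder()
finder.feed(PAGER_HTML)
print(finder.hrefs)  # ['/career/promote-art-online-31514434/page-2']
```

Since the post mentions two div class="item-list" blocks on the real page, the same href may well be collected twice there.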
