Old 03-15-2015, 05:17 PM   #1
Camper65
Enthusiast
multi-page coding for Creative Blog

I'm updating the Creative Blog recipe because I just found out that a few articles continue onto a second page, and the recipe currently doesn't pull those second pages. This is the modified recipe, which doesn't work yet.

Spoiler:


__license__ = 'GPL v3'
__copyright__ = '2014, Bonni Salles - post in forum for help'
'''
Creative Blog (formerly .net magazine)
'''

from calibre.web.feeds.news import BasicNewsRecipe

class creativeblog(BasicNewsRecipe):
    title = u'Creative Blog (formerly .Net magazine)'
    __author__ = 'Bonni Salles'
    oldest_article = 7
    publication_type = 'blog'
    max_articles_per_feed = 100
    description = 'Web Design and Tutorials from Creative Blog (part of .Net Magazine and others)'
    publisher = 'Creative Blog'
    category = 'internet, web design'
    language = 'en'
    encoding = 'utf-8'
    ignore_duplicate_articles = {'title', 'url'}
    remove_empty_feeds = True
    auto_cleanup = True
    # Presently this downloads the whole group of blogs from the main feed. To limit it to
    # specific sections, comment out the main feed and uncomment the section feeds you want.

    feeds = [
        (u'Creative Blog', u'http://www.creativebloq.com/feed/'),
        # (u'3D', u'http://www.creativebloq.com/feed/3d'),
        # (u'Adobe', u'http://www.creativebloq.com/feed/adobe'),
        # (u'Animation', u'http://www.creativebloq.com/feed/animation'),
        # (u'Apple', u'http://www.creativebloq.com/feed/apple'),
        # (u'Branding', u'http://www.creativebloq.com/feed/branding'),
        # (u'Graphic Design', u'http://www.creativebloq.com/feed/graphic-design'),
        # (u'Illustration', u'http://www.creativebloq.com/feed/illustration'),
        # (u'News', u'http://www.creativebloq.com/feed/news'),
        # (u'Opinion', u'http://www.creativebloq.com/feed/opinion'),
        # (u'Tutorials', u'http://www.creativebloq.com/feed/tutorial'),
        # (u'Typography', u'http://www.creativebloq.com/feed/typography'),
        # (u'Video', u'http://www.creativebloq.com/feed/video'),
        # (u'web design', u'http://www.creativebloq.com/feed/web-design'),
    ]

    def append_page(self, soup, appendtag, position, surl):
        pager = soup.find('li', attrs={'class':'pager-current first'})
        if pager:
            nextpages = soup.findAll('li', attrs={'class':'pager-next'})
            nextpage = nextpages[1]
            if nextpage and (nextpage['href'] != surl):
                nexturl = nextpage['href']
                soup2 = self.index_to_soup(nexturl)
                texttag = soup2.find('li', attrs={'class':'pager-next'})
                for it in texttag.findAll(style=True):
                    del it['style']
                newpos = len(texttag.contents)
                self.append_page(soup2,texttag,newpos,nexturl)
                texttag.extract()
                pager.extract()
                appendtag.insert(position,texttag)


    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3, '')
        pager = soup.find('li', attrs={'class':'pager-current first'})
        if pager:
            pager.extract()
        return self.adeify_images(soup)


This is the HTML from one of the articles, showing how it links to the second page. One thing to note is that there are two spots on the page that have div class="item-list" as their lead.

Spoiler:

<div class="item-list"><ul class="pager" data-pagenum="1"><li class="pager-current first">1</li>
<li class="pager-item"><a href="/career/promote-art-online-31514434/page-2" rel="next" title="Go to page 2">2</a></li>
<li class="pager-next"><a href="/career/promote-art-online-31514434/page-2" title="Go to next page" rel="next">next ›</a></li>
<li class="pager-last last"><a href="/career/promote-art-online-31514434/page-2" title="Go to last page" rel="prev">last »</a></li>
</ul></div>
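
For reference, in the snippet above the href lives on the <a> inside <li class="pager-next">, not on the <li> itself, and it is a relative URL. Below is a minimal sketch that pulls that link out using only the Python 3 standard library rather than calibre's soup API. The NextPageFinder class, the find_next_page function and the base URL are made-up names for illustration, and PAGER_HTML is a trimmed copy of the snippet.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Trimmed copy of the pager markup quoted in the post.
PAGER_HTML = '''<div class="item-list"><ul class="pager" data-pagenum="1"><li class="pager-current first">1</li>
<li class="pager-item"><a href="/career/promote-art-online-31514434/page-2" rel="next" title="Go to page 2">2</a></li>
<li class="pager-next"><a href="/career/promote-art-online-31514434/page-2" title="Go to next page" rel="next">next</a></li>
</ul></div>'''


class NextPageFinder(HTMLParser):
    """Hypothetical helper: remembers the href of the first <a> inside <li class="pager-next">."""

    def __init__(self):
        super().__init__()
        self.in_pager_next = False
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li':
            # The class attribute may hold several names, so split it first.
            classes = (attrs.get('class') or '').split()
            self.in_pager_next = 'pager-next' in classes
        elif tag == 'a' and self.in_pager_next and self.next_href is None:
            self.next_href = attrs.get('href')

    def handle_endtag(self, tag):
        if tag == 'li':
            self.in_pager_next = False


def find_next_page(html, base_url):
    """Return the absolute URL of the next page, or None if there is no pager."""
    finder = NextPageFinder()
    finder.feed(html)
    if finder.next_href is None:
        return None
    # The site uses relative hrefs, so resolve against the article URL.
    return urljoin(base_url, finder.next_href)
```

Calling find_next_page(PAGER_HTML, article_url) with the article's own URL as the base gives an absolute link that could be fetched; a page without the pager gives None.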


I think I have the type of code that's needed, but if anyone has an easier way to pull the second page for the few articles that have one, please let me know. Note that recursion doesn't work: I already tried it, and it pulls in a lot more links and creates a very big EPUB.

Last edited by Camper65; 03-15-2015 at 05:22 PM. Reason: needed to add about recursion
Old 03-15-2015, 11:43 PM   #2
kovidgoyal
creator of calibre
Don't use append_page(). Set recursions = 1 and use is_link_wanted() instead.
Old 03-16-2015, 10:14 PM   #3
Camper65
Enthusiast
Kovid, I modified it to use recursions and is_link_wanted, but I can't get is_link_wanted right. Can you help with what I should use to find the URL that corresponds to li class=pager-next, or the href that corresponds to the "next" entry, so that I can get the second page (or more) properly? I am posting the recipe again so you can see the updates.

Spoiler:
Code:
__license__ = 'GPL v3'
__copyright__ = '2014, Bonni Salles - post in forum for help'
'''
Creative Blog (formerly .net magazine)
'''

from calibre.web.feeds.news import BasicNewsRecipe

class creativeblog(BasicNewsRecipe):
    title          = u'Creative Blog (formerly .Net magazine)'
    __author__     = 'Bonni Salles'
    oldest_article = 7
    publication_type = 'blog'
    max_articles_per_feed = 100
    description    = 'Web Design and Tutorials from Creative Blog (part of .Net Magazine and others)'
    publisher      = 'Creative Blog'
    category       = 'internet, web design'
    language       = 'en'
    encoding      = 'utf-8'
    ignore_duplicate_articles = {'title', 'url'}
    remove_empty_feeds    = True
    auto_cleanup = True
    
    recursions = 1
    
    
    # Presently this downloads the whole group of blogs from the main feed. To limit it to
    # specific sections, comment out the main feed and uncomment the section feeds you want.

    feeds          = [
                      (u'Creative Blog', u'http://www.creativebloq.com/feed/'),
#                      (u'3D', u'http://www.creativebloq.com/feed/3d'),
#                      (u'Adobe', u'http://www.creativebloq.com/feed/adobe'),
#                      (u'Animation', u'http://www.creativebloq.com/feed/animation'),
#                      (u'Apple', u'http://www.creativebloq.com/feed/apple'),
#                      (u'Branding', u'http://www.creativebloq.com/feed/branding'),
#                      (u'Graphic Design', u'http://www.creativebloq.com/feed/graphic-design'),
#                      (u'Illustration', u'http://www.creativebloq.com/feed/illustration'),
#                      (u'News', u'http://www.creativebloq.com/feed/news'),
#                      (u'Opinion', u'http://www.creativebloq.com/feed/opinion'),
#                      (u'Tutorials', u'http://www.creativebloq.com/feed/tutorial'),
#                      (u'Typography', u'http://www.creativebloq.com/feed/typography'),
#                      (u'Video', u'http://www.creativebloq.com/feed/video'),
#                      (u'web design', u'http://www.creativebloq.com/feed/web-design'),
                     ]
                     
    def is_link_wanted(self, url, tag):
        ans = re.match(self, url, '<a href="*/page-*">') is not None
        if ans:
            self.log('Following multipage link: %s'%url)
        return ans
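
For reference, two things stop the re.match call above from working: re.match takes the pattern first and the string second (there is no self argument), and * in a regular expression is a repetition operator, not a shell-style wildcard. Here is a minimal sketch of a regex version and a glob version of the same test. It assumes continuation pages always end in /page-N, as in the pager HTML from the first post; the helper names are made up for illustration.

```python
import re
from fnmatch import fnmatchcase

# Assumption: continuation pages end in /page-<number>, optionally with a trailing slash.
PAGE_RE = re.compile(r'/page-\d+/?$')


def is_next_page_regex(url):
    # re.search scans the whole URL; re.match would only test from the start.
    return PAGE_RE.search(url) is not None


def is_next_page_glob(url):
    # fnmatchcase uses shell-style wildcards, which is what '*/page-*' was reaching for.
    return fnmatchcase(url, '*/page-*')
```

Either helper takes a bare URL string, so it could be called from is_link_wanted(self, url, tag) with the url argument. The regex version is stricter because it requires digits after page-.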

Last edited by PeterT; 03-16-2015 at 10:33 PM. Reason: Wrapped code in [code] .. [/code] block
Old 03-16-2015, 11:52 PM   #4
kovidgoyal
creator of calibre
The url is simply the contents of the href attribute, so your regex needs to just match that. Or if you want to use the tag, then use something like

if tag.findParent('li', attrs={'class':'pager-next'}) is not None:
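
A minimal sketch of the tag-based check, runnable without calibre. FakeTag is a hypothetical stand-in that mimics only the findParent call used here. In a real recipe the tag argument would be the soup tag passed to is_link_wanted, and the method could return is_pager_next_link(tag) with recursions = 1 set on the class. The function name is_pager_next_link is also made up for illustration.

```python
class FakeTag(object):
    """Hypothetical stand-in for a soup tag, just enough to exercise findParent."""

    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = attrs or {}
        self.parent = parent

    def findParent(self, name, attrs=None):
        # Walk up the ancestors and return the first one matching name and attrs.
        wanted = attrs or {}
        node = self.parent
        while node is not None:
            if node.name == name and all(node.attrs.get(k) == v for k, v in wanted.items()):
                return node
            node = node.parent
        return None


def is_pager_next_link(tag):
    """True if the link tag sits inside <li class="pager-next">."""
    return tag.findParent('li', attrs={'class': 'pager-next'}) is not None


# Demo mirroring the pager HTML from the first post: both links point at page-2,
# but only one of them is inside the pager-next list item.
next_li = FakeTag('li', {'class': 'pager-next'})
next_a = FakeTag('a', {'href': '/career/promote-art-online-31514434/page-2'}, parent=next_li)
item_li = FakeTag('li', {'class': 'pager-item'})
item_a = FakeTag('a', {'href': '/career/promote-art-online-31514434/page-2'}, parent=item_li)
```

Checking the parent tag rather than the URL means the numbered "2" link and the "next" link, which share the same href, are told apart by where they sit in the pager.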