MobileRead Forums > E-Book Software > Calibre > Recipes
04-17-2016, 06:00 AM   #1
Aimylios
Member
Posts: 17 | Karma: 10 | Join Date: Apr 2016 | Device: Tolino Vision 3HD
Berlin Policy Journal recipe

Hi,

I have written a recipe for the Berlin Policy Journal.

Quote:
BERLIN POLICY JOURNAL is a bimonthly digital magazine on international affairs, edited in Germany’s capital. Each issue sets a theme and furthers debates, in addition to reporting on a diverse range of current and emerging foreign policy topics. We offer in-depth analysis and thought-provoking insights from leading thinkers and commentators, including extensive profiles of decision-makers from Europe and beyond.

http://berlinpolicyjournal.com/
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals

__license__   = 'GPL v3'
__copyright__ = '2016, Aimylios'

'''
berlinpolicyjournal.com
'''

import re, time
from calibre.web.feeds.news import BasicNewsRecipe

class BerlinPolicyJournal(BasicNewsRecipe):
    title            = 'Berlin Policy Journal'
    __author__       = 'Aimylios'
    description      = 'Articles from berlinpolicyjournal.com'
    publisher        = 'Deutsche Gesellschaft für Auswärtige Politik e.V.'
    publication_type = 'magazine'
    language         = 'en_DE'

    oldest_article         = 75
    max_articles_per_feed  = 30
    simultaneous_downloads = 10
    no_stylesheets         = True
    remove_javascript      = True

    conversion_options = {'smarten_punctuation' : True,
                          'publisher'           : publisher}

    # uncomment this to reduce file size
    #compress_news_images          = True
    #compress_news_images_max_size = 16

    INDEX        = 'http://berlinpolicyjournal.com/'
    FRONTPAGE    = INDEX + 'page/'
    cover_source = INDEX
    masthead_url = INDEX + 'IP/wp-content/uploads/2015/04/logo_bpj_header.gif'

    keep_only_tags = [dict(name='article')]

    remove_tags = [
        dict(name='div', attrs={'class':['hidden', 'meta-count', 'meta-share']}),
        dict(name='span', attrs={'class':'ava-auth'}),
        dict(name='img', attrs={'alt':re.compile("_store_120px_width$")}),
        dict(name='img', attrs={'alt':re.compile("^bpj_app_")}),
        dict(name='img', attrs={'alt':re.compile("^BPJ-Montage_")}),
        dict(name='footer'),
        dict(name='br')
    ]

    remove_attributes = ['sizes', 'width', 'height', 'align']

    extra_css = 'h1 {font-size: 1.6em; text-align: left} \
                 .entry-subtitle {font-style: italic; margin-bottom: 1em} \
                 .wp-caption-text {font-size: 0.6em; margin-top: 0em}'

    def get_cover_url(self):
        soup = self.index_to_soup(self.cover_source)
        img_div = soup.find('div', id='text-2')
        self.cover_url = img_div.find('img', src=True)['src']
        return self.cover_url

    def parse_index(self):
        articles = {}
        for i in range(1,5):
            soup = self.index_to_soup(self.FRONTPAGE + str(i))
            for div in soup.findAll('div', attrs={'class':'post-box-big'}):
                timestamp = time.strptime(div.find('time')['datetime'], '%Y-%m-%dT%H:%M:%S+00:00')
                article_age = time.time() - time.mktime(timestamp)
                if article_age <= self.oldest_article*24*3600:
                    category = self.tag_to_string(div.findAll('a', attrs={'rel':'category'})[-1])
                    if category not in articles:
                        articles[category] = []
                    article_title = self.tag_to_string(div.find('h3', attrs={'class':'entry-title'}).a)
                    article_url   = div.find('h3', attrs={'class':'entry-title'}).a['href']
                    article_date  = unicode(time.strftime(' [%a, %d %b %H:%M]', timestamp))
                    article_desc  = self.tag_to_string(div.find('div', attrs={'class':'i-summary'}).p)
                    articles[category].append({'title':article_title,
                                               'url':article_url,
                                               'date':article_date,
                                               'description':article_desc})
        feeds = []
        for feed in articles:
            if '/' in feed:
                feeds.insert(0, (feed, articles[feed]))
            else:
                feeds.append((feed, articles[feed]))
        return feeds

    def postprocess_html(self, soup, first_fetch):
        # clean up formatting of author(s) and date
        div = soup.find('div', {'class':'meta-info'})
        authors = ''
        for entry in div.findAll('span', {'class':'entry-author'}):
            authors = authors + entry.a.span.renderContents().strip() + ', '
        date = div.find('time').renderContents().strip()
        div.replaceWith('<div>' + authors[:-2] + ' (' + date + ')<br/></div>')
        return soup
04-17-2016, 08:47 AM   #2
kovidgoyal
creator of calibre
Posts: 45,251 | Karma: 27110894 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
I have added your recipe. One small tip: you can avoid loading the index page twice by simply detecting the cover image URL in parse_index() and setting self.cover_url there, instead of fetching the page again in get_cover_url().
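The pattern behind this tip can be sketched as follows. This is a minimal stand-in, not the real calibre classes: `index_to_soup()` is stubbed here to return a pre-parsed value, whereas in a real recipe `BasicNewsRecipe` provides it and the download pipeline reads `self.cover_url` after `parse_index()` returns.

```python
class RecipeSketch:
    # Stand-in for a calibre BasicNewsRecipe subclass, for illustration only.
    INDEX = 'http://berlinpolicyjournal.com/'
    cover_url = None

    def index_to_soup(self, url):
        # Stub for calibre's page fetcher/parser; pretend we already
        # extracted the cover <img> src from the downloaded index page.
        return {'cover_img_src': 'http://example.com/cover.jpg'}

    def parse_index(self):
        # The index page is fetched here anyway to build the feed list,
        # so record the cover URL from the same soup instead of
        # downloading the page a second time in get_cover_url().
        soup = self.index_to_soup(self.INDEX + 'page/1')
        self.cover_url = soup['cover_img_src']
        return []  # the list of (feed, articles) tuples would be built here

r = RecipeSketch()
feeds = r.parse_index()
```

After `parse_index()` runs, `r.cover_url` is already set, so no separate `get_cover_url()` override (and no second page download) is needed.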
04-24-2016, 12:12 PM   #3
Aimylios
Member
Posts: 17 | Karma: 10 | Join Date: Apr 2016 | Device: Tolino Vision 3HD
Hi Kovid,

Thanks for the tip! I have modified the code accordingly.

Changelog:
  • removed get_cover_url()
  • reduced simultaneous_downloads to avoid sporadic download problems
  • code cleanup

Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals

__license__   = 'GPL v3'
__copyright__ = '2016, Aimylios'

'''
berlinpolicyjournal.com
'''

import re, time
from calibre.web.feeds.news import BasicNewsRecipe

class BerlinPolicyJournal(BasicNewsRecipe):
    title            = 'Berlin Policy Journal'
    __author__       = 'Aimylios'
    description      = 'Articles from berlinpolicyjournal.com'
    publisher        = 'Deutsche Gesellschaft für Auswärtige Politik e.V.'
    publication_type = 'magazine'
    language         = 'en_DE'

    oldest_article         = 75
    max_articles_per_feed  = 30
    simultaneous_downloads = 5
    no_stylesheets         = True
    remove_javascript      = True

    conversion_options = {'smarten_punctuation' : True,
                          'publisher'           : publisher}

    # uncomment this to reduce file size
    # compress_news_images          = True
    # compress_news_images_max_size = 16

    INDEX        = 'http://berlinpolicyjournal.com/'
    masthead_url = INDEX + 'IP/wp-content/uploads/2015/04/logo_bpj_header.gif'

    keep_only_tags = [dict(name='article')]

    remove_tags = [
        dict(name='div', attrs={'class':['hidden', 'meta-count', 'meta-share']}),
        dict(name='span', attrs={'class':'ava-auth'}),
        dict(name='img', attrs={'alt':re.compile('_store_120px_width$')}),
        dict(name='img', attrs={'alt':re.compile('^bpj_app_')}),
        dict(name='img', attrs={'alt':re.compile('^BPJ-Montage_')}),
        dict(name='footer'),
        dict(name='br')
    ]

    remove_attributes = ['sizes', 'width', 'height', 'align']

    extra_css = 'h1 {font-size: 1.6em; text-align: left} \
                 .entry-subtitle {font-style: italic; margin-bottom: 1em} \
                 .wp-caption-text {font-size: 0.6em; margin-top: 0em}'

    def parse_index(self):
        articles = {}
        for i in range(1,5):
            soup = self.index_to_soup(self.INDEX + 'page/' + str(i))
            if i == 1:
                img_div = soup.find('div', {'id':'text-2'})
                self.cover_url = img_div.find('img', src=True)['src']
            for div in soup.findAll('div', {'class':'post-box-big'}):
                timestamp = time.strptime(div.find('time')['datetime'], '%Y-%m-%dT%H:%M:%S+00:00')
                article_age = time.time() - time.mktime(timestamp)
                if article_age <= self.oldest_article*24*3600:
                    category = self.tag_to_string(div.findAll('a', {'rel':'category'})[-1])
                    if category not in articles:
                        articles[category] = []
                    article_title = self.tag_to_string(div.find('h3', {'class':'entry-title'}).a)
                    article_url   = div.find('h3', {'class':'entry-title'}).a['href']
                    article_date  = unicode(time.strftime(' [%a, %d %b %H:%M]', timestamp))
                    article_desc  = self.tag_to_string(div.find('div', {'class':'i-summary'}).p)
                    articles[category].append({'title':article_title,
                                               'url':article_url,
                                               'date':article_date,
                                               'description':article_desc})
        feeds = []
        for feed in articles:
            if '/' in feed:
                feeds.insert(0, (feed, articles[feed]))
            else:
                feeds.append((feed, articles[feed]))
        return feeds

    def postprocess_html(self, soup, first_fetch):
        # clean up formatting of author(s) and date
        div = soup.find('div', {'class':'meta-info'})
        authors = ''
        for entry in div.findAll('span', {'class':'entry-author'}):
            authors = authors + entry.a.span.renderContents().strip() + ', '
        date = div.find('time').renderContents().strip()
        div.replaceWith('<div>' + authors[:-2] + ' (' + date + ')<br/></div>')
        return soup