12-29-2019, 08:46 AM   #1
Argel
Opinionated [but right]
Posts: 281
Karma: 1412
Join Date: Apr 2008
Location: UK
Device: Cybook Gen3, PRS 505, Kindle Int, Oasis, Paperwhite, Scribe
Updated London Review of Books (subscriber)

OK, here is my amateur reworking of Kovid's latest LRB script.

Changes are:
  • Retrieves a specified archive issue. The Volume and Edition number of the desired issue must be entered into the script by hand [in 2-digit format]. Getting back issues was the main aim of the changes.
  • Volume and edition are included in the title for filing purposes.
  • High-resolution cover retrieved for archived editions, rather than the low-res thumbnail from the archive edition's front page.
  • Annoying address for letters removed from the end of every article.
  • Missing author information link re-added to end of articles.
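
For anyone who would rather not pad the numbers by hand, the archive address could be built by a small helper along these lines. This is only a sketch: build_archive_url and BASE are made-up names that are not in the recipe, and it assumes the site keeps the /the-paper/vNN/nNN pattern used in the script below.

```python
# Hypothetical helper, not part of the recipe: builds the archive URL and
# zero-pads the volume and edition to the 2-digit format the script expects.
BASE = 'https://www.lrb.co.uk/the-paper'


def build_archive_url(volume, edition):
    # str() lets callers pass 41 or '41'; zfill(2) turns '5' into '05'
    volume_number = str(volume).zfill(2)
    edition_number = str(edition).zfill(2)
    return BASE + '/v' + volume_number + '/n' + edition_number


print(build_archive_url(41, 22))
```

Calling build_archive_url(41, 22) yields the same address the script assembles from volume_number = '41' and edition_number = '22'.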

I've had the temerity to add my name to the authors, purely because if anything goes pear-shaped it will undoubtedly be something I've changed and you'll know who to blame.

Desirable changes might include reformatting the article titles in a sans-serif font, but how to do that is a mystery to me.
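
One possible route, untested against the live site, is calibre's extra_css recipe attribute, which adds CSS to the downloaded articles. The selector below is a guess: it assumes the title sits inside the element carrying the 'article-header--title' class that the recipe already keeps. TITLE_CSS is just an illustrative name.

```python
# Assumption: the article title lives in (or under) an element with the
# 'article-header--title' class; adjust the selector if the markup differs.
TITLE_CSS = '''
.article-header--title,
.article-header--title h1 { font-family: sans-serif; }
'''

# Inside the recipe class body one would then add the line:
#     extra_css = TITLE_CSS
print(TITLE_CSS)
```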

No warranty as to suitability is offered!

Argel

Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
# License: GPLv3 Copyright: 2019, Kovid Goyal <kovid at kovidgoyal.net>
from calibre.web.feeds.news import BasicNewsRecipe


# Insert correct volume and edition number here
volume_number = '41'
edition_number = '22'
archive_url = 'https://www.lrb.co.uk/the-paper/v' + volume_number + '/n' + edition_number

def classes(classes):
    q = frozenset(classes.split(' '))
    return dict(attrs={
        'class': lambda x: x and frozenset(x.split()).intersection(q)})


def absolutize(href):
    if href.startswith('/'):
        href = 'https://www.lrb.co.uk' + href
    return href


class LondonReviewOfBooksPayed(BasicNewsRecipe):
    title = 'London Review of Books, Volume ' + volume_number + ', Number ' + edition_number
    __author__ = 'Kovid Goyal, David Lawrence'
    description = 'Literary review publishing essay-length book reviews and topical articles on politics, literature, history, philosophy, science and the arts by leading writers and thinkers'  # noqa
    category = 'news, literature, UK'
    publisher = 'LRB Ltd.'
    language = 'en_GB'
    no_stylesheets = True
    delay = 1
    encoding = 'utf-8'
    INDEX = 'https://www.lrb.co.uk'
    publication_type = 'magazine'
    needs_subscription = True
    requires_version = (3, 0, 0)

    keep_only_tags = [
        classes('article-header--title paperArticle-reviewsHeader article-content article-letters-inner contributor-pane'),
    ]
 
    remove_tags = [
        classes('social-button article-mask lrb-readmorelink article-send-letter article-share'),
    ]
 
    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username and self.password:
            br.open('https://www.lrb.co.uk/login')
            br.select_form(id='login_form')
            br['_username'] = self.username
            br['_password'] = self.password
            raw = br.submit().read()
            if b'>My Account<' not in raw:
                raise ValueError('Failed to log in, check username and password')
        return br

    def parse_index(self):
        # Fetch the contents page of the requested back issue
        soup = self.index_to_soup(archive_url)
        container = soup.find(attrs={'class': 'lrb-content-container'})
        img = container.find('img')
        # Take the URL of the last srcset candidate as the cover image
        self.cover_url = img['data-srcset'].split()[-2]
        h3 = container.find('h3')
        self.timefmt = ' [{}]'.format(self.tag_to_string(h3))
        # The table of contents is on the same page, so reuse the parsed soup
        grid = soup.find(attrs={'class': 'toc-grid-items'})
        articles = []
        for a in grid.findAll(**classes('toc-item')):
            url = absolutize(a['href'])
            h3 = a.find('h3')
            h4 = a.find('h4')
            title = '{}: {}'.format(self.tag_to_string(h3), self.tag_to_string(h4))
            self.log(title, url)
            articles.append({'title': title, 'url': url})

        return [('Articles', articles)]
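
A note on the cover line in parse_index: split()[-2] takes the URL of the last entry in the image's data-srcset. The sketch below shows the idea on a made-up srcset string. The example.com addresses and the name last_srcset_url are invented for illustration, and it assumes the entries run from smallest to largest and that each has a width descriptor after its URL.

```python
def last_srcset_url(srcset):
    # A srcset is a comma-separated list of 'URL descriptor' pairs.
    # Splitting on whitespace leaves the final descriptor as the last
    # token, so the token before it is the URL of the last candidate.
    tokens = srcset.split()
    return tokens[-2]


# Made-up sample data, not taken from the LRB site
sample = ('https://example.com/cover-200.jpg 200w, '
          'https://example.com/cover-800.jpg 800w')
print(last_srcset_url(sample))
```

If the site ever lists the largest image first, or drops the descriptor from the last entry, this indexing would pick the wrong token.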

Last edited by Argel; 12-29-2019 at 10:35 AM.