11-29-2010, 03:00 PM   #1
c0llin (Junior Member)
Posts: 1 | Join Date: Nov 2010 | Device: Kindle 2gw
Reason Magazine request

Hello,
I would like to replicate the print edition of Reason for my Kindle 2.

The RSS feed http://feeds.feedburner.com/reason/Articles updates daily with new articles, and http://reason.com/rss gives a large list of feeds by staff writer, topic, and so on.

Can I recreate the print edition from the RSS feeds, or from http://reason.com/issues/december-2010, which lays out the whole issue as links?

Thanks
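
For reference, the RSS route would look something like this minimal sketch (the class name and settings here are placeholders, only a starting point); replicating the print edition's layout instead needs a parse_index() recipe like the ones later in this thread.

Code:
from calibre.web.feeds.news import BasicNewsRecipe


class ReasonFeeds(BasicNewsRecipe):
    title = 'Reason Magazine'
    language = 'en'
    oldest_article = 31  # look back roughly one issue's worth of days
    max_articles_per_feed = 100
    no_stylesheets = True
    # one (title, url) tuple per feed
    feeds = [
        ('Articles', 'http://feeds.feedburner.com/reason/Articles'),
    ]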
03-25-2022, 03:29 PM   #2
howardgo (Member)
Posts: 11 | Join Date: Apr 2020 | Device: Kobo Aura One

I have created a recipe that works when pointed at an archived edition of the magazine, but when I set needs_subscription to True, it does not download any of the articles. The current issue is May 2022 and is only available to subscribers.

I have tested my login on the website itself, so I know the credentials are good, and the recipe does not report a login error. However, none of the article links become active. It feels as if my login works, but the recipe starts downloading before the links are activated.

Does anyone have tips for troubleshooting subscription recipes? Testing only seems to work without a subscription, so I can only test against archived issues, which do work. I have included my recipe below; I started from the recipe for The Atlantic. Thanks for any help anyone can provide.
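
A typical command-line test run for a recipe that needs a login (assuming the recipe is saved as reason.recipe; --test limits the fetch to a couple of articles per feed) looks like:

Code:
ebook-convert reason.recipe output.epub --test -vv --username USERNAME --password PASSWORD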

Code:
#!/usr/bin/env python
# vim:fileencoding=utf-8
# License: GPLv3 Copyright: 2015, Kovid Goyal <kovid at kovidgoyal.net>
from __future__ import unicode_literals
import json
from xml.sax.saxutils import escape, quoteattr

from calibre.web.feeds.news import BasicNewsRecipe

# {{{ parse article JSON
def process_image_block(lines, block):
    caption = block.get('captionText')
    caption_lines = []
    if caption:
        if block.get('attributionText', '').strip():
            caption += ' (' + block['attributionText'] + ')'
        caption_lines.append('<p style="font-style: italic">' + caption + '</p>')
    lines.append('<div style="text-align: center"><img src={}/>'.format(quoteattr(block['url'])))
    lines.extend(caption_lines)
    lines.append('</div>')


def json_to_html(raw):
    data = json.loads(raw)
    # open('/t/p.json', 'w').write(json.dumps(data, indent=2))
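    # urqlState contains several JSON blobs; the longest one holds the article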
    data = sorted((v['data'] for v in data['props']['pageProps']['urqlState'].values()), key=len)[-1]
    article = json.loads(data)['article']
    lines = []
    lines.append('<h1 style="align: center">' + escape(article['title']) + '</h1>')
    lines.append('<h2 style="align: center">' + escape(article['dek']) + '</h2>')
    auts = ', '.join(x['displayName'] for x in article['authors'])
    if auts:
        lines.append('<p style="align: center">by ' + escape(auts) + '</p>')
    if article.get('leadArt') and 'image' in article['leadArt']:
        process_image_block(lines, article['leadArt']['image'])
    for item in article['content']:
        tn = item.get('__typename', '')
        if tn.endswith('Image'):
            process_image_block(lines, item)
            continue
        html = item.get('innerHtml')
        if html is None or '</iframe>' in html:
            continue
        tagname = item.get('tagName', 'P').lower()
        lines.append('<{0}>{1}</{0}>'.format(tagname, html))
    return '<html><body><div id="from-json-by-calibre">' + '\n'.join(lines) + '</div></body></html>'


class NoJSON(ValueError):
    pass


def extract_html(soup):
    script = soup.findAll('script', id='__NEXT_DATA__')
    if not script:
        raise NoJSON('No script tag with JSON data found')
    raw = script[0].contents[0]
    return json_to_html(raw)

# }}}


def classes(classes):
    q = frozenset(classes.split(' '))
    return dict(
        attrs={'class': lambda x: x and frozenset(x.split()).intersection(q)}
    )


def prefix_classes(classes):
    q = classes.split()

    def test(x):
        if x:
            for cls in x.split():
                for c in q:
                    if cls.startswith(c):
                        return True
        return False
    return dict(attrs={'class': test})


class Reason(BasicNewsRecipe):

    title = 'Reason'
    description = 'Free minds and free markets'
    INDEX = 'https://reason.com/magazine/'
    #INDEX = 'https://reason.com/issue/april-2022/'
    __author__ = 'Howard Cornett'
    language = 'en'
    encoding = 'utf-8'
    needs_subscription = True

    remove_tags = [
        classes(
            'next-post-link the-tags tag rcom-social tools comments-header-show logo-header navbar navbar-expanded-lg primary content-info sidebar magicSidebar advertisement logo entry-subtitle'
        ),
    ]

    no_stylesheets = True
    remove_attributes = ['style']
    extra_css = '''
        .credit { text-align: right; font-size: 75%; display: block }
        .figcaption { font-size: 75% }
        .caption { font-size: 75% }
        .lead-img { display: block }
        p.dropcap:first-letter {
        float: left; text-transform: uppercase; font-weight: bold; font-size: 5.55em; line-height: 0.83;
        margin: 0; padding-right: 7px; margin-bottom: -2px; text-align: center;
        }
    '''

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
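        # note: no login is performed here; needs_subscription alone does not log in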
        return br

    def preprocess_raw_html(self, raw_html, url):
        try:
            return extract_html(self.index_to_soup(raw_html))
        except NoJSON:
            self.log.warn('No JSON found in: {} falling back to HTML'.format(url))
        except Exception:
            self.log.exception('Failed to extract JSON data from: {} falling back to HTML'.format(url))
        return raw_html

    def preprocess_html(self, soup):
        for img in soup.findAll('img', attrs={'data-lazy-src': True}):
            # img['src'] = img['data-lazy-src'].split()[0]
            data_lazy_src = img['data-lazy-src']
            if ',' in data_lazy_src:
                img['src'] = data_lazy_src.split(',')[0]
            else:
                img['src'] = data_lazy_src.split()[0]
        return soup

    #def print_version(self, url):
        #ans = url.partition('?')[0] + '?single_page=true'
        #if '/video/' in ans:
            #ans = None
        #return ans

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        cover = soup.find('img', title=lambda value: value and value.startswith('Reason Magazine,'))
        if cover is not None:
            self.cover_url = cover['data-lazy-src']
        current_section, current_articles = 'Cover Story', []
        feeds = []
        for div in soup.findAll('div', attrs={'class': lambda x: x and set(x.split()).intersection({'issue-header-right', 'toc-category-list'})}):
            #print('The value of div is ', div)
            for h3 in div.findAll('h3', attrs={'class': True}):
                cls = h3['class']
                if hasattr(cls, 'split'):
                    cls = cls.split()
                if 'toc-department' in cls:
                    if current_articles:
                        feeds.append((current_section, current_articles))
                    current_articles = []
                    current_section = self.tag_to_string(h3)
                    self.log('\nFound section:', current_section)
                    title = h3.find_next_sibling().a.text
                    url = h3.find_next_sibling().a['href']
                    desc = h3.find_next_sibling().p.text
                    current_articles.append({
                        'title': title,
                        'url': url,
                        'description': desc
                    })
            for h2 in div.findAll('h2', attrs={'class': True}):
                cls = h2['class']
                if hasattr(cls, 'split'):
                    cls = cls.split()
                if 'toc-department' in cls:
                    if current_articles:
                        feeds.append((current_section, current_articles))
                    current_articles = []
                    current_section = self.tag_to_string(h2)
                    self.log('\nFound section:', current_section)
                for article in div.findAll('article', attrs={'class': True}):
                    h4 = article.find('h4')
                    if h4.a is not None:
                        title = h4.a.text
                        url = h4.a['href']
                    else:
                        title = ''
                        url = ''
                    desc = h4.find_next_sibling().text
                    current_articles.append({
                        'title': title,
                        'url': url,
                        'description': desc
                    })

        if current_articles:
            feeds.append((current_section, current_articles))
        return feeds


if __name__ == '__main__':
    import sys

    from calibre.ebooks.BeautifulSoup import BeautifulSoup
    print(extract_html(BeautifulSoup(open(sys.argv[-1]).read())))


calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
03-26-2022, 10:26 PM   #3
kovidgoyal (creator of calibre)
Posts: 43,858 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various

You need to actually log in, in get_browser(); just adding needs_subscription is not enough.
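
The general pattern is a sketch like the following; the login URL and form field names are placeholders that vary from site to site, so inspect the site's login page for the real ones.

Code:
from calibre.web.feeds.news import BasicNewsRecipe


class Example(BasicNewsRecipe):
    title = 'Example'
    needs_subscription = True  # makes calibre prompt for username/password

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            # placeholder URL and field names: adjust to the site's login form
            br.open('https://example.com/login')
            br.select_form(nr=0)  # first form on the page
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br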
03-28-2022, 10:43 AM   #4
howardgo (Member)
Posts: 11 | Join Date: Apr 2020 | Device: Kobo Aura One

Success!

Thank you, Kovid! I totally missed that detail. I added the login to get_browser() in my recipe and now it works! Here is the successful recipe.

Code:
#!/usr/bin/env python
# vim:fileencoding=utf-8
# License: GPLv3 Copyright: 2015, Kovid Goyal <kovid at kovidgoyal.net>
from __future__ import unicode_literals
import json
from xml.sax.saxutils import escape, quoteattr

from calibre.web.feeds.news import BasicNewsRecipe

# {{{ parse article JSON
def process_image_block(lines, block):
    caption = block.get('captionText')
    caption_lines = []
    if caption:
        if block.get('attributionText', '').strip():
            caption += ' (' + block['attributionText'] + ')'
        caption_lines.append('<p style="font-style: italic">' + caption + '</p>')
    lines.append('<div style="text-align: center"><img src={}/>'.format(quoteattr(block['url'])))
    lines.extend(caption_lines)
    lines.append('</div>')


def json_to_html(raw):
    data = json.loads(raw)
    # open('/t/p.json', 'w').write(json.dumps(data, indent=2))
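    # urqlState contains several JSON blobs; the longest one holds the article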
    data = sorted((v['data'] for v in data['props']['pageProps']['urqlState'].values()), key=len)[-1]
    article = json.loads(data)['article']
    lines = []
    lines.append('<h1 style="align: center">' + escape(article['title']) + '</h1>')
    lines.append('<h2 style="align: center">' + escape(article['dek']) + '</h2>')
    auts = ', '.join(x['displayName'] for x in article['authors'])
    if auts:
        lines.append('<p style="align: center">by ' + escape(auts) + '</p>')
    if article.get('leadArt') and 'image' in article['leadArt']:
        process_image_block(lines, article['leadArt']['image'])
    for item in article['content']:
        tn = item.get('__typename', '')
        if tn.endswith('Image'):
            process_image_block(lines, item)
            continue
        html = item.get('innerHtml')
        if html is None or '</iframe>' in html:
            continue
        tagname = item.get('tagName', 'P').lower()
        lines.append('<{0}>{1}</{0}>'.format(tagname, html))
    return '<html><body><div id="from-json-by-calibre">' + '\n'.join(lines) + '</div></body></html>'


class NoJSON(ValueError):
    pass


def extract_html(soup):
    script = soup.findAll('script', id='__NEXT_DATA__')
    if not script:
        raise NoJSON('No script tag with JSON data found')
    raw = script[0].contents[0]
    return json_to_html(raw)

# }}}


def classes(classes):
    q = frozenset(classes.split(' '))
    return dict(
        attrs={'class': lambda x: x and frozenset(x.split()).intersection(q)}
    )


def prefix_classes(classes):
    q = classes.split()

    def test(x):
        if x:
            for cls in x.split():
                for c in q:
                    if cls.startswith(c):
                        return True
        return False
    return dict(attrs={'class': test})


class Reason(BasicNewsRecipe):

    title = 'Reason'
    description = 'Free minds and free markets'
    INDEX = 'https://reason.com/magazine/'
    #INDEX = 'https://reason.com/issue/april-2022/'
    __author__ = 'Howard Cornett'
    language = 'en'
    encoding = 'utf-8'
    needs_subscription = True

    remove_tags = [
        classes(
            'next-post-link the-tags tag rcom-social tools comments-header-show logo-header navbar navbar-expanded-lg primary content-info sidebar magicSidebar advertisement logo entry-subtitle'
        ),
    ]

    no_stylesheets = True
    remove_attributes = ['style']
    extra_css = '''
        .credit { text-align: right; font-size: 75%; display: block }
        .figcaption { font-size: 75% }
        .caption { font-size: 75% }
        .lead-img { display: block }
        p.dropcap:first-letter {
        float: left; text-transform: uppercase; font-weight: bold; font-size: 5.55em; line-height: 0.83;
        margin: 0; padding-right: 7px; margin-bottom: -2px; text-align: center;
        }
    '''

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
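            # log in through the site's login form before any articles are fetched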
            br.open('https://reason.com/login')
            br.select_form(id='login_form')
            br['text_username'] = self.username
            br['password_password'] = self.password
            br.submit()
        return br

    def preprocess_raw_html(self, raw_html, url):
        try:
            return extract_html(self.index_to_soup(raw_html))
        except NoJSON:
            self.log.warn('No JSON found in: {} falling back to HTML'.format(url))
        except Exception:
            self.log.exception('Failed to extract JSON data from: {} falling back to HTML'.format(url))
        return raw_html

    def preprocess_html(self, soup):
        for img in soup.findAll('img', attrs={'data-lazy-src': True}):
            # img['src'] = img['data-lazy-src'].split()[0]
            data_lazy_src = img['data-lazy-src']
            if ',' in data_lazy_src:
                img['src'] = data_lazy_src.split(',')[0]
            else:
                img['src'] = data_lazy_src.split()[0]
        return soup

    #def print_version(self, url):
        #ans = url.partition('?')[0] + '?single_page=true'
        #if '/video/' in ans:
            #ans = None
        #return ans

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        cover = soup.find('img', title=lambda value: value and value.startswith('Reason Magazine,'))
        if cover is not None:
            self.cover_url = cover['src']
        current_section, current_articles = 'Cover Story', []
        feeds = []
        for div in soup.findAll('div', attrs={'class': lambda x: x and set(x.split()).intersection({'issue-header-right', 'toc-category-list'})}):
            #print('The value of div is ', div)
            for h3 in div.findAll('h3', attrs={'class': True}):
                cls = h3['class']
                if hasattr(cls, 'split'):
                    cls = cls.split()
                if 'toc-department' in cls:
                    if current_articles:
                        feeds.append((current_section, current_articles))
                    current_articles = []
                    current_section = self.tag_to_string(h3)
                    self.log('\nFound section:', current_section)
                    title = h3.find_next_sibling().a.text
                    url = h3.find_next_sibling().a['href']
                    desc = h3.find_next_sibling().p.text
                    current_articles.append({
                        'title': title,
                        'url': url,
                        'description': desc
                    })
            for h2 in div.findAll('h2', attrs={'class': True}):
                cls = h2['class']
                if hasattr(cls, 'split'):
                    cls = cls.split()
                if 'toc-department' in cls:
                    if current_articles:
                        feeds.append((current_section, current_articles))
                    current_articles = []
                    current_section = self.tag_to_string(h2)
                    self.log('\nFound section:', current_section)
                for article in div.findAll('article', attrs={'class': True}):
                    h4 = article.find('h4')
                    if h4.a is not None:
                        title = h4.a.text
                        url = h4.a['href']
                    else:
                        title = ''
                        url = ''
                    desc = h4.find_next_sibling().text
                    current_articles.append({
                        'title': title,
                        'url': url,
                        'description': desc
                    })

        if current_articles:
            feeds.append((current_section, current_articles))
        return feeds


if __name__ == '__main__':
    import sys

    from calibre.ebooks.BeautifulSoup import BeautifulSoup
    print(extract_html(BeautifulSoup(open(sys.argv[-1]).read())))


calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
03-28-2022, 01:04 PM   #5
kovidgoyal (creator of calibre)
Posts: 43,858 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various

https://github.com/kovidgoyal/calibr...5ae5466aa576f8