Old 04-23-2019, 08:49 PM   #1
alan smith
Junior Member
alan smith began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Apr 2019
Device: kindle
Help with the Global Times recipe

calibre is so powerful, but I recently found that the Global Times recipe no longer works. Could someone update it? The site address is
http://www.globaltimes.cn/
I'm using calibre version 3.41. Thanks a lot.
Old 04-24-2019, 12:59 AM   #2
lui1
Enthusiast
lui1 began at the beginning.
 
Posts: 36
Karma: 10
Join Date: Dec 2017
Location: Los Angeles, CA
Device: Smart Phone
Global Times Update

It looks like the links were broken and the structure of the markup changed, so the recipe needed to be redone.

Update for Global Times:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re


def check_words(words):
    return lambda x: x and frozenset(words.split()).intersection(x.split())


class GlobalTimes(BasicNewsRecipe):
    title = u'Global Times'
    __author__ = 'Jose Ortiz' # lui1 at mobileread.com
    language = 'en_CN'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    keep_only_tags = [
        { 'class': check_words('article-title article-source row-content') },
    ]

    extra_css = '''
        .article-title {
            font-family:Arial,Helvetica,sans-serif;
            font-weight:bold;font-size:large;
        }

        .article-source, .row-content {
            font-family:Arial,Helvetica,sans-serif;
            font-size:small;
        }
        '''

    def parse_index(self):
        catnames = {}
        catnames["http://www.globaltimes.cn/china/politics/"] = "China Politics"
        catnames["http://www.globaltimes.cn/china/diplomacy/"] = "China Diplomacy"
        catnames["http://www.globaltimes.cn/china/military/"] = "China Military"
        catnames["http://www.globaltimes.cn/business/economy/"] = "China Economy"
        catnames["http://www.globaltimes.cn/world/asia-pacific/"] = "Asia Pacific"
        feeds = []

        for cat in catnames.keys():
            articles = []
            soup = self.index_to_soup(cat)
            for a in soup.findAll('a', attrs={'href': re.compile(r'https?://www\.globaltimes\.cn/content/[0-9]{4,10}[.]shtml')}):
                # Typical url: http://www.globaltimes.cn/content/5555555.shtml
                url = a['href'].strip()
                title = self.tag_to_string(a).strip()
                if not title:
                    continue
                myarticle = ({'title': title,
                              'url': url,
                              'description': '',
                              'date': ''})
                self.log("found '%s'" % title)
                articles.append(myarticle)
                self.log("Adding URL %s\n" % url)
            if articles:
                feeds.append((catnames[cat], articles))
        return feeds
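
For anyone adapting this, here is a minimal standalone sketch of how the href filter in parse_index picks article links out of a section page. It uses a variant of the recipe's pattern with the dots in the host name escaped. The list of hrefs is invented for illustration and is not taken from the site.

```python
import re

# Variant of the recipe's pattern; the dots in the host name are escaped
# so they only match a literal '.'.
ARTICLE_URL = re.compile(
    r'https?://www\.globaltimes\.cn/content/[0-9]{4,10}[.]shtml'
)

# Hypothetical hrefs, made up for this example.
hrefs = [
    'http://www.globaltimes.cn/content/5555555.shtml',  # article page
    'http://www.globaltimes.cn/china/politics/',         # section index
    'http://www.globaltimes.cn/content/12.shtml',        # id has too few digits
]

# Keep only the links that look like article pages.
matches = [h for h in hrefs if ARTICLE_URL.search(h)]
print(matches)
```

Only the first href should survive the filter, since the other two are either a section index or have an id shorter than four digits.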
Old 04-24-2019, 05:45 AM   #3
alan smith
Your help is greatly appreciated! I just tested it and the recipe works perfectly, but there is too much blank space between paragraphs in the articles, which hurts the reading experience. It would be perfect if this could be fixed! Thanks again.
Old 04-24-2019, 03:20 PM   #4
lui1
Update for the Global Times

You're welcome. The problem is that they are using br line breaks instead of p elements to separate their paragraphs. This update should improve the layout.
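
To show the idea in isolation, here is a minimal sketch of collapsing a run of br tags into a single paragraph break with a regular expression. The pattern is a simplified stand-in rather than the exact one used in the recipe, and the HTML fragment is invented for illustration.

```python
import re

# Two to nine consecutive <br> tags, each optionally followed by whitespace
# or non-breaking spaces, count as one paragraph break.
BR_RUN = re.compile(r'(?:<br\s*/?>(?:\s|\xa0|&nbsp;)*){2,9}', re.I)

# Hypothetical article fragment, made up for this example.
html = 'First paragraph.<br /><br>&nbsp;\nSecond paragraph.<br>Same paragraph.'

# Replace each run with a single <p>; a lone <br> is left untouched.
result = BR_RUN.sub('<p>', html)
print(result)
```

A single br is kept as an ordinary line break, so only the doubled-up breaks that create the large gaps are rewritten.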

Update for Global Times:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re


def classes(classes):
    q = frozenset(classes.split(' '))
    return dict(
        attrs={'class': lambda x: x and frozenset(x.split()).intersection(q)}
    )


class GlobalTimes(BasicNewsRecipe):
    title = u'Global Times'
    __author__ = 'Jose Ortiz'  # lui1 at mobileread.com
    language = 'en_CN'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    keep_only_tags = [classes('article-title article-source row-content')]

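    # Collapse runs of 2 to 9 <br> tags (with any whitespace or non-breaking
    # spaces between them) into a single paragraph break.
    preprocess_regexps = [
        (re.compile(r'(?:<(?:br(?:\s*/)?|/br\s*)>(?:\s|'
                    '\xA0' r'|&nbsp;)*){2,9}',
                    re.U | re.I),
         lambda match: '<p>')
    ]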

    extra_css = '''
        :root {
            font-family: Arial, Helvetica, sans-serif;
        }

        .article-title {
            font-weight: bold;
            font-size: large;
        }

        .article-source, .row-content {
            font-size:small;
        }
        '''

    def parse_index(self):
        catnames = {}
        catnames["http://www.globaltimes.cn/china/politics/"] = "China Politics"
        catnames["http://www.globaltimes.cn/china/diplomacy/"] = "China Diplomacy"
        catnames["http://www.globaltimes.cn/china/military/"] = "China Military"
        catnames["http://www.globaltimes.cn/business/economy/"] = "China Economy"
        catnames["http://www.globaltimes.cn/world/asia-pacific/"] = "Asia Pacific"
        feeds = []

        for cat in catnames.keys():
            articles = []
            soup = self.index_to_soup(cat)
            for a in soup.findAll(
                'a',
                attrs={
                    'href':
                    re.compile(
                        r'https?://www\.globaltimes\.cn/content/[0-9]{4,10}[.]shtml'
                    )
                }
            ):
                # Typical url http://www.globaltimes.cn/content/5555555.shtml
                url = a['href'].strip()
                title = self.tag_to_string(a).strip()
                if not title:
                    continue
                myarticle = ({
                    'title': title,
                    'url': url,
                    'description': '',
                    'date': ''
                })
                self.log("found '%s'" % title)
                articles.append(myarticle)
                self.log("Adding URL %s\n" % url)
            if articles:
                feeds.append((catnames[cat], articles))
        return feeds

    def postprocess_html(self, soup, first_fetch):
        # Remove empty <p> elements left behind by the <br> substitution.
        for p in [p for p in soup('p') if len(p) == 0]:
            p.extract()
        return soup
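
In case the classes() helper looks cryptic, here is a small standalone sketch of what it builds: a matcher that accepts an element when its class attribute shares at least one name with the wanted list. The helper is copied from the recipe with a renamed argument, and the sample class strings are invented for illustration.

```python
def classes(names):
    # Build the set of wanted class names once, up front.
    wanted = frozenset(names.split(' '))
    return dict(
        attrs={'class': lambda x: x and frozenset(x.split()).intersection(wanted)}
    )


# Pull the matcher function out of the spec that keep_only_tags would receive.
matcher = classes('article-title article-source row-content')['attrs']['class']

# Hypothetical class attribute values, made up for this example.
print(bool(matcher('row-content clearfix')))  # shares 'row-content'
print(bool(matcher('sidebar')))               # no overlap
print(bool(matcher(None)))                    # element without a class attribute
```

The `x and ...` guard is what keeps elements with no class attribute from raising an error, since the value passed in is then None.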

Last edited by lui1; 04-24-2019 at 07:27 PM. Reason: made a few more improvements to the recipe
Old 04-24-2019, 09:48 PM   #5
alan smith
Thanks so much, and thanks for responding so quickly.