Failed to fetch multipage articles

Susa · 03-22-2019, 07:06 AM

Hello, I have tried to fetch the articles on https://language.chinadaily.com.cn/5...03f6866ee845c/
but I only get the first page of each article. The append_page method doesn't seem to work. Can anyone help me with the recipe?

Spoiler:

lui1 · 03-23-2019, 09:13 PM

Hello there,

calibre's built-in recipes already include one for China Daily, but it covers the English-only edition. This section of the site seems to interleave Chinese and English throughout the text, so here is a separate recipe for it. I hope this helps.

China Daily (Chinese-English):

Code:

#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe


PAGE_LIMIT = 50


def absurl(url):
    if url.startswith('//'):
        return 'https:' + url
    elif url.startswith('/'):
        return 'https://language.chinadaily.com.cn' + url
    return url


class ChinaDailyCN_EN(BasicNewsRecipe):
    title = u'权威发布CD'
    __author__ = 'Jose Ortiz'
    description = 'From China Daily'
    encoding = 'utf-8'
    language = 'zh'
    no_stylesheets = True
    remove_javascript = True
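    keep_only_tags = [
        dict(name='div', attrs={'class':'main_title'}),
        dict(name='div', attrs={'class':'mian_txt'}),
        # Keep the pager span as well: preprocess_html() reads the
        # next-page link from it and then removes it.
        dict(name='span', attrs={'class':'next'})
    ]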

    def parse_index(self):
        site = 'https://language.chinadaily.com.cn/5af95d44a3103f6866ee845c/'
        soup = self.index_to_soup(site)
        plist = soup.findAll('p',{'class':'gy_box_txt2' })
        articles = []
        for a in [p.a for p in plist if p.a]:
            title = self.tag_to_string(a)
            url = absurl(a["href"])
            articles.append({'title': title, 'url': url})
        return [('Articles', articles)]

    def preprocess_html(self, soup):
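        try:
            span_next = soup.find('span',{'class':'next'})
            nexturl = absurl(span_next.find('a',{'class':'pagestyle'})['href'])
        except (AttributeError, TypeError, KeyError):
            # No pager span, no link inside it, or no href: single-page article.
            self.log('No extra pages for this one.')
            return self.adeify_images(soup)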

        span_next.extract()
        self.log('Found extra page(2) at',nexturl)
        cache = []
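        for i in range(PAGE_LIMIT):
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div',{'class':'mian_txt'})
            if texttag is None:
                # No article body on this page; stop following links.
                break
            texttag.extract()
            # Prepend, so the cache ends up in reverse page order.
            cache.insert(0, texttag)
            try:
                span_next = soup2.find('span',{'class':'next'})
                nexturl = absurl(span_next.find('a',{'class':'pagestyle'})['href'])
            except (AttributeError, TypeError, KeyError):
                # No "next" link: this was the last page.
                break
            # %-formatting works on Python 2 and 3; unicode() exists only on 2.
            self.log('Found extra page(%d) at' % (i + 3), nexturl)
        else:
            self.log.debug('Exhausted page limit of',PAGE_LIMIT)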

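        # Insert the cached pages right after the first page's text. The cache
        # is in reverse order and every tag goes in at the same index, so the
        # pages end up in reading order.
        div = soup.body.find('div',{'class':'mian_txt'})
        index = 1 + div.parent.contents.index(div)
        for tag in cache:
            div.parent.insert(index,tag)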

        return self.adeify_images(soup)
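
For anyone who wants to check the page-stitching logic without running calibre, here is a minimal sketch of the same idea in plain Python. It is not part of the recipe: `PAGES`, `FIRST`, `stitch` and the `a_1.html` style URLs are hypothetical stand-ins for the site and for `index_to_soup`, and plain strings play the role of the `mian_txt` tags. Only `absurl` and `PAGE_LIMIT` are taken from the recipe above.

```python
PAGE_LIMIT = 50

FIRST = 'https://language.chinadaily.com.cn/a_1.html'  # hypothetical URL

# Hypothetical stand-in for the site: URL -> (page text, next link or None).
PAGES = {
    FIRST: ('page 1', '//language.chinadaily.com.cn/a_2.html'),
    'https://language.chinadaily.com.cn/a_2.html': ('page 2', '/a_3.html'),
    'https://language.chinadaily.com.cn/a_3.html': ('page 3', None),
}


def absurl(url):
    # Same helper as in the recipe: make protocol-relative and
    # root-relative links absolute.
    if url.startswith('//'):
        return 'https:' + url
    elif url.startswith('/'):
        return 'https://language.chinadaily.com.cn' + url
    return url


def stitch(first_url, pages=PAGES, limit=PAGE_LIMIT):
    """Return the texts of all pages of one article, in reading order."""
    text, next_link = pages[first_url]
    body = [text]                  # plays the role of div.parent.contents
    cache = []
    for _ in range(limit):
        if next_link is None:      # no "next" link: last page reached
            break
        text, next_link = pages[absurl(next_link)]
        cache.insert(0, text)      # newest page first, as in the recipe
    index = 1                      # slot right after the first page
    for text in cache:
        body.insert(index, text)   # same index each time restores the order
    return body


print(stitch(FIRST))
```

Because the cache is built back to front and every item is inserted at the same index, the last page goes in first and each earlier page pushes it further down. The `limit` argument caps how many extra pages are followed, which is what `PAGE_LIMIT` does in the recipe.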

Susa · 03-25-2019, 12:49 AM

Yes, they are bilingual documents and speech transcripts. Thank you very much. It works now!

Susa's original recipe (the spoiler in post #1):

Code:

# -*- coding: utf-8 -*-
from calibre.web.feeds.news import BasicNewsRecipe


class shuang1(BasicNewsRecipe):
    title = u'权威发布CD'
    description = 'From China Daily'
    encoding = 'utf-8'
    no_stylesheets = True
    remove_javascript = True
    keep_only_tags = [dict(name='div', attrs={'class':'main_title'}),
                      dict(name='div', attrs={'class':'mian_txt'})]

    def get_title(self, link):
        return link.contents[0].strip()

    def parse_index(self):
        site = 'https://language.chinadaily.com.cn/5af95d44a3103f6866ee845c/'
        soup = self.index_to_soup(site)
        div = soup.findAll('p', { 'class': 'gy_box_txt2' })
        articles = []
        for link in div:
            til = link.a.contents[0].strip()
            url = 'https:' + link.a.get("href")
            a = { 'title': til, 'url': url }
            articles.append(a)
        ans = [(til, articles)]
        return ans

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a', attrs={'class':'pagestyle'})
        if pager:
            nexturl = 'https:' + pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'mian_txt'})
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a', attrs={'class':'pagestyle'})
        if pager:
            pager.extract()
        return self.adeify_images(soup)


