03-22-2019, 07:06 AM   #1
Susa
Failed to fetch multipage articles


Hello, I have tried to fetch the articles on https://language.chinadaily.com.cn/5...03f6866ee845c/
but I only get the first page of each article. The append_page method doesn't seem to work. I wonder if anyone can help me with the recipe.


Spoiler:

# -*- coding: utf-8 -*-
from calibre.web.feeds.news import BasicNewsRecipe


class shuang1(BasicNewsRecipe):

    title = u'权威发布CD'
    description = 'From China Daily'
    encoding = 'utf-8'
    no_stylesheets = True
    remove_javascript = True
    keep_only_tags = [dict(name='div', attrs={'class':'main_title'}),
                      dict(name='div', attrs={'class':'mian_txt'})]

    def get_title(self, link):
        return link.contents[0].strip()

    def parse_index(self):
        site = 'https://language.chinadaily.com.cn/5af95d44a3103f6866ee845c/'
        soup = self.index_to_soup(site)
        div = soup.findAll('p', { 'class': 'gy_box_txt2' })
        articles = []

        for link in div:
            til = link.a.contents[0].strip()
            url = 'https:' + link.a.get("href")
            a = { 'title': til, 'url': url }
            articles.append(a)

        ans = [(til, articles)]
        return ans

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a', attrs={'class':'pagestyle'})
        if pager:
            nexturl = 'https:' + pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'mian_txt'})
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a', attrs={'class':'pagestyle'})
        if pager:
            pager.extract()
        return self.adeify_images(soup)

03-23-2019, 09:13 PM   #2
lui1
Recipe for China Daily

Hello there,

Calibre already ships a built-in China Daily recipe, but it covers only the English edition. This section of the site seems to interleave Chinese and English throughout the text, so it needs its own recipe. I hope this helps.

China Daily (Chinese-English):
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe


PAGE_LIMIT = 50


def absurl(url):
    if url.startswith('//'):
        return 'https:' + url
    elif url.startswith('/'):
        return 'https://language.chinadaily.com.cn' + url
    return url


class ChinaDailyCN_EN(BasicNewsRecipe):
    title = u'权威发布CD'
    __author__ = 'Jose Ortiz'
    description = 'From China Daily'
    encoding = 'utf-8'
    language = 'zh'
    no_stylesheets = True
    remove_javascript = True
    keep_only_tags = [
        dict(name='div', attrs={'class':'main_title'}),
        dict(name='div', attrs={'class':'mian_txt'}),
        dict(name='span', attrs={'class':'next'})
    ]

    def parse_index(self):
        site = 'https://language.chinadaily.com.cn/5af95d44a3103f6866ee845c/'
        soup = self.index_to_soup(site)
        plist = soup.findAll('p',{'class':'gy_box_txt2' })
        articles = []
        for a in [p.a for p in plist if p.a]:
            title = self.tag_to_string(a)
            url = absurl(a["href"])
            articles.append({'title': title, 'url': url})
        return [('Articles', articles)]

    def preprocess_html(self, soup):
        # Look for the "next page" link; an article without one is single-page.
        try:
            span_next = soup.find('span',{'class':'next'})
            nexturl = absurl(span_next.find('a',{'class':'pagestyle'})['href'])
        except (AttributeError, TypeError, KeyError):
            self.log('No extra pages for this one.')
            return self.adeify_images(soup)

        span_next.extract()
        self.log('Found extra page(2) at',nexturl)
        cache = []
        for i in range(PAGE_LIMIT):
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div',{'class':'mian_txt'})
            if texttag is None:
                # No article body on this page, so stop following links.
                break
            texttag.extract()
            # Latest page goes first; the same-index inserts below restore order.
            cache.insert(0, texttag)
            try:
                span_next = soup2.find('span',{'class':'next'})
                nexturl = absurl(span_next.find('a',{'class':'pagestyle'})['href'])
                # %-formatting instead of unicode(), which does not exist on Python 3.
                self.log('Found extra page(%d) at' % (i + 3), nexturl)
            except (AttributeError, TypeError, KeyError):
                break
        else:
            self.log.debug('Exhausted page limit of',PAGE_LIMIT)

        # Insert the cached pages right after the first page's text.
        div = soup.body.find('div',{'class':'mian_txt'})
        index = 1 + div.parent.contents.index(div)
        for tag in cache:
            div.parent.insert(index,tag)

        return self.adeify_images(soup)
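
In case the ordering trick in preprocess_html looks odd, here is a small stand-alone sketch that can be run outside calibre. It only models the idea with plain Python lists, so it is not the recipe itself: splice_pages, the base parameter, and the sample paths are hypothetical names made up for this illustration. Each new page is put at the front of the cache, and then every cached page is inserted at the same index, which pushes the earlier inserts down and leaves the pages in reading order.

```python
def absurl(url, base='https://language.chinadaily.com.cn'):
    """Turn protocol-relative or site-relative links into absolute URLs.

    Same idea as absurl() in the recipe; `base` is a parameter added
    only for this sketch.
    """
    if url.startswith('//'):
        return 'https:' + url
    if url.startswith('/'):
        return base + url
    return url


def splice_pages(parent, first_page, extra_pages):
    """Insert extra_pages right after first_page, keeping their order.

    `parent` is a plain list standing in for the children of the tag
    that holds the article text (an assumption of this sketch).
    """
    cache = []
    for page in extra_pages:
        # Latest page first, exactly like cache.insert(0, texttag).
        cache.insert(0, page)
    index = 1 + parent.index(first_page)
    for page in cache:
        # Every insert lands at the same index and shifts the others down.
        parent.insert(index, page)
    return parent


if __name__ == '__main__':
    # Hypothetical link, used only to show the URL normalisation.
    print(absurl('/a/example_2.html'))
    print(splice_pages(['title', 'page1', 'footer'], 'page1', ['page2', 'page3']))
```

Appending to the cache and inserting at an increasing index would work just as well; the reversed cache simply keeps the insert position fixed.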
03-25-2019, 12:49 AM   #3
Susa
Yes, they are bilingual documents and speech transcripts. Thank you very much. It works now!
All times are GMT -4.