Old 03-22-2019, 07:06 AM   #1
Susa
Junior Member
Susa began at the beginning.
 
Posts: 2
Karma: 10
Join Date: Jan 2019
Device: Kindle Paperwhite 3
Failed to fetch multipage articles


Hello, I tried to fetch the articles on https://language.chinadaily.com.cn/5...03f6866ee845c/
but I only got the first page of each article. The append_page method doesn't seem to work. Could anyone help me with the recipe?


Spoiler:

# -*- coding: utf-8 -*-
from calibre.web.feeds.news import BasicNewsRecipe


class shuang1(BasicNewsRecipe):

    title = u'权威发布CD'
    description = 'From China Daily'
    encoding = 'utf-8'
    no_stylesheets = True
    remove_javascript = True
    keep_only_tags = [dict(name='div', attrs={'class':'main_title'}),
                      dict(name='div', attrs={'class':'mian_txt'})]

    def get_title(self, link):
        return link.contents[0].strip()

    def parse_index(self):
        site = 'https://language.chinadaily.com.cn/5af95d44a3103f6866ee845c/'
        soup = self.index_to_soup(site)
        div = soup.findAll('p', { 'class': 'gy_box_txt2' })
        articles = []

        for link in div:
            til = link.a.contents[0].strip()
            url = 'https:' + link.a.get("href")
            a = { 'title': til, 'url': url }
            articles.append(a)

        ans = [(til, articles)]
        return ans

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a', attrs={'class':'pagestyle'})
        if pager:
            nexturl = 'https:' + pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'mian_txt'})
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a', attrs={'class':'pagestyle'})
        if pager:
            pager.extract()
        return self.adeify_images(soup)


Old 03-23-2019, 09:13 PM   #2
lui1
Enthusiast
lui1 began at the beginning.
 
Posts: 36
Karma: 10
Join Date: Dec 2017
Location: Los Angeles, CA
Device: Smart Phone
Recipe for China Daily

Hello there,

calibre already ships a built-in China Daily recipe, but it covers the English-only edition. This site seems to interleave Chinese and English throughout the text, so here is a separate recipe for it. I hope this helps.

China Daily (Chinese-English):
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe


PAGE_LIMIT = 50


def absurl(url):
    if url.startswith('//'):
        return 'https:' + url
    elif url.startswith('/'):
        return 'https://language.chinadaily.com.cn' + url
    return url


class ChinaDailyCN_EN(BasicNewsRecipe):
    title = u'权威发布CD'
    __author__ = 'Jose Ortiz'
    description = 'From China Daily'
    encoding = 'utf-8'
    language = 'zh'
    no_stylesheets = True
    remove_javascript = True
    keep_only_tags = [
        dict(name='div', attrs={'class':'main_title'}),
        dict(name='div', attrs={'class':'mian_txt'}),
        dict(name='span', attrs={'class':'next'})
    ]

    def parse_index(self):
        site = 'https://language.chinadaily.com.cn/5af95d44a3103f6866ee845c/'
        soup = self.index_to_soup(site)
        plist = soup.findAll('p',{'class':'gy_box_txt2' })
        articles = []
        for a in [p.a for p in plist if p.a]:
            title = self.tag_to_string(a)
            url = absurl(a["href"])
            articles.append({'title': title, 'url': url})
        return [('Articles', articles)]

    def preprocess_html(self, soup):
        try:
            span_next = soup.find('span',{'class':'next'})
            nexturl = absurl(span_next.find('a',{'class':'pagestyle'})['href'])
        except (AttributeError, TypeError, KeyError):
            self.log('No extra pages for this one.')
            return self.adeify_images(soup)

        span_next.extract()
        self.log('Found extra page(2) at',nexturl)
        cache = []
        for i in range(PAGE_LIMIT):
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div',{'class':'mian_txt'})
            texttag.extract()
            cache.insert(0, texttag)
            try:
                span_next = soup2.find('span',{'class':'next'})
                nexturl = absurl(span_next.find('a',{'class':'pagestyle'})['href'])
            except (AttributeError, TypeError, KeyError):
                # No further "next" link: this was the last page.
                break
            self.log('Found extra page(%d) at' % (i + 3), nexturl)
        else:
            self.log.debug('Exhausted page limit of',PAGE_LIMIT)

        div = soup.body.find('div',{'class':'mian_txt'})
        index = 1 + div.parent.contents.index(div)
        for tag in cache:
            div.parent.insert(index,tag)

        return self.adeify_images(soup)
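
For anyone adapting this to another site, here is a rough standalone sketch of the two ideas in the recipe above: following "next" links up to a page limit, and inserting the extra pages after the first one without scrambling their order. It uses plain Python 3 lists and `urllib.parse.urljoin` instead of calibre and BeautifulSoup, so it only illustrates the logic. `FAKE_PAGES`, `collect_pages` and `insert_after` are made-up names for this example, and the example.com URLs are placeholders.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> (page text, href of the next page or None).
FAKE_PAGES = {
    'https://example.com/a/1.html': ('page 1', '//example.com/a/2.html'),
    'https://example.com/a/2.html': ('page 2', '/a/3.html'),
    'https://example.com/a/3.html': ('page 3', None),
}

PAGE_LIMIT = 50


def collect_pages(first_url, fetch, page_limit=PAGE_LIMIT):
    """Follow next-page links from first_url; return the extra pages in reading order."""
    _, next_href = fetch(first_url)
    current = first_url
    extra = []
    for _ in range(page_limit):
        if not next_href:
            break
        # urljoin resolves '//host/path' and '/path' hrefs against the current page URL.
        current = urljoin(current, next_href)
        text, next_href = fetch(current)
        extra.append(text)
    return extra


def insert_after(container, anchor, new_items):
    """Insert new_items right after anchor, keeping their order."""
    index = container.index(anchor) + 1
    # Repeated inserts at the same index reverse the order,
    # so walk the new items backwards to end up with the original order.
    for item in reversed(new_items):
        container.insert(index, item)
    return container


body = ['title', 'page 1', 'footer']
extra = collect_pages('https://example.com/a/1.html', FAKE_PAGES.__getitem__)
insert_after(body, 'page 1', extra)
print(body)
```

The recipe gets the same ordering by building its cache back to front with `cache.insert(0, texttag)` and then inserting every tag at one fixed index.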
Old 03-25-2019, 12:49 AM   #3
Susa
Yes, they are bilingual documents and speech transcripts. Thank you very much. It works now!