03-22-2019, 07:06 AM | #1 |
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2019
Device: Kindle Paperwhite 3
|
Failed to fetch multipage articles
Hello, I have tried to fetch the articles on https://language.chinadaily.com.cn/5...03f6866ee845c/ but I only got the first pages.The append_page didn't seem to work. I wonder if anyone can help me with the recipe. Spoiler:
|
03-23-2019, 09:13 PM | #2 |
Enthusiast
Posts: 36
Karma: 10
Join Date: Dec 2017
Location: Los Angeles, CA
Device: Smart Phone
|
Recipe for China Daily
Hello there,
A China Daily recipe in calibre builtins already exists, but it is an English only version. This one seems to be Chinese interleaved with English throughout the text. I hope this helps. China Daily (Chinese-English): Code:
#!/usr/bin/env python2 # vim:fileencoding=utf-8 from __future__ import unicode_literals, division, absolute_import, print_function from calibre.web.feeds.news import BasicNewsRecipe PAGE_LIMIT = 50 def absurl(url): if url.startswith('//'): return 'https:' + url elif url.startswith('/'): return 'https://language.chinadaily.com.cn' + url return url class ChinaDailyCN_EN(BasicNewsRecipe): title = u'权威发布CD' __author__ = 'Jose Ortiz' description = 'From China Daily' encoding = 'utf-8' language = 'zh' no_stylesheets = True remove_javascript = True keep_only_tags = [ dict(name='div', attrs={'class':'main_title'}), dict(name='div', attrs={'class':'mian_txt'}), dict(name='span', attrs={'class':'next'}) ] def parse_index(self): site = 'https://language.chinadaily.com.cn/5af95d44a3103f6866ee845c/' soup = self.index_to_soup(site) plist = soup.findAll('p',{'class':'gy_box_txt2' }) articles = [] for a in [p.a for p in plist if p.a]: title = self.tag_to_string(a) url = absurl(a["href"]) articles.append({'title': title, 'url': url}) return [('Articles', articles)] def preprocess_html(self, soup): try: span_next = soup.find('span',{'class':'next'}) nexturl = absurl(span_next.find('a',{'class':'pagestyle'})['href']) except: self.log('No extra pages for this one.') return self.adeify_images(soup) span_next.extract() self.log('Found extra page(2) at',nexturl) cache = [] for i in range(PAGE_LIMIT): soup2 = self.index_to_soup(nexturl) texttag = soup2.find('div',{'class':'mian_txt'}) texttag.extract() cache.insert(0, texttag) try: span_next = soup2.find('span',{'class':'next'}) nexturl = absurl(span_next.find('a',{'class':'pagestyle'})['href']) self.log('Found extra page(' + unicode(i + 3) + ') at',nexturl) except: break else: self.log.debug('Exhausted page limit of',PAGE_LIMIT) div = soup.body.find('div',{'class':'mian_txt'}) index = 1 + div.parent.contents.index(div) for tag in cache: div.parent.insert(index,tag) return self.adeify_images(soup) |
Advert | |
|
03-25-2019, 12:49 AM | #3 |
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2019
Device: Kindle Paperwhite 3
|
Yes, they are bilingual documents and speech transcripts. Thank you very much. It works now!
|
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
Medscape: failed fetch news | Barry6 | Recipes | 5 | 04-25-2015 09:31 AM |
How to treat multipage articles? | flyingfoxlee | Recipes | 2 | 12-29-2012 06:38 AM |
Failed to fetch news | Hemant | Calibre | 10 | 08-25-2010 09:22 AM |
Calibre, Instapaper, multipage articles and ordering | flyash | Calibre | 1 | 06-10-2010 07:03 PM |
Failed to Fetch Economist | wayner | Calibre | 10 | 12-19-2009 12:30 AM |