#1
Junior Member
Posts: 1
Karma: 10
Join Date: Aug 2013
Device: Kindle Paperwhite 3G
Dear All
I have recently acquired a free subscription to faz.net's e-paper (Frankfurter Allgemeine Zeitung) and noticed to my surprise that no recipe exists for it. As I have some coding experience, I am happy to implement one myself, but I would like a couple of pointers. FAZ's e-paper is an entire JavaScript app that works entirely on JSON. While this is actually brilliant and should make the code very concise and the output pristine, how should I implement it? Should I change the browser's behaviour to ask for application/json or text/javascript? Is there a preferred JSON parser I should use? I had a quick search through calibre's other news recipes, and this situation seems to be unique.

Thanks in advance for all comments,
MayJune
#2
creator of calibre
Posts: 45,188
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The recipe system is meant to work with HTML, not JSON; I suggest using the HTML pages on the actual FAZ website. If you really want to use JSON, then you would need to implement get_obfuscated_article() in your recipe and convert the JSON to HTML. Python has a builtin JSON parser (import json) that you can use for the purpose.
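For illustration, a minimal sketch of what such a get_obfuscated_article() could look like. The JSON field names ('titel', 'text') are assumptions about the feed, not FAZ's confirmed schema:

```python
import json
from calibre.ptempfile import PersistentTemporaryFile

# In your BasicNewsRecipe subclass: fetch the article's JSON, wrap it in
# minimal HTML and return a local file for calibre to process.
def get_obfuscated_article(self, url):
    raw = self.browser.open(url).read()
    data = json.loads(raw)
    # 'titel' and 'text' are assumed field names, for illustration only
    html = u'<html><body><h1>%s</h1>%s</body></html>' % (
        data['titel'], data['text'])
    pt = PersistentTemporaryFile('.html')
    pt.write(html.encode('utf-8'))
    pt.close()
    return pt.name
```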
#3
Junior Member
Posts: 4
Karma: 10
Join Date: May 2015
Device: Kindle
Hi all,
first of all, thanks to Kovid for his great work on calibre. I have been using it for just a few months and already depend on the morning news delivered straight to my Kindle. All I miss is the German newspaper FAZ. I've created a recipe for downloading the JSON contents from the DEMO page of the e-paper ( http://www.faz.net/e-paper/?GETS=pcp...vitab#DEMO_FAZ ); see the attached source. Downloading the JSON contents seems to work, but I have some problems with the further processing in calibre. Can someone please help me here?

1. Calibre seems to use only the first two articles from the first two sections. All other articles are ignored. I think I missed an option somewhere...
2. The documentation for parse_index says that all articles should be downloaded locally before processing by calibre. Is there a hook in which I can delete those temporary files after the ebook has been created?

Thanks for your help,
chris

Code:
```python
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'

import json, urllib, tempfile, os

from calibre.web.feeds.recipes import BasicNewsRecipe


class FAZ(BasicNewsRecipe):
    title = 'Frankfurter Allgemeine Zeitung'
    description = 'Demo feed'
    __author__ = 'Christoph Klein'
    language = 'de'
    requires_version = (0, 7, 5)
    encoding = 'utf-8'
    max_articles_per_feed = 1000

    # FAZ structures the JSON data as follows:
    # 1. a file for the issue, with a link to each page
    # 2. a file for each page of an issue, with links to the articles on that page
    # 3. a file for each article
    def parse_index(self):
        # adjust the following variable; the demo page is usually 9 days
        # behind: on the 6th of May the issue for the 27th of April is available
        stichtag = "2015-04-27"

        # url for the whole issue
        url = "http://www.faz.net/e-paper/epaper/overview/DEMO_FAZ/" + stichtag
        response = urllib.urlopen(url)
        data = json.loads(response.read())
        pages = data["seiten"]

        books = {}
        tempdir = tempfile.mkdtemp() + os.sep
        for page in pages:
            # downloading json for each page
            url2 = url + "/" + str(page["nummer"])
            response = urllib.urlopen(url2)
            page_data = json.loads(response.read())
            for article in page_data:
                # downloading articles on page; some "articles" are mere
                # layout items and we don't want to download these
                if article["titel"] == "":
                    continue
                url3 = 'http://www.faz.net/e-paper/epaper/' + article["url"]
                response = urllib.urlopen(url3)
                article_content = json.loads(response.read())
                # some "articles", in particular on the front page, are just
                # brief descriptions of articles on a following page; the
                # following heuristic skips these
                if len(article_content["text"]) < 200:
                    continue
                if article["buch"] not in books:
                    books[article["buch"]] = []
                tmpfile_name = tempdir + str(page["nummer"]) + '_' + str(article["documentId"])
                article_data = {
                    'title': article["titel"],
                    'description': article["teaser"],
                    'url': 'file://' + tmpfile_name
                }
                print article["titel"]
                books[article["buch"]].append(article_data)
                f = open(tmpfile_name, "w")
                f.write(article_content["text"])
                f.close()
        return books.items()
```
#4
creator of calibre
Posts: 45,188
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
1) You are likely using --test, which by default restricts the fetched data to the first two articles from the first two feeds.
2) Use `from calibre.ptempfile import PersistentTemporaryDirectory` and then `tdir = PersistentTemporaryDirectory()`. It will be automatically cleaned up on program exit.
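Applied to the recipe above, only the temp-dir setup changes; a minimal sketch (the file name here is a hypothetical placeholder):

```python
import os
from calibre.ptempfile import PersistentTemporaryDirectory

# Replaces tempfile.mkdtemp() in parse_index(): calibre deletes this
# directory, and the article files written into it, on program exit.
tempdir = PersistentTemporaryDirectory()
tmpfile_name = os.path.join(tempdir, '1_12345.html')  # hypothetical name
with open(tmpfile_name, 'wb') as f:
    f.write('<p>downloaded article text goes here</p>')
```

For point 1, running the recipe without --test (for example, `ebook-convert myrecipe.recipe .epub -vv`) should fetch all sections and articles.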
#5
Junior Member
Posts: 4
Karma: 10
Join Date: May 2015
Device: Kindle
Thanks, Kovid.
I will try to improve the layout of the articles and port the recipe to the paid version of the e-paper in the coming weeks.
#6
Junior Member
Posts: 1
Karma: 10
Join Date: May 2015
Device: kobo
Dear Sir,
I am trying to fetch Gujarati news from Sandesh news (www.sandesh.com). I have written the following code. Code:
```python
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

import re

from calibre.web.feeds.news import BasicNewsRecipe


class FE_India(BasicNewsRecipe):
    title = 'Sandesh'
    __author__ = 'Parag Soni'
    description = 'Sandesh Gujarati'
    publisher = 'Sandesh'
    category = 'news, politics, finances, India'
    oldest_article = 30
    max_articles_per_feed = 200
    no_stylesheets = True
    encoding = 'cp1252'
    use_embedded_content = False
    language = 'gu_IN'
    remove_empty_feeds = True
    masthead_url = 'http://www.sandesh.com/IMAGES/Sandesh_Logo.gif'
    publication_type = 'magazine'
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } '

    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language
    }

    keep_only_tags = [dict(attrs={'class': 'txt'})]
    remove_attributes = ['width', 'height']

    feeds = [(u'National', u'http://www.sandesh.com/cms/xml/National.xml')]

    def print_version(self, url):
        # rewrite article URLs to the print-friendly version
        match = re.search(r'newsid=(\d+)', url)
        if not match:
            return url
        return 'http://www.sandesh.com/printarticle.aspx?newsid=' + match.group(1)

    def postprocess_html(self, soup, first_fetch):
        # flatten layout tables into divs
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'
        # drop the link back to the site's homepage
        a = soup.find(href='http://www.sandesh.com/')
        if a is not None:
            a.parent.extract()
        return soup
```

Please support me. Thanks in advance,
Parag
#7
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2015
Device: Sony PRS-T1
Recipe for paid FAZ e-paper
Great that it's now possible to download the FAZ e-paper demo feed.
Before I consider getting into making or changing recipes: am I wrong, or would adapting it to the paid FAZ e-paper be only a matter of a few changed lines of code? And if it is, could someone do just that?
#8
Junior Member
Posts: 4
Karma: 10
Join Date: May 2015
Device: Kindle
I've worked on the recipe lately, and it works with the official feed now.
It's available on GitHub: https://gist.github.com/doktorschiwa...97464e2a71771f If you have comments or suggestions, post them here or on GitHub. chris
#9
creator of calibre
Posts: 45,188
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Why are you using urllib2? You should instead do your login in the get_browser() method of the recipe. That way, every URL you fetch with self.browser will automatically have the correct cookies. You can see an example of sending a custom request to do login in the builtin Discover Magazine recipe.
The browser also automatically honours the user's proxy settings, if any.
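For reference, the usual shape of such a get_browser() override; a minimal sketch in which the login URL and form field names are hypothetical placeholders, not FAZ's actual ones:

```python
from calibre.web.feeds.news import BasicNewsRecipe


class FAZEpaper(BasicNewsRecipe):
    title = 'FAZ E-Paper'
    needs_subscription = True  # makes calibre ask for username/password

    def get_browser(self):
        # the returned mechanize browser is used for every later fetch,
        # so the session cookies from this login are reused throughout
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('http://www.faz.net/e-paper/login')  # hypothetical URL
            br.select_form(nr=0)  # first form on the page; adjust as needed
            br['username'] = self.username  # hypothetical field names
            br['password'] = self.password
            br.submit()
        return br
```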
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
FAZ-Net Update | Divingduck | Recipes | 14 | 05-29-2022 11:26 AM |
.net magazine recipe | cram1010 | Recipes | 0 | 07-21-2012 09:26 AM |
FAZ.NET recipe fails due to website redesign | juco | Recipes | 7 | 10-07-2011 11:53 AM |
FAZ.NET: Website-Redesign macht das calibre-Rezept wertlos | juco | Software | 1 | 10-05-2011 02:42 AM |
recipe for FAZ.net - german | schuster | Recipes | 10 | 05-28-2011 12:13 AM |