Hi all,
First of all, thanks to Kovid for his great work on Calibre. I've only been using it for a few months, and I already depend on the morning news delivered straight to my Kindle.
All I'm missing is the German newspaper FAZ. I've created a recipe that downloads the JSON contents from the DEMO page of the e-paper (
http://www.faz.net/e-paper/?GETS=pcp...vitab#DEMO_FAZ ); see the attached source.
Downloading the JSON contents seems to work, but I'm running into some problems with the further processing in Calibre. Could someone please help me here?
1. Calibre seems to use only the first two articles from the first two sections. All other articles are ignored. I think I've missed an option somewhere...
2. The documentation for parse_index says that all articles should be downloaded locally before being processed by Calibre. Is there a hook in which I can delete those temporary files after the ebook has been created?
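For the cleanup I imagine something like the following (a rough sketch only; `remove_tempdir` is a hypothetical helper, and I'm not sure which recipe method is the right place to call it from):

```python
import shutil

def remove_tempdir(path):
    # Delete the whole directory tree created earlier with
    # tempfile.mkdtemp(), ignoring files that are already gone.
    shutil.rmtree(path, ignore_errors=True)

# In the recipe I would then store the tempdir on self and override
# some post-build hook, e.g. (names are guesses on my part):
#
#     def cleanup(self):
#         remove_tempdir(self.tempdir)
```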
Thanks for your help,
chris
Code:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'

import json
import os
import tempfile
import urllib

from calibre.web.feeds.recipes import BasicNewsRecipe


class FAZ(BasicNewsRecipe):
    title = 'Frankfurter Allgemeine Zeitung'
    description = 'Demo feed'
    __author__ = 'Christoph Klein'
    language = 'de'
    requires_version = (0, 7, 5)
    encoding = 'utf-8'
    max_articles_per_feed = 1000

    # FAZ structures the JSON data as follows:
    # 1. a file for the issue, with a link to each page
    # 2. a file for each page of an issue, with links to the articles on that page
    # 3. a file for each article
    def parse_index(self):
        # Adjust the following variable. The demo page is usually 9 days
        # behind: on the 6th of May the issue for the 27th of April is available.
        stichtag = '2015-04-27'
        # URL for the whole issue
        url = 'http://www.faz.net/e-paper/epaper/overview/DEMO_FAZ/' + stichtag
        response = urllib.urlopen(url)
        data = json.loads(response.read())
        pages = data["seiten"]
        books = {}
        tempdir = tempfile.mkdtemp()
        for page in pages:
            # download the JSON for each page
            url2 = url + '/' + str(page["nummer"])
            response = urllib.urlopen(url2)
            page_data = json.loads(response.read())
            for article in page_data:
                # Some "articles" are mere layout items; we don't want to
                # download these.
                if article["titel"] == "":
                    continue
                url3 = 'http://www.faz.net/e-paper/epaper/' + article["url"]
                response = urllib.urlopen(url3)
                article_content = json.loads(response.read())
                # Some "articles", in particular on the front page, are just
                # brief teasers for articles on a following page. The
                # following heuristic skips these.
                if len(article_content["text"]) < 200:
                    continue
                if article["buch"] not in books:
                    books[article["buch"]] = []
                tmpfile_name = os.path.join(
                    tempdir, str(page["nummer"]) + '_' + str(article["documentId"]))
                article_data = {
                    'title': article["titel"],
                    'description': article["teaser"],
                    'url': 'file://' + tmpfile_name,
                }
                print article["titel"]
                books[article["buch"]].append(article_data)
                f = open(tmpfile_name, 'w')
                f.write(article_content["text"])
                f.close()
        return books.items()
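In case it helps anyone reading along: as far as I understand it, parse_index is expected to return a list of (section_title, article_list) tuples, where each article is a dict with at least 'title' and 'url'. A minimal standalone sketch of the structure my recipe builds (all values below are made up):

```python
# Build the same books-dict shape the recipe uses, then convert it to
# the list of (section, articles) tuples that parse_index must return.
books = {}
books.setdefault('Politik', []).append({
    'title': 'Example headline',        # hypothetical values
    'description': 'Example teaser',
    'url': 'file:///tmp/12_34567',      # local file written beforehand
})
res = list(books.items())
# res is now [('Politik', [<article dict>])]
```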