Old 05-06-2015, 03:04 PM   #3
chris23
Junior Member
Posts: 4
Karma: 10
Join Date: May 2015
Device: Kindle
Hi all,

First of all, thanks to Kovid for his great work on calibre. I've only been using it for a few months, but I already depend on the morning news delivered straight to my Kindle.

The only paper I'm missing is the German newspaper FAZ. I've written a recipe that downloads the JSON contents from the demo page of the e-paper ( http://www.faz.net/e-paper/?GETS=pcp...vitab#DEMO_FAZ ); see the attached source.

Downloading the JSON contents seems to work, but I have some problems with the further processing in calibre. Can someone please help me here?

1. Calibre seems to use only the first two articles from the first two sections; all other articles are ignored. I suspect I've missed an option somewhere...

2. The documentation for parse_index() says that all articles should be downloaded locally before calibre processes them. Is there a hook in which I can delete those temporary files after the ebook has been created?
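For reference, here is the direction I experimented with: as far as I can tell from the docs, BasicNewsRecipe has a cleanup() method that calibre calls after all articles have been downloaded. I'm not sure it runs late enough (i.e. after the ebook itself is built), so please treat this as an untested sketch; the stand-in class below just shows the idea outside of calibre:

```python
import os
import shutil
import tempfile


class FAZCleanupSketch(object):
    # Stand-in for the recipe class. In the real recipe, parse_index()
    # would store the temp directory on self, and this method would
    # override BasicNewsRecipe.cleanup().
    def __init__(self):
        # In the recipe this assignment would live in parse_index()
        self.tempdir = tempfile.mkdtemp()

    def cleanup(self):
        # Remove the temporary article files created during parse_index()
        if getattr(self, 'tempdir', None) and os.path.isdir(self.tempdir):
            shutil.rmtree(self.tempdir)
            self.tempdir = None
```

If cleanup() turns out to run too early, the temp files could instead be left for the OS to reap, since tempfile.mkdtemp() puts them under the system temp directory anyway.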

Thanks for your help,

chris
Code:
#!/usr/bin/env  python2
# -*- coding: utf-8 -*-
__license__   = 'GPL v3'

import json, os, tempfile, urllib
from calibre.web.feeds.recipes import BasicNewsRecipe

class FAZ(BasicNewsRecipe):

    title='Frankfurter Allgemeine Zeitung'
    description = 'Demo feed'

    __author__  = 'Christoph Klein'
    language = 'de'
    requires_version = (0, 7, 5)
    encoding = 'utf-8'

    max_articles_per_feed = 1000

    #FAZ structures the JSON data as follows
    #1. a file for the issue, with a link to each page
    #2. a file for each page of an issue, with links to the articles on that page
    #3. a file for each article


    def parse_index(self):
        # adjust the following variable:
        # the demo page is usually 9 days behind; on the 6th of May
        # the issue for the 27th of April is available
        stichtag = "2015-04-27"

        #url for the whole issue
        url = "http://www.faz.net/e-paper/epaper/overview/DEMO_FAZ/" + stichtag
        response = urllib.urlopen(url)
        data = json.loads(response.read())
        pages = data["seiten"]

        books = {}

        tempdir = tempfile.mkdtemp() + os.sep

        for page in pages:
            #downloading json for each page
            url2 = url + "/" + str(page["nummer"])
            response = urllib.urlopen(url2)
            page_data = json.loads(response.read())

            
            for article in page_data:
                #downloading articles on page

                # some entries are mere layout items; we don't want to download these
                if article["titel"] == "":
                    continue

                url3 = 'http://www.faz.net/e-paper/epaper/' + article["url"]
                response = urllib.urlopen(url3)
                article_content = json.loads(response.read())

                # some entries, in particular on the front page, are just brief teasers
                # for articles on a later page; the following heuristic skips them
                if len(article_content["text"]) < 200:
                    continue

                if article["buch"] not in books:
                    books[article["buch"]] = []

                tmpfile_name = tempdir + str(page["nummer"]) + '_' + str(article["documentId"])
                article_data={
                    'title'         : article["titel"],
                    'description'   : article["teaser"],
                    'url'           : 'file://' + tmpfile_name
                }

                print article["titel"]

                books[article["buch"]].append(article_data)
                # use a with-block so the file is actually closed
                # (the original f.close was missing its parentheses)
                with open(tmpfile_name, "w") as f:
                    f.write(article_content["text"])
        
        return books.items()

Last edited by PeterT; 05-06-2015 at 03:19 PM. Reason: Change quote to code to ensure indentation is displayed