#1
Junior Member
Posts: 1
Karma: 10
Join Date: Aug 2013
Device: Kindle Paperwhite 3G
Dear All
I have recently acquired a free subscription to faz.net's e-paper (Frankfurter Allgemeine Zeitung) and noticed to my surprise that no recipe exists for it. As I have some coding experience, I am happy to implement one myself, but I would like a couple of pointers. FAZ's e-paper is an entire JavaScript app that works entirely on JSON. While this is actually brilliant and should make the code very concise and the output pristine, how should I implement it? Should I change the browser's behaviour to ask for application/json or text/javascript? Is there a preferred JSON parser I should use? I had a quick search through calibre's other news recipes, and this situation seems to be unique.

Thanks in advance for all comments,
MayJune
#2
creator of calibre
Posts: 45,188
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The recipe system is meant to work with HTML, not JSON; I suggest using the HTML pages on the actual FAZ website. If you really want to use JSON, then you would need to implement get_obfuscated_article() in your recipe and convert the JSON to HTML. Python has a builtin JSON parser (import json) that you can use for the purpose.
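For illustration, a minimal sketch of what such a get_obfuscated_article() could look like. The JSON field names ('titel', 'text') are assumptions about the feed, not FAZ's confirmed schema:

```python
import json
from calibre.ptempfile import PersistentTemporaryFile

# In your BasicNewsRecipe subclass: fetch the article's JSON, wrap it in
# minimal HTML and return a local file for calibre to process.
def get_obfuscated_article(self, url):
    raw = self.browser.open(url).read()
    data = json.loads(raw)
    # 'titel' and 'text' are assumed field names, for illustration only
    html = u'<html><body><h1>%s</h1>%s</body></html>' % (
        data['titel'], data['text'])
    pt = PersistentTemporaryFile('.html')
    pt.write(html.encode('utf-8'))
    pt.close()
    return pt.name
```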
#3
Junior Member
Posts: 4
Karma: 10
Join Date: May 2015
Device: Kindle
Hi all,
first of all, thanks to Kovid for his great work on calibre. I have been using it for just a few months and already depend on the morning news delivered straight to my Kindle. All I miss is the German newspaper FAZ. I've created a recipe for downloading the JSON contents from the DEMO page of the e-paper ( http://www.faz.net/e-paper/?GETS=pcp...vitab#DEMO_FAZ ); see the attached source. Downloading the JSON contents seems to work, but I have some problems with the further processing in calibre. Can someone please help me here?

1. Calibre seems to use only the first two articles from the first two sections. All other articles are ignored. I think I missed an option somewhere...
2. The documentation for parse_index says that all articles should be downloaded locally before processing by calibre. Is there a hook in which I can delete those temporary files after the ebook has been created?

Thanks for your help,
chris

Code:
```python
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'

import json, urllib, tempfile, os

from calibre.web.feeds.recipes import BasicNewsRecipe


class FAZ(BasicNewsRecipe):
    title = 'Frankfurter Allgemeine Zeitung'
    description = 'Demo feed'
    __author__ = 'Christoph Klein'
    language = 'de'
    requires_version = (0, 7, 5)
    encoding = 'utf-8'
    max_articles_per_feed = 1000

    # FAZ structures the JSON data as follows:
    # 1. a file for the issue, with a link to each page
    # 2. a file for each page of an issue, with links to the articles on that page
    # 3. a file for each article
    def parse_index(self):
        # adjust the following variable; the demo page is usually 9 days
        # behind: on the 6th of May the issue for the 27th of April is available
        stichtag = "2015-04-27"

        # url for the whole issue
        url = "http://www.faz.net/e-paper/epaper/overview/DEMO_FAZ/" + stichtag
        response = urllib.urlopen(url)
        data = json.loads(response.read())
        pages = data["seiten"]

        books = {}
        tempdir = tempfile.mkdtemp() + os.sep
        for page in pages:
            # downloading json for each page
            url2 = url + "/" + str(page["nummer"])
            response = urllib.urlopen(url2)
            page_data = json.loads(response.read())
            for article in page_data:
                # downloading articles on page; some "articles" are mere
                # layout items and we don't want to download these
                if article["titel"] == "":
                    continue
                url3 = 'http://www.faz.net/e-paper/epaper/' + article["url"]
                response = urllib.urlopen(url3)
                article_content = json.loads(response.read())
                # some "articles", in particular on the front page, are just
                # brief descriptions of articles on a following page; the
                # following heuristic skips these
                if len(article_content["text"]) < 200:
                    continue
                if article["buch"] not in books:
                    books[article["buch"]] = []
                tmpfile_name = tempdir + str(page["nummer"]) + '_' + str(article["documentId"])
                article_data = {
                    'title': article["titel"],
                    'description': article["teaser"],
                    'url': 'file://' + tmpfile_name
                }
                print article["titel"]
                books[article["buch"]].append(article_data)
                f = open(tmpfile_name, "w")
                f.write(article_content["text"])
                f.close()
        return books.items()
```
#4
creator of calibre
Posts: 45,188
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
1) You are likely using --test, which by default restricts the fetched data to the first two articles from the first two feeds.
2) Use `from calibre.ptempfile import PersistentTemporaryDirectory` and then `tdir = PersistentTemporaryDirectory()`. It will be automatically cleaned up on program exit.
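Applied to the recipe above, only the temp-dir setup changes; a minimal sketch (the file name here is a hypothetical placeholder):

```python
import os
from calibre.ptempfile import PersistentTemporaryDirectory

# Replaces tempfile.mkdtemp() in parse_index(): calibre deletes this
# directory, and the article files written into it, on program exit.
tempdir = PersistentTemporaryDirectory()
tmpfile_name = os.path.join(tempdir, '1_12345.html')  # hypothetical name
with open(tmpfile_name, 'wb') as f:
    f.write('<p>downloaded article text goes here</p>')
```

For point 1, running the recipe without --test (for example, `ebook-convert myrecipe.recipe .epub -vv`) should fetch all sections and articles.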
#5
Junior Member
Posts: 4
Karma: 10
Join Date: May 2015
Device: Kindle
Thanks, Kovid.
I will try to improve the layout of the articles and port the recipe to the paid version of the e-paper in the coming weeks.
#6
Junior Member
Posts: 1
Karma: 10
Join Date: May 2015
Device: kobo
Dear Sir,
I am trying to fetch Gujarati news from Sandesh news (www.sandesh.com). I have written the following code. Code:
```python
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

import re

from calibre.web.feeds.news import BasicNewsRecipe


class FE_India(BasicNewsRecipe):
    title = 'Sandesh'
    __author__ = 'Parag Soni'
    description = 'Sandesh Gujarati'
    publisher = 'Sandesh'
    category = 'news, politics, finances, India'
    oldest_article = 30
    max_articles_per_feed = 200
    no_stylesheets = True
    encoding = 'cp1252'
    use_embedded_content = False
    language = 'gu_IN'
    remove_empty_feeds = True
    masthead_url = 'http://www.sandesh.com/IMAGES/Sandesh_Logo.gif'
    publication_type = 'magazine'
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } '

    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language
    }

    keep_only_tags = [dict(attrs={'class': 'txt'})]
    remove_attributes = ['width', 'height']

    feeds = [(u'National', u'http://www.sandesh.com/cms/xml/National.xml')]

    def print_version(self, url):
        # rewrite article URLs to the print-friendly version
        match = re.search(r'newsid=(\d+)', url)
        if not match:
            return url
        return 'http://www.sandesh.com/printarticle.aspx?newsid=' + match.group(1)

    def postprocess_html(self, soup, first_fetch):
        # flatten layout tables into divs
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'
        # drop the link back to the site's homepage
        a = soup.find(href='http://www.sandesh.com/')
        if a is not None:
            a.parent.extract()
        return soup
```

Please support me. Thanks in advance,
Parag
#7
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2015
Device: Sony PRS-T1
Recipe for paid FAZ e-paper
Great that it's now possible to download the FAZ e-paper demo feed.
Before I consider getting into making or changing recipes: am I wrong, or would adapting it to the paid FAZ e-paper be only a matter of a few changed lines of code? And if it is, could someone do just that?
#8
Junior Member
Posts: 4
Karma: 10
Join Date: May 2015
Device: Kindle
I've worked on the recipe lately, and it works with the official feed now.
It's available on GitHub: https://gist.github.com/doktorschiwa...97464e2a71771f If you have comments or suggestions, post them here or on GitHub. chris
#9
creator of calibre
Posts: 45,188
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Why are you using urllib2? You should instead do your login in the get_browser() method of the recipe. That way, every URL you fetch with self.browser will automatically have the correct cookies. You can see an example of sending a custom request to do login in the builtin Discover Magazine recipe.
The browser also automatically honours the user's proxy settings, if any.
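For reference, the usual shape of such a get_browser() override; a minimal sketch in which the login URL and form field names are hypothetical placeholders, not FAZ's actual ones:

```python
from calibre.web.feeds.news import BasicNewsRecipe


class FAZEpaper(BasicNewsRecipe):
    title = 'FAZ E-Paper'
    needs_subscription = True  # makes calibre ask for username/password

    def get_browser(self):
        # the returned mechanize browser is used for every later fetch,
        # so the session cookies from this login are reused throughout
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('http://www.faz.net/e-paper/login')  # hypothetical URL
            br.select_form(nr=0)  # first form on the page; adjust as needed
            br['username'] = self.username  # hypothetical field names
            br['password'] = self.password
            br.submit()
        return br
```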
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
FAZ-Net Update | Divingduck | Recipes | 14 | 05-29-2022 11:26 AM |
.net magazine recipe | cram1010 | Recipes | 0 | 07-21-2012 09:26 AM |
FAZ.NET recipe fails due to website redesign | juco | Recipes | 7 | 10-07-2011 11:53 AM |
FAZ.NET: Website-Redesign macht das calibre-Rezept wertlos | juco | Software | 1 | 10-05-2011 02:42 AM |
recipe for FAZ.net - german | schuster | Recipes | 10 | 05-28-2011 12:13 AM |