Old 10-02-2013, 11:53 AM   #1
MayJune
Junior Member
Posts: 1
Karma: 10
Join Date: Aug 2013
Device: Kindle Paperwhite 3G
Creating a recipe for faz.net's e-paper

Dear All

I have recently acquired a free subscription to faz.net's e-paper (Frankfurter Allgemeine Zeitung) and noticed to my surprise that no recipe exists. As I have some coding experience, I am happy to implement it myself; however, I would like a couple of pointers:

FAZ's e-paper is an entire JavaScript app that works entirely on JSON. While this is brilliant and should make the code very concise and the output pristine, how should I implement it? Should I change the browser's behaviour to ask for application/json or text/javascript? Is there a preferred JSON parser I should use?

I had a quick search through calibre's other news recipes and this situation seems to be unique.

Thanks for all comments in advance

MayJune
Old 10-02-2013, 11:07 PM   #2
kovidgoyal
creator of calibre
Posts: 45,188
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The recipe system is meant to work with HTML, not JSON; I suggest using the HTML pages on the actual FAZ website. If you really want to use JSON, then you would need to implement get_obfuscated_article() in your recipe and convert the JSON to HTML. Python has a builtin JSON parser (import json) that you can use for the purpose.
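
A minimal sketch of that approach (not a tested recipe; the JSON field names "titel" and "text" are assumptions, taken from the FAZ data discussed later in this thread):

Code:
def get_obfuscated_article(self, url):
    # fetch the JSON payload with the recipe's browser, build a small
    # HTML document from it, and return the path to a local file, which
    # calibre then processes like any downloaded article
    import json
    from calibre.ptempfile import PersistentTemporaryFile
    raw = self.browser.open(url).read()
    data = json.loads(raw)
    html = u'<html><body><h1>%s</h1><div>%s</div></body></html>' % (
        data['titel'], data['text'])
    f = PersistentTemporaryFile('_faz.html')
    f.write(html.encode('utf-8'))
    f.close()
    return f.name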
Old 05-06-2015, 03:04 PM   #3
chris23
Junior Member
Posts: 4
Karma: 10
Join Date: May 2015
Device: Kindle
Hi all,

First of all, thanks to Kovid for his great work on calibre. I have been using it for just a few months and I already depend on the morning news delivered straight to my Kindle.

All I miss is the German newspaper FAZ. I've created a recipe for downloading the JSON contents from the DEMO page of the e-paper ( http://www.faz.net/e-paper/?GETS=pcp...vitab#DEMO_FAZ ); see the attached source.

Downloading the JSON contents seems to work, but I have some problems with the further processing in Calibre. Can someone please help me here?

1. Calibre seems to use only the first two articles from the first two sections. All other articles are ignored. I think I missed an option somewhere...

2. The documentation for parse_index says that all articles should be downloaded locally before processing by calibre. Is there a hook in which I can delete those temporary files after the ebook has been created?

Thanks for your help,

chris
Code:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'

import json, urllib, tempfile, os
from calibre.web.feeds.recipes import BasicNewsRecipe

class FAZ(BasicNewsRecipe):

    title = 'Frankfurter Allgemeine Zeitung'
    description = 'Demo feed'

    __author__  = 'Christoph Klein'
    language = 'de'
    requires_version = (0, 7, 5)
    encoding = 'utf-8'

    max_articles_per_feed = 1000

    #FAZ structures the JSON data as follows
    #1. a file for the issue, with a link to each page
    #2. a file for each page of an issue, with links to the articles on that page
    #3. a file for each article


    def parse_index(self):
        #adjust the following variable:
        #the demo page is usually 9 days behind; on the 6th of May the issue for the 27th of April is available
        stichtag = "2015-04-27"

        #url for the whole issue
        url = "http://www.faz.net/e-paper/epaper/overview/DEMO_FAZ/" + stichtag
        response = urllib.urlopen(url)
        data = json.loads(response.read())
        pages = data["seiten"]

        books = {}

        tempdir = tempfile.mkdtemp() + os.sep

        for page in pages:
            #download the JSON for each page
            url2 = url + "/" + str(page["nummer"])
            response = urllib.urlopen(url2)
            page_data = json.loads(response.read())

            for article in page_data:
                #download each article on the page

                #some "articles" are mere layout items and we don't want to download these
                if article["titel"] == "":
                    continue

                url3 = 'http://www.faz.net/e-paper/epaper/' + article["url"]
                response = urllib.urlopen(url3)
                article_content = json.loads(response.read())

                #some "articles", in particular on the front page, are just brief teasers
                #for articles on a following page; this heuristic skips them
                if len(article_content["text"]) < 200:
                    continue

                if article["buch"] not in books:
                    books[article["buch"]] = []

                tmpfile_name = tempdir + str(page["nummer"]) + '_' + str(article["documentId"])
                article_data = {
                    'title'       : article["titel"],
                    'description' : article["teaser"],
                    'url'         : 'file://' + tmpfile_name
                }

                print article["titel"]

                books[article["buch"]].append(article_data)
                #write the article text for calibre to pick up later;
                #encode explicitly, since json.loads returns unicode strings
                with open(tmpfile_name, "w") as f:
                    f.write(article_content["text"].encode('utf-8'))
        
        return books.items()

Last edited by PeterT; 05-06-2015 at 03:19 PM. Reason: Change quote to code to ensure indentation is displayed
Old 05-06-2015, 04:54 PM   #4
kovidgoyal
creator of calibre
Posts: 45,188
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
1) You are likely using --test, which restricts the fetched data to the first two articles from the first two feeds by default

2) Use

from calibre.ptempfile import PersistentTemporaryDirectory
tdir = PersistentTemporaryDirectory()

It will be automatically cleaned up on program exit.
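
For point 1, dropping --test from the ebook-convert command line will fetch everything. For point 2, a minimal sketch of how the persistent directory could slot into the recipe above, reusing its variable names (untested):

Code:
from calibre.ptempfile import PersistentTemporaryDirectory

# created once at the top of parse_index(); calibre deletes the whole
# directory automatically when the program exits
tempdir = PersistentTemporaryDirectory()

# inside the article loop, instead of tempfile.mkdtemp():
tmpfile_name = os.path.join(tempdir, str(page["nummer"]) + '_' + str(article["documentId"]))
with open(tmpfile_name, 'w') as f:
    f.write(article_content["text"].encode('utf-8'))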
Old 05-08-2015, 02:16 AM   #5
chris23
Junior Member
Posts: 4
Karma: 10
Join Date: May 2015
Device: Kindle
Thanks, Kovid.
I will try to improve the layout of the articles and port the recipe to the paid version of the e-paper in the coming weeks.
Old 05-08-2015, 05:21 AM   #6
pamisoni
Junior Member
Posts: 1
Karma: 10
Join Date: May 2015
Device: kobo
Dear Sir,
I am trying to fetch Gujarati news from Sandesh news ( www.sandesh.com ). I have written the following code.
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
import re
from calibre.web.feeds.news import BasicNewsRecipe

class FE_India(BasicNewsRecipe):
    title                 = 'Sandesh'
    __author__            = 'Parag Soni'
    description           = 'Sandesh Gujarati'
    publisher             = 'Sandesh'
    category              = 'news, politics, finances, India'
    oldest_article        = 30
    max_articles_per_feed = 200
    no_stylesheets        = True
    encoding              = 'cp1252'
    use_embedded_content  = False
    language              = 'gu_IN'
    remove_empty_feeds    = True
    masthead_url          = 'http://www.sandesh.com/IMAGES/Sandesh_Logo.gif'
    publication_type      = 'magazine'
    extra_css             = ' body{font-family: Arial,Helvetica,sans-serif } '

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        }

    keep_only_tags = [dict(attrs={'class':'txt'})]
    remove_attributes = ['width','height']

    feeds = [(u'National', u'http://www.sandesh.com/cms/xml/National.xml')]
    def print_version(self, url):
        match = re.search(r'newsid=(\d+)', url)
        if not match:
            return url
        return 'http://www.sandesh.com/printarticle.aspx?newsid='+match.group(1)

    def postprocess_html(self, soup, first_fetch):
        # flatten table markup and drop the site's self-link block
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'

        a = soup.find(href='http://www.sandesh.com/')
        if a is not None:
            a.parent.extract()
        return soup
However, I am getting only the headings, not the full news text. Please support me.

Thanks in advance
Parag

Last edited by pamisoni; 05-08-2015 at 05:38 AM. Reason: code
Old 10-21-2015, 12:33 PM   #7
beneun
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2015
Device: Sony PRS-T1
Recipe for paid FAZ Epaper

Great that it's now possible to download the FAZ Epaper demo feed.

Before I consider getting into making or changing recipes: am I wrong, or would adapting it to the paid FAZ Epaper be only a matter of a few changed lines of code?

And if it is: Could someone do just that?
Old 04-15-2016, 01:39 AM   #8
chris23
Junior Member
Posts: 4
Karma: 10
Join Date: May 2015
Device: Kindle
I've worked on the recipe lately and it works with the official feed now.

It's available on github:

https://gist.github.com/doktorschiwa...97464e2a71771f

If you have comments or suggestions, post them here or on GitHub.

chris
Old 04-15-2016, 05:26 AM   #9
kovidgoyal
creator of calibre
Posts: 45,188
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Why are you using urllib2? You should instead do your login in the get_browser() method of the recipe. That way, all URLs you fetch with self.browser will automatically have the correct cookies. You can see an example of sending a custom request to do a login in the builtin Discover Magazine recipe.

And the browser automatically supports the user's proxy settings, if any.
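
A minimal sketch of that pattern (the login URL and form field names here are assumptions, not the real FAZ ones):

Code:
def get_browser(self):
    # log in once here; every subsequent fetch through self.browser
    # then carries the session cookies automatically
    br = BasicNewsRecipe.get_browser(self)
    if self.username is not None and self.password is not None:
        br.open('https://www.faz.net/mein-faz-net/')  # assumed login page
        br.select_form(nr=0)                          # assumed: first form on the page
        br['username'] = self.username                # assumed field names
        br['password'] = self.password
        br.submit()
    return br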