Custom recipes (archive, read-only) - Page 31

ax42 · 04-18-2009, 10:05 AM

Hi,

I'm trying to build a recipe for the following page which lists the current films showing in the Zurich cinemas:

http://www.kulturinfo.ch/kino/db_front/showact.php

Each link goes to a description of the film. I would like to end up with an ebook where the films are the "Chapters".

So far I have the following code:

Code:

#!/usr/bin/env  python
# vim:et:sts:sw=4:sts=4
# Last modified: 2009 Apr 18
"""
zhkino
"""
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class zhkino(BasicNewsRecipe):

    title = "Zurich Cinema"
    __author__ = "Alexis Iglauer"
    description = "Weekly Cinema listing for Zurich"
    index = 'http://www.kulturinfo.ch/kino/db_front/showact.php'
    #remove_tags_before = dict(name='div', id='storytop')
    #remove_tags        = [dict(name='div', id=['seealso', 'storybottom', 'footer', 'ad_banner_top', 'sidebar'])]
    no_stylesheets     = True
    #feeds          = [ ('News Front Page', 'http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml')] 


    def parse_index(self):
        return [('axtst',{'title':'T', 'date':'D', 'url',"U", 'description':"D"})]

As you can see, I am commenting out a few things to get a handle on the error I am getting:

Code:

ax@shiny:/pub/Books/tmp/t$ feeds2disk --debug --test zhkino.py 
Traceback (most recent call last):
  File "/Applications/Tools/calibre.app/Contents/Resources/loaders/feeds2disk.py", line 9, in <module>
    main()
  File "/Applications/Tools/calibre.app/Contents/Resources/lib/python2.6/site-packages.zip/calibre/web/feeds/main.py", line 164, in main
  File "/Applications/Tools/calibre.app/Contents/Resources/lib/python2.6/site-packages.zip/calibre/web/feeds/main.py", line 135, in run_recipe
  File "calibre/web/feeds/recipes/__init__.pyo", line 106, in compile_recipe
  File "/var/folders/QO/QONZvFNdEi0MpoGeGOCCBk+++TI/-Tmp-/calibre_0.5.7__c2R1d_recipes/recipe1.py", line 5, in <module>
    zhkino.py
NameError: name 'zhkino' is not defined

BTW this is on OS X with calibre 0.5.7. Any pointers would be much appreciated.

Kind regards
Alexis

laborg · 04-18-2009, 12:47 PM

Quote:

Originally Posted by ax42

Hi,

Code:

    def parse_index(self):
        return [('axtst',{'title':'T', 'date':'D', 'url',"U", 'description':"D"})]

Alexis

You forgot a ":" between url and "U" ...
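For anyone hitting the same thing: inside a dict literal every entry has to be a key: value pair, so 'url',"U" is rejected by Python with a SyntaxError. A minimal standalone sketch that checks both versions of the return expression (the helper name parses is made up for illustration and is not part of calibre):

```python
# The two return expressions from the thread, kept as source text so the
# broken one can be inspected without crashing this script.
broken = """[('axtst', {'title': 'T', 'date': 'D', 'url', "U", 'description': "D"})]"""
fixed = """[('axtst', {'title': 'T', 'date': 'D', 'url': "U", 'description': "D"})]"""


def parses(source):
    """Return True if `source` is a syntactically valid Python expression."""
    try:
        compile(source, "<recipe>", "eval")
        return True
    except SyntaxError:
        return False


print(parses(broken))  # False
print(parses(fixed))   # True
```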

ax42 · 04-18-2009, 12:55 PM

Quote:

Originally Posted by laborg

You forgot a ":" between url and "U" ...

Well spotted, thanks!

kovidgoyal · 04-18-2009, 01:34 PM

@redp

You can't do TXT output at the moment. The 0.6.0 release of calibre will support this, so you have to wait until then.

redp · 04-18-2009, 03:55 PM

Quote:

Originally Posted by kovidgoyal

@redp

You can't do TXT output at the moment. The 0.6.0 release of calibre will support this, so you have to wait until then.

With all due respect kovidgoyal, I think you underestimate the flexibility of Calibre... I think it is pretty easy to extract the text from the tags bearing the news and later use a regexp to strip the tags from the body. If I understood you correctly, in the future release you can do it with one command, but I bet you can do the same with 3-4 py commands in the current version... There's a good chance I'm missing something, so I beg your pardon in advance,

Redp

kovidgoyal · 04-18-2009, 04:30 PM

The output of recipes is saved as HTML and then processed by the rest of the conversion system. You can certainly write an arbitrarily complex recipe that does whatever you want and then use it with the feeds2disk command to output, but you're on your own doing that.
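As a rough illustration of the "on your own" route redp describes, here is a minimal sketch of turning saved HTML into plain text. It is a standalone Python 3 script using only the standard library html.parser module, not anything calibre provides; the names TextExtractor and html_to_text are made up for this example, and a real recipe would need to handle scripts, styles and block-level spacing more carefully.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace into single spaces.
    return ' '.join(' '.join(parser.chunks).split())


sample = '<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>'
print(html_to_text(sample))  # Title Some bold text.
```

Using a parser rather than a regular expression avoids the usual problems with tags that span lines or contain ">" inside attribute values.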

ax42 · 04-18-2009, 08:02 PM

Right, I'm making progress with my Zurich Cinemas script, but am running into a conceptual issue -- apologies if this is clear in the manual somewhere but so far I haven't found it.

The page http://www.kulturinfo.ch/kino/db_front/showact.php contains a list of films. I would like this list to be the 'table of contents' of my eBook and each link to go to a page giving the film details (as happens when you click on the webpage link). I'm busy overriding parse_index to get a list of feeds but seem to be stuck between choosing one of the following two options:

a) Return a list of films, which makes each film heading a feed with one article. This seems to lead to an intermediate page between the 'table of contents' and the actual film description, with this intermediate page having just the one film on it

b) Return a one-item list, with all films attached as a list of articles to this one feed. This causes a table of contents with a single entry in it. The example I've been cribbing off (The Atlantic) does this too.

Is there any way to not have either 'interstitials' like in a) or a single-entry ToC as in b)? If not, I'd probably choose b) as the lesser evil.......

Thanks
ax42
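To make the two options concrete, here is a small sketch of the two shapes parse_index could return. It assumes the structure used elsewhere in this thread, a list of (feed title, list of article dicts) tuples. The film titles and URLs are placeholders rather than scraped values, and the function names are made up for illustration.

```python
# Hypothetical film data; titles and URLs are placeholders.
films = [
    {'title': 'Film A', 'date': '', 'url': 'http://example.com/a', 'description': ''},
    {'title': 'Film B', 'date': '', 'url': 'http://example.com/b', 'description': ''},
]


def one_feed_per_film(films):
    # Option a): each film becomes its own feed holding a single article.
    return [(film['title'], [film]) for film in films]


def single_feed(films, feed_title='Filme'):
    # Option b): one feed whose article list holds every film.
    return [(feed_title, list(films))]


option_a = one_feed_per_film(films)
option_b = single_feed(films)
print(len(option_a), len(option_b))  # 2 1
```

Option a) produces as many feeds as there are films, option b) produces exactly one feed regardless of how many films are listed.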

OlaNordmann · 04-18-2009, 08:04 PM

Guys.. What am I doing wrong?

Quote:

feeds = [(u'Nyheter utenriks', u'http://www1.vg.no/rss/create.php?categories=12&keywords=&limit=10')]

def print_version(self, url):
    return url.replace('http://go.vg.no/cgi-bin/go.cgi/vg-rss-12/http://www.vg.no/nyheter/utenriks/artikkel.php', 'http://www.vg.no/pub/skrivervennlig.hbs')

Fu*ker goes ahead and fetches "http://go.vg.no/cgi-bin/go.cgi/vg-rss-12/http://www.vg.no/nyheter/utenriks/artikkel.php?artid=562364" instead of "http://www.vg.no/pub/skrivervennlig.hbs?artid=562364"

kiklop74 · 04-18-2009, 09:17 PM

Quote:

Originally Posted by OlaNordmann

Guys.. What am I doing wrong?

Complicating things.

Always keep it simple.

Code:

    def print_version(self, url):
        uneeded, sep, article_id = url.rpartition('artid=')
        return 'http://www.vg.no/pub/skrivervennlig.hbs?artid=' + article_id
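The same idea can be tried outside calibre. The sketch below is a plain function without self, fed with the article URL quoted in the question above:

```python
def print_version(url):
    # Keep only what follows the last 'artid=' and rebuild the
    # printer-friendly URL around that article id.
    _unneeded, _sep, article_id = url.rpartition('artid=')
    return 'http://www.vg.no/pub/skrivervennlig.hbs?artid=' + article_id


feed_url = ('http://go.vg.no/cgi-bin/go.cgi/vg-rss-12/'
            'http://www.vg.no/nyheter/utenriks/artikkel.php?artid=562364')
print(print_version(feed_url))
# http://www.vg.no/pub/skrivervennlig.hbs?artid=562364
```

One caveat: if a URL contains no 'artid=' at all, rpartition puts the whole string in the last slot, so the result would be the prefix followed by the full original URL. A real recipe may want to check for that case.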

kovidgoyal · 04-18-2009, 09:34 PM

@ax42 You can achieve whatever effect you want by overriding create_opf in your recipe

kiklop74 · 04-18-2009, 09:35 PM

Quote:

Originally Posted by ax42

The page http://www.kulturinfo.ch/kino/db_front/showact.php contains a list of films. I would like this list to be the 'table of contents' of my eBook and each link to go to a page giving the film details (as happens when you click on the webpage link). I'm busy overriding parse_index to get a list of feeds but seem to be stuck between choosing one of the following two options:

a) Return a list of films, which makes each film heading a feed with one article. This seems to lead to an intermediate page between the 'table of contents' and the actual film description, with this intermediate page having just the one film on it

Why?? This is quite pointless.

Quote:

Originally Posted by ax42

b) Return a one-item list, with all films attached as a list of articles to this one feed. This causes a table of contents with a single entry in it. The example I've been cribbing off (The Atlantic) does this too.

This is the way to go, since the TOC will be shown with the list of articles on the reader.

A good example of what you want to accomplish can be found in several recipes I wrote.

For example, the Vreme recipe does exactly what you want to do. There is one page that lists all the articles we want to put into the feed, so I just parse them using a condition specific to that page and put the data found into a single feed.

Code:

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)        
        for item in soup.findAll(['h3','h4']):
            description = ''
            title_prefix = ''
            feed_link = item.find('a')
            if feed_link and feed_link.has_key('href') and feed_link['href'].startswith('/cms/view.php'):
                url   = self.INDEX + feed_link['href']
                title = title_prefix + self.tag_to_string(feed_link)
                date  = strftime(self.timefmt)                
                articles.append({
                                  'title'      :title
                                 ,'date'       :date
                                 ,'url'        :url
                                 ,'description':description
                                })
        return [(soup.head.title.string, articles)]

In your case it would look something like this:

Code:

    def parse_index(self):
        articles = []
        soup = self.index_to_soup('http://www.kulturinfo.ch/kino/db_front/showact.php')
        
        for item in soup.findAll('td',attrs={'class':'title'}):
            description = ''
            title_prefix = ''
            feed_link = item.find('a')
            if feed_link and feed_link.has_key('href'):
                unneeded, sep, purl = feed_link['href'].partition('..')
                url   = 'http://www.kulturinfo.ch/kino' + purl
                title = self.tag_to_string(feed_link)
                date  = strftime(self.timefmt)                
                articles.append({
                                  'title'      :title
                                 ,'date'       :date
                                 ,'url'        :url
                                 ,'description':description
                                })
        return [('Articles', articles)]
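The URL-building step in that loop can be checked on its own. This is a sketch only: the relative link below is a made-up example of what an href on the listing page might look like, and the function name absolute_film_url is hypothetical.

```python
def absolute_film_url(href, base='http://www.kulturinfo.ch/kino'):
    # Split at the first '..' and append whatever follows it to the base.
    _unneeded, _sep, path = href.partition('..')
    return base + path


# Hypothetical relative link; the real hrefs on the page may differ.
print(absolute_film_url('../db_front/showfilm.php?id=123'))
# http://www.kulturinfo.ch/kino/db_front/showfilm.php?id=123
```

Note that if an href does not contain '..', partition returns an empty remainder and the function returns just the base URL, so links of a different form would need separate handling.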

OlaNordmann · 04-18-2009, 09:44 PM

Quote:

Originally Posted by kiklop74

Complicating things.

Always keep it simple.

Code:

    def print_version(self, url):
        uneeded, sep, article_id = url.rpartition('artid=')
        return 'http://www.vg.no/pub/skrivervennlig.hbs?artid=' + article_id

Worked like a charm.. I really don't have a clue what I'm doing...
Anyway, thanks a lot, man

I'm so grateful..

ax42 · 04-19-2009, 06:29 AM

@kiklop - thanks! I suspected I was busy trying to reinvent the wheel. I'll clean up my script in the next few days and post it.

@kovidgoyal - sounds interesting, is create_opf in the reference docs somewhere?

BTW, where can I report a bug or fix an error in the online docs? The docs for BasicNewsRecipe.get_feeds() say "Return a list of :term:RSS feeds", which looks like a markup bug. See http://calibre.kovidgoyal.net/user_m...cipe.get_feeds

Thanks again
ax42

ax42 · 04-19-2009, 07:02 AM

@kiklop - I unfortunately can't run the Vreme recipe (it requires a login). Does it result in a page with only one link on it called "Articles"? My code (coincidentally) seems to be quite close to what you suggested already (unless I'm missing something). The recipe for the Atlantic also results in a single page with a "Current Issue" link, which comes from the way parse_index passes back the list of feeds.

Code:

    def parse_index(self):
        films = []
        soup = self.index_to_soup(self.Index)
        for item in soup.findAll('td', attrs={'class':'title'}):
            if self.DEBUG: print 'i:', item, 's:', item.string
            description = ''

            a = item.find('a')
            if a is None:
                # A title cell without a link is the page heading; use it as the book title
                self.title = item.string.replace('AKTUELLE FILMLISTE', 'ZH Cinema')
                if self.DEBUG: print 'title:', self.title

            elif a.has_key('href'):
                # Only add films that have a link, so url is always defined
                url = a['href'].replace('..', 'http://www.kulturinfo.ch/kino')
                if self.DEBUG: print 'url:', url
                title = self.tag_to_string(a)
                films.append({
                                 'title':title,
                                 'date':'',
                                 'url':url,
                                 'description':description
                                })
                if self.DEBUG: print 'ls:', films[-1]
        if self.DEBUG: print 'ret:', [('Filme', films)]
        # One feed containing every film as an article
        return [('Filme', films)]

Any ideas?

ax42

pubolab · 04-19-2009, 08:32 AM

Any chance of good Chinese recipes for zaobao.com?

http://realtime.zaobao.com/news.xml
http://www.zaobao.com/zg/zg.xml
http://www.zaobao.com/gj/gj.xml
http://www.zaobao.com/wencui/wencui.xml

Thanks a lot!
