View Full Version : Please help to clean-up recipe


BlonG
10-27-2010, 03:35 AM
As a newbie, I am trying to learn how to create a recipe by following the examples in the Calibre User Manual.

To create a recipe from RSS that fetches the full article and not just the summary, I should use the URL of the print version (the manual has an example for “bbc.co.uk”). My problem is that I can’t get a URL for the full article, because the print link is just “javascript:window.print()”.

So I tried a different approach: removing and keeping certain tags.
The problem now is that I don’t get the articles that belong to each specific section (each section has its own RSS URL). The articles are divided into sections, but every section contains the same articles.

The recipe is here:
#!/usr/bin/env python

__license__ = 'GPL v3'
__copyright__ = '2010'
'''
dnevnik.si
'''

from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre.web.feeds.news import BasicNewsRecipe

class Dnevnik(BasicNewsRecipe):
    title = u'Dnevnik.si'
    __author__ = 'Test'
    description = 'News'
    oldest_article = 5
    max_articles_per_feed = 20
    no_stylesheets = True
    use_embedded_content = False

    cover_url = 'http://www.dnevnik.si/dsg/dnevnik.si.gif'

    keep_only_tags = [dict(name='div', attrs={'id': ['content', 'heading']})]

    remove_tags = [
        dict(name='div', attrs={'id': 'header'}),
        dict(name='div', attrs={'class': ['related', 'tools', 'inside']}),
        dict(name='dl', attrs={'class': 'ad'}),
    ]

    remove_tags_after = [dict(id='_iprom_inStream')]

    feeds = [
        (u'Izpostavljene novice', u'http://www.dnevnik.si/rss/?articleType=9'),
        (u'Slovenija', u'http://www.dnevnik.si/rss/?articleType=13'),
        (u'Svet', u'http://www.dnevnik.si/rss/?articleType=14'),
        (u'Kronika', u'http://www.dnevnik.si/rss/?articleType=15'),
        (u'Pop/kultura', u'http://www.dnevnik.si/rss/?articleType=17'),
        (u'Zdravje', u'http://www.dnevnik.si/rss/?articleType=18'),
    ]


Link to the list of sections (RSS URLs): http://www.dnevnik.si/kaj_je_rss
RSS link to specific section: http://www.dnevnik.si/rss/?articleType=1&articleSection=14
Article link: http://www.dnevnik.si/novice/svet/1042398632
Print link (label “Natisni”): javascript:window.print()
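
A note on the links above: the feeds in the recipe pass the section number as articleType (for example articleType=13), while the section-specific RSS link uses articleType=1 together with articleSection. That mismatch may be why every section returns the same articles, though this is only a guess from the URLs. Below is a minimal sketch of building the feed list in the second form, in plain Python 3. The helper name section_feed and the two-entry SECTIONS table are illustrative and not part of Calibre.

```python
from urllib.parse import urlencode

RSS_BASE = 'http://www.dnevnik.si/rss/'

# Section numbers as they appear in this post's recipe and links; extend as needed.
SECTIONS = [
    ('Slovenija', 13),
    ('Svet', 14),
]


def section_feed(section_id, article_type=1):
    """Return the RSS URL for one section (hypothetical helper)."""
    query = urlencode([('articleType', article_type),
                       ('articleSection', section_id)])
    return RSS_BASE + '?' + query


# Same (title, url) tuple shape that the recipe's `feeds` attribute uses.
feeds = [(title, section_feed(sid)) for title, sid in SECTIONS]

for title, url in feeds:
    print(title, url)
```

The resulting list can be pasted into the recipe as the feeds attribute, or the section table can be extended with the other section numbers from the RSS overview page.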

Well, if this can be done without the "remove" and "keep" tags, by getting the full-article URL from the "javascript" command, that would perhaps be better (and easier).

Another thing: I am still looking for a kind expert to create a recipe for a magazine (http://www.mobileread.com/forums/showthread.php?t=104118).

marbs
10-27-2010, 04:19 AM
try reading this (http://bugs.calibre-ebook.com/wiki/recipeGuide_advanced).

if that does not work you could try using TamperData to see the request that is posted and recreate it.

is there any way to get to see the print version in your browser? not just have it spit out to the printer?

BlonG
10-27-2010, 05:27 AM
Thank you for the link. I did read it, and understood very little. However, the part 4. Getting obfuscated content (http://bugs.calibre-ebook.com/wiki/recipeGuide_advanced#a4.Gettingobfuscatedcontent) mentions a JavaScript function. But where should I copy this? If I add it to my recipe, I get an error, so I think something is missing.

I made some changes to the recipe, just copy and paste from those instructions:
#!/usr/bin/env python

__license__ = 'GPL v3'
__copyright__ = '2010'
'''
dnevnik.si
'''

from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre.web.feeds.news import BasicNewsRecipe

class Dnevnik(BasicNewsRecipe):
    title = u'Dnevnik.si'
    __author__ = 'Test'
    description = 'News'
    oldest_article = 5
    max_articles_per_feed = 20
    no_stylesheets = True
    use_embedded_content = False
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open(url)
        import mechanize
        print_url = url + '?version=print'
        response = br.follow_link(mechanize.Link(base_url = '', url = print_url, text = '', tag = '', attrs = []))

        html = response.read()

        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()

        return self.temp_files[-1].name

    cover_url = 'http://www.dnevnik.si/dsg/dnevnik.si.gif'

    keep_only_tags = [dict(name='div', attrs={'id': ['content', 'heading']})]

    remove_tags = [
        dict(name='div', attrs={'id': 'header'}),
        dict(name='div', attrs={'class': ['related', 'tools', 'inside']}),
        dict(name='dl', attrs={'class': 'ad'}),
    ]

    remove_tags_after = [dict(id='_iprom_inStream')]

    feeds = [
        (u'Izpostavljene novice', u'http://www.dnevnik.si/rss/?articleType=9'),
        (u'Slovenija', u'http://www.dnevnik.si/rss/?articleType=13'),
        (u'Svet', u'http://www.dnevnik.si/rss/?articleType=14'),
        (u'Kronika', u'http://www.dnevnik.si/rss/?articleType=15'),
        (u'Pop/kultura', u'http://www.dnevnik.si/rss/?articleType=17'),
        (u'Zdravje', u'http://www.dnevnik.si/rss/?articleType=18'),
    ]


Now, my ebook is empty. :(

I tried TamperData, and "javascript:window.print()" requests the same URL as the article itself. So there is no way, at least none that I know of, to see the "print version" in the browser.
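
For reference, the contract of get_obfuscated_article is that it returns the path of a local file holding the article HTML. Two things in the snippet above look suspect, though neither is confirmed as the cause of the empty e-book: PersistentTemporaryFile is used without being imported (in Calibre it lives in calibre.ptempfile), and '?version=print' comes from the guide's example, so dnevnik.si may not serve anything different at that URL. The sketch below shows the same save-to-temp-file pattern using only the Python 3 standard library, so it can be run outside Calibre. The names save_article_html and fake_fetch are made up for illustration, and tempfile.NamedTemporaryFile stands in for PersistentTemporaryFile.

```python
import os
import tempfile


def save_article_html(url, fetch):
    """Fetch `url` with `fetch` and return the path of a temp file with the HTML.

    `fetch` is any callable mapping a URL to bytes; in a recipe this role
    is played by the browser object.
    """
    html = fetch(url)
    # delete=False keeps the file on disk after it is closed.
    with tempfile.NamedTemporaryFile(suffix='_fa.html', delete=False) as f:
        f.write(html)
    return f.name


def fake_fetch(url):
    # Stand-in for a real HTTP request, so the sketch runs offline.
    page = '<html><body><p>Article from %s</p></body></html>' % url
    return page.encode('utf-8')


path = save_article_html('http://www.dnevnik.si/novice/svet/1042398632', fake_fetch)
with open(path, 'rb') as f:
    print(f.read().decode('utf-8'))
os.remove(path)  # clean up the demo file
```

Since the print link only calls window.print() on the article page itself, there may be no separate print URL to fetch, in which case filtering tags on the normal article page is the simpler route.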

BlonG
10-27-2010, 05:57 AM
OK, I solved the problem. :thumbsup:

Didn't bother with javascript, because I didn't understand that stuff. So I just searched for tags and followed other examples.

The recipe is for the Slovenian newspaper "Dnevnik.si (http://www.dnevnik.si/)":

__license__ = 'GPL v3'
__copyright__ = '2010, BlonG'
'''
dnevnik.si
'''

from calibre.web.feeds.news import BasicNewsRecipe


class Dnevnik(BasicNewsRecipe):
    title = u'Dnevnik.si'
    __author__ = u'BlonG'
    description = u"Dnevnik je časnik z več kot polstoletno zgodovino. Pod sloganom »Življenje ima besedo« na svojih straneh prinaša bralcem bogastvo informacij, komentarjev in kolumen in raznovrstnost pogledov, zaznamovanih z odgovornostjo do posameznika in širše družbe."
    oldest_article = 3
    max_articles_per_feed = 20
    no_stylesheets = True
    use_embedded_content = False

    cover_url = 'http://dnk.dnevnik.si/media/uploads/_custom/dnevnik_casopisna_druzba.jpg'

    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
    '''

    html2lrf_options = ['--base-font-size', '10']

    keep_only_tags = [
        dict(name='div', attrs={'id':'_iprom_inStream'}),
        dict(name='div', attrs={'class':'entry-content'}),
    ]

    remove_tags = [
        dict(name='div', attrs={'class':'fb_article_top'}),
        dict(name='div', attrs={'class':'related'}),
        dict(name='div', attrs={'class':'fb_article_foot'}),
        dict(name='div', attrs={'class':'spreading'}),
        dict(name='dl', attrs={'class':'ad'}),
        dict(name='p', attrs={'class':'report'}),
        dict(name='div', attrs={'class':'hfeed comments'}),
        dict(name='dl', attrs={'id':'entryPanel'}),
        dict(name='dl', attrs={'class':'infopush ip_wide'}),
        dict(name='div', attrs={'class':'sidebar'}),
        dict(name='dl', attrs={'class':'bottom'}),
        dict(name='div', attrs={'id':'footer'}),
    ]

    feeds = [
        (u'Slovenija', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=13'),
        (u'Svet', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=14'),
        (u'EU', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=116'),
        (u'Poslovni dnevnik', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=5'),
        (u'Kronika', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=15'),
        (u'Kultura', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=17'),
        (u'Zdravje', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=18'),
        (u'Znanost in IT', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=19'),
        (u'(Ne)verjetno', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=20'),
        (u'E-strada', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=21'),
        (u'Svet vozil', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=22'),
    ]






If anybody has an idea how to improve this, please comment!
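
A note on how the keep_only_tags and remove_tags entries are read: each dict names a tag and the attribute values it must carry. The sketch below is a simplified model of that matching in plain Python 3, not Calibre's or BeautifulSoup's actual code, and spec_matches is a made-up name. It assumes that an attribute value is compared as a whole string (which would explain why the recipe writes 'hfeed comments' in full) and that a list of values means any one of them may match. Real behaviour may differ between BeautifulSoup versions.

```python
def spec_matches(spec, tag_name, tag_attrs):
    """Return True if a tag (name plus attribute dict) satisfies `spec`.

    `spec` mimics the recipe entries: {'name': ..., 'attrs': {...}}.
    """
    name = spec.get('name')
    if name is not None and name != tag_name:
        return False
    for key, wanted in spec.get('attrs', {}).items():
        actual = tag_attrs.get(key)
        # A list of wanted values matches if the attribute equals any of them.
        allowed = wanted if isinstance(wanted, (list, tuple)) else [wanted]
        if actual not in allowed:
            return False
    return True


remove_tags = [
    dict(name='div', attrs={'class': 'related'}),
    dict(name='div', attrs={'class': 'hfeed comments'}),
    dict(name='dl', attrs={'class': 'ad'}),
]

sample_tags = [
    ('div', {'class': 'related'}),
    ('div', {'class': 'entry-content'}),
    ('div', {'class': 'hfeed comments'}),
    ('dl', {'class': 'ad'}),
]

for tag_name, tag_attrs in sample_tags:
    removed = any(spec_matches(s, tag_name, tag_attrs) for s in remove_tags)
    print(tag_name, tag_attrs, 'removed' if removed else 'kept')
```

Running a few tags from the article page through a model like this is a quick way to check which remove_tags entries are actually needed before trimming the list.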