Please help to clean-up recipe

BlonG · 10-27-2010, 03:35 AM

As a newbie, I am trying to learn how to create a recipe by following the examples in the Calibre User Manual.

To create a recipe from RSS that fetches the full article and not just the summary, I should use the print-version URL (the manual has an example for “bbc.co.uk”). My problem is that I can’t get a URL for the full article, because the print link is “javascript:window.print()”.

So I tried a different approach: removing and keeping certain tags.
The problem is that now I don’t get the articles from the specific section (each section has its own RSS URL). The articles are divided into sections, but the same articles appear in every section.

The recipe is here:

Spoiler:

Link to sections (RSS URLs): http://www.dnevnik.si/kaj_je_rss
RSS link to specific section: http://www.dnevnik.si/rss/?articleTy...icleSection=14
Article link: http://www.dnevnik.si/novice/svet/1042398632
Print link (label “Natisni”): javascript:window.print()

Well, if this can be done without the "remove" and "keep" tags, by using the full article URL from the "javascript" command, that would perhaps be better (and easier).

Another thing: I am still looking for some kind expert to create a recipe for a magazine.

marbs · 10-27-2010, 04:19 AM

try reading this.

if that does not work you could try using TamperData to see the request that is posted and recreate it.

is there any way to get to see the print version in your browser? not just have it spit out to the printer?

BlonG · 10-27-2010, 05:27 AM

Thank you for the link. I read it and understood very little. However, the part "4. Getting obfuscated content" mentions a JavaScript function. But where should I copy this? (If I add it to my recipe, I get an error.) So I think something is missing.

I made some changes to the recipe, just copy and paste from those instructions:

Spoiler:

Now my ebook is empty.

I tried TamperData, and "javascript:window.print()" requests the same URL as the article. So there is no way, at least none that I know of, to see the "print version" in a browser.
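
To illustrate the point above: a "javascript:" link is a script call, not a page address, so there is nothing for a recipe to fetch. The sketch below uses only the Python standard library and the two links quoted in this thread. The helper name usable_print_url is hypothetical and is not part of Calibre.

```python
from urllib.parse import urlparse


def usable_print_url(href):
    """Return href if it is a fetchable http(s) URL, otherwise None.

    Hypothetical helper for illustration only; not a Calibre API.
    """
    scheme = urlparse(href).scheme.lower()
    return href if scheme in ("http", "https") else None


# Both links are taken from the first post of this thread.
article_link = "http://www.dnevnik.si/novice/svet/1042398632"
print_link = "javascript:window.print()"

print(usable_print_url(article_link))  # the article URL itself
print(usable_print_url(print_link))    # None: nothing to download
```

When the check returns None, as it does for the "Natisni" link, the remaining option is the one tried in this thread: download the normal article page and trim it with keep_only_tags and remove_tags.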

BlonG · 10-27-2010, 05:57 AM

OK, I solved the problem.

I didn't bother with the JavaScript, because I didn't understand that stuff. So I just searched for the tags and followed other examples.

The recipe is for the Slovenian newspaper "Dnevnik.si".

Quote:

__license__ = 'GPL v3'
__copyright__ = '2010, BlonG'
'''
dnevnik.si
'''
from calibre.web.feeds.news import BasicNewsRecipe

class Dnevnik(BasicNewsRecipe):
    title = u'Dnevnik.si'
    __author__ = u'BlonG'
    description = u"Dnevnik je časnik z več kot polstoletno zgodovino. Pod sloganom »Življenje ima besedo« na svojih straneh prinaša bralcem bogastvo informacij, komentarjev in kolumen in raznovrstnost pogledov, zaznamovanih z odgovornostjo do posameznika in širše družbe."
    oldest_article = 3
    max_articles_per_feed = 20
    no_stylesheets = True
    use_embedded_content = False

    cover_url = 'http://dnk.dnevnik.si/media/uploads/_custom/dnevnik_casopisna_druzba.jpg'

    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
    '''

    html2lrf_options = ['--base-font-size', '10']

    keep_only_tags = [
        dict(name='div', attrs={'id':'_iprom_inStream'}),
        dict(name='div', attrs={'class':'entry-content'}),
    ]

    remove_tags = [
        dict(name='div', attrs={'class':'fb_article_top'}),
        dict(name='div', attrs={'class':'related'}),
        dict(name='div', attrs={'class':'fb_article_foot'}),
        dict(name='div', attrs={'class':'spreading'}),
        dict(name='dl', attrs={'class':'ad'}),
        dict(name='p', attrs={'class':'report'}),
        dict(name='div', attrs={'class':'hfeed comments'}),
        dict(name='dl', attrs={'id':'entryPanel'}),
        dict(name='dl', attrs={'class':'infopush ip_wide'}),
        dict(name='div', attrs={'class':'sidebar'}),
        dict(name='dl', attrs={'class':'bottom'}),
        dict(name='div', attrs={'id':'footer'}),
    ]

    feeds = [
        (u'Slovenija', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=13')
        ,(u'Svet', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=14')
        ,(u'EU', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=116')
        ,(u'Poslovni dnevnik', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=5')
        ,(u'Kronika', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=15')
        ,(u'Kultura', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=17')
        ,(u'Zdravje', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=18')
        ,(u'Znanost in IT', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=19')
        ,(u'(Ne)verjetno', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=20')
        ,(u'E-strada', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=21')
        ,(u'Svet vozil', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=22')
    ]
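
One possible tidy-up, offered as a sketch rather than a tested change to the recipe: every feed URL above differs only in its articleSection number, so the list could be generated from (name, id) pairs. The names BASE, SECTIONS and build_feeds are made up for this example; the section names and ids are copied from the feeds list above.

```python
from urllib.parse import urlencode

# Common prefix of every feed URL in the recipe.
BASE = "http://www.dnevnik.si/rss/"

# (section name, articleSection id), copied from the recipe's feeds list.
SECTIONS = [
    ("Slovenija", 13),
    ("Svet", 14),
    ("EU", 116),
    ("Poslovni dnevnik", 5),
    ("Kronika", 15),
    ("Kultura", 17),
    ("Zdravje", 18),
    ("Znanost in IT", 19),
    ("(Ne)verjetno", 20),
    ("E-strada", 21),
    ("Svet vozil", 22),
]


def build_feeds(sections, article_type=1):
    """Return (title, url) tuples in the shape the feeds attribute expects."""
    return [
        (name, BASE + "?" + urlencode({"articleType": article_type,
                                       "articleSection": section_id}))
        for name, section_id in sections
    ]


feeds = build_feeds(SECTIONS)
for title, url in feeds:
    print(title, url)
```

The earlier attempts in this thread used URLs of the form ?articleType=13 with no articleSection parameter. That difference may be why every section returned the same articles, although the thread does not confirm it.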

If anybody has an idea how to improve this, please comment!

Spoiler from post #1 (BlonG, 10-27-2010, 03:35 AM), first recipe attempt:

#!/usr/bin/env python
__license__ = 'GPL v3'
__copyright__ = '2010'
'''
dnevnik.si
'''
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre.web.feeds.news import BasicNewsRecipe

class Dnevnik(BasicNewsRecipe):
    title = u'Dnevnik.si'
    __author__ = 'Test'
    description = 'News'
    oldest_article = 5
    max_articles_per_feed = 20
    no_stylesheets = True
    use_embedded_content = False

    cover_url = 'http://www.dnevnik.si/dsg/dnevnik.si.gif'

    keep_only_tags = [dict(name='div', attrs={'id':['content', 'heading']})]

    remove_tags = [
        dict(name='div', attrs={'id':'header'})
        ,dict(name='div', attrs={'class':['related', 'tools', 'inside']})
        ,dict(name='dl', attrs={'class':'ad'})
    ]

    remove_tags_after = [dict(id='_iprom_inStream')]

    feeds = [
        (u'Izpostavljene novice', u'http://www.dnevnik.si/rss/?articleType=9')
        ,(u'Slovenija', u'http://www.dnevnik.si/rss/?articleType=13')
        ,(u'Svet', u'http://www.dnevnik.si/rss/?articleType=14')
        ,(u'Kronika', u'http://www.dnevnik.si/rss/?articleType=15')
        ,(u'Pop/kultura', u'http://www.dnevnik.si/rss/?articleType=17')
        ,(u'Zdravje', u'http://www.dnevnik.si/rss/?articleType=18')
    ]


Spoiler from post #3 (BlonG, 10-27-2010, 05:27 AM), second recipe attempt:

#!/usr/bin/env python
__license__ = 'GPL v3'
__copyright__ = '2010'
'''
dnevnik.si
'''
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre.web.feeds.news import BasicNewsRecipe

class Dnevnik(BasicNewsRecipe):
    title = u'Dnevnik.si'
    __author__ = 'Test'
    description = 'News'
    oldest_article = 5
    max_articles_per_feed = 20
    no_stylesheets = True
    use_embedded_content = False
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open(url)
        import mechanize
        print_url = url + '?version=print'
        response = br.follow_link(mechanize.Link(base_url = '', url = print_url, text = '', tag = '', attrs = []))
        html = response.read()
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

    cover_url = 'http://www.dnevnik.si/dsg/dnevnik.si.gif'

    keep_only_tags = [dict(name='div', attrs={'id':['content', 'heading']})]

    remove_tags = [
        dict(name='div', attrs={'id':'header'})
        ,dict(name='div', attrs={'class':['related', 'tools', 'inside']})
        ,dict(name='dl', attrs={'class':'ad'})
    ]

    remove_tags_after = [dict(id='_iprom_inStream')]

    feeds = [
        (u'Izpostavljene novice', u'http://www.dnevnik.si/rss/?articleType=9')
        ,(u'Slovenija', u'http://www.dnevnik.si/rss/?articleType=13')
        ,(u'Svet', u'http://www.dnevnik.si/rss/?articleType=14')
        ,(u'Kronika', u'http://www.dnevnik.si/rss/?articleType=15')
        ,(u'Pop/kultura', u'http://www.dnevnik.si/rss/?articleType=17')
        ,(u'Zdravje', u'http://www.dnevnik.si/rss/?articleType=18')
    ]
