Old 10-27-2010, 03:35 AM   #1
BlonG
Member
BlonG began at the beginning.
 
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
Please help to clean-up recipe

As a newbie, I am trying to learn how to create a recipe by following the examples in the Calibre User Manual.

To create a recipe from RSS that fetches the full article and not just the summary, I should use the print-version URL (the manual has an example for "bbc.co.uk"). My problem is that I can't get a URL for the full article, because the print link is "javascript:window.print()".

So I tried a different approach: removing and keeping certain tags.
The problem now is that I don't get the articles from the specific section (each section has its own RSS URL). The articles are divided into sections, but every section contains the same articles.
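
A quick way to confirm this kind of duplication is to compare the article links that each section's feed yields. The following is only a sketch with made-up sample data: the function name `shared_links` and the example.com links are hypothetical, and a real check would first have to download and parse each feed.

```python
def shared_links(links_by_section):
    """Return the links that appear in more than one section.

    links_by_section maps a section title to the list of article
    links found in that section's feed.
    """
    seen = {}
    for section, links in links_by_section.items():
        # set() so a link repeated inside one feed is counted once
        for link in set(links):
            seen.setdefault(link, []).append(section)
    return {link: sorted(secs) for link, secs in seen.items() if len(secs) > 1}


# Hypothetical sample data, not taken from the real feeds
sample = {
    'Slovenija': ['http://example.com/a', 'http://example.com/shared'],
    'Svet': ['http://example.com/b', 'http://example.com/shared'],
}
duplicates = shared_links(sample)
print(duplicates)
```

If nearly every link shows up under every section, the feed URLs are probably all returning the same list, and the URLs themselves are the first thing to check.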

The recipe is here:
Spoiler:
#!/usr/bin/env python

__license__ = 'GPL v3'
__copyright__ = '2010'
'''
dnevnik.si
'''

from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre.web.feeds.news import BasicNewsRecipe

class Dnevnik(BasicNewsRecipe):
    title = u'Dnevnik.si'
    __author__ = 'Test'
    description = 'News'
    oldest_article = 5
    max_articles_per_feed = 20
    no_stylesheets = True
    use_embedded_content = False

    cover_url = 'http://www.dnevnik.si/dsg/dnevnik.si.gif'

    keep_only_tags = [dict(name='div', attrs={'id': ['content', 'heading']})]

    remove_tags = [
        dict(name='div', attrs={'id': 'header'}),
        dict(name='div', attrs={'class': ['related', 'tools', 'inside']}),
        dict(name='dl', attrs={'class': 'ad'}),
    ]

    remove_tags_after = [dict(id='_iprom_inStream')]

    feeds = [
        (u'Izpostavljene novice', u'http://www.dnevnik.si/rss/?articleType=9'),
        (u'Slovenija', u'http://www.dnevnik.si/rss/?articleType=13'),
        (u'Svet', u'http://www.dnevnik.si/rss/?articleType=14'),
        (u'Kronika', u'http://www.dnevnik.si/rss/?articleType=15'),
        (u'Pop/kultura', u'http://www.dnevnik.si/rss/?articleType=17'),
        (u'Zdravje', u'http://www.dnevnik.si/rss/?articleType=18'),
    ]


Link to sections (RSS URLs): http://www.dnevnik.si/kaj_je_rss
RSS link to specific section: http://www.dnevnik.si/rss/?articleTy...icleSection=14
Article link: http://www.dnevnik.si/novice/svet/1042398632
Print link (label “Natisni”): javascript:window.print()

If this can be done without the "remove" and "keep" tags, by taking the full article URL from the "javascript" command, that would perhaps be better (and easier).
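
For reference, the print-version approach from the manual boils down to rewriting the article URL into the URL of a printer-friendly page. The sketch below shows the idea as a plain function. The example.com rewrite rule is purely hypothetical, and it only helps when the site really serves a separate print page, which a "javascript:window.print()" link does not provide. In a recipe, the same logic would go into a `print_version(self, url)` method of the recipe class.

```python
def print_version(url):
    """Map an article URL to its printer-friendly URL.

    The rewrite rule is a made-up example for a hypothetical site;
    each real site needs its own rule, if it has a print page at all.
    """
    return url.replace('http://www.example.com/',
                       'http://www.example.com/print/')


article = 'http://www.example.com/news/123'
print(print_version(article))
```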

Another thing: I am still looking for a kind expert to create a recipe for a magazine.
Old 10-27-2010, 04:19 AM   #2
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
Try reading this.

If that does not work, you could try using TamperData to see the request that is posted and recreate it.

Is there any way to see the print version in your browser, rather than just having it sent to the printer?

Last edited by marbs; 10-27-2010 at 04:24 AM.
Old 10-27-2010, 05:27 AM   #3
BlonG
Thank you for the link. I read it, but understood very little. The part "4. Getting obfuscated content" mentions a JavaScript function, but where should I copy it? If I add it to my recipe, I get an error, so I think something is missing.

I made some changes to the recipe, just copied and pasted from those instructions:
Spoiler:
#!/usr/bin/env python

__license__ = 'GPL v3'
__copyright__ = '2010'
'''
dnevnik.si
'''

from calibre.ebooks.BeautifulSoup import BeautifulSoup
# needed by get_obfuscated_article below; without it the method raises NameError
from calibre.ptempfile import PersistentTemporaryFile
from calibre.web.feeds.news import BasicNewsRecipe

class Dnevnik(BasicNewsRecipe):
    title = u'Dnevnik.si'
    __author__ = 'Test'
    description = 'News'
    oldest_article = 5
    max_articles_per_feed = 20
    no_stylesheets = True
    use_embedded_content = False
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open(url)
        import mechanize
        # '?version=print' is copied from the instructions as-is
        print_url = url + '?version=print'
        response = br.follow_link(mechanize.Link(base_url = '', url = print_url, text = '', tag = '', attrs = []))

        html = response.read()

        # save the downloaded page to a temporary file and return its path
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()

        return self.temp_files[-1].name

    cover_url = 'http://www.dnevnik.si/dsg/dnevnik.si.gif'

    keep_only_tags = [dict(name='div', attrs={'id': ['content', 'heading']})]

    remove_tags = [
        dict(name='div', attrs={'id': 'header'}),
        dict(name='div', attrs={'class': ['related', 'tools', 'inside']}),
        dict(name='dl', attrs={'class': 'ad'}),
    ]

    remove_tags_after = [dict(id='_iprom_inStream')]

    feeds = [
        (u'Izpostavljene novice', u'http://www.dnevnik.si/rss/?articleType=9'),
        (u'Slovenija', u'http://www.dnevnik.si/rss/?articleType=13'),
        (u'Svet', u'http://www.dnevnik.si/rss/?articleType=14'),
        (u'Kronika', u'http://www.dnevnik.si/rss/?articleType=15'),
        (u'Pop/kultura', u'http://www.dnevnik.si/rss/?articleType=17'),
        (u'Zdravje', u'http://www.dnevnik.si/rss/?articleType=18'),
    ]


Now my e-book is empty.

I tried TamperData, and "javascript:window.print()" requests the same URL as the article. So there is no way, at least none that I know of, to see the print version in the browser.
Old 10-27-2010, 05:57 AM   #4
BlonG
OK, I solved the problem.

I didn't bother with the JavaScript, because I didn't understand that stuff. I just searched for the right tags and followed other examples.

The recipe is for the Slovenian newspaper "Dnevnik.si".
Quote:
__license__ = 'GPL v3'
__copyright__ = '2010, BlonG'
'''
dnevnik.si
'''
from calibre.web.feeds.news import BasicNewsRecipe

class Dnevnik(BasicNewsRecipe):
    title = u'Dnevnik.si'
    __author__ = u'BlonG'
    description = u"Dnevnik je časnik z več kot polstoletno zgodovino. Pod sloganom »Življenje ima besedo« na svojih straneh prinaša bralcem bogastvo informacij, komentarjev in kolumen in raznovrstnost pogledov, zaznamovanih z odgovornostjo do posameznika in širše družbe."
    oldest_article = 3
    max_articles_per_feed = 20
    no_stylesheets = True
    use_embedded_content = False

    cover_url = 'http://dnk.dnevnik.si/media/uploads/_custom/dnevnik_casopisna_druzba.jpg'

    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
    '''

    html2lrf_options = ['--base-font-size', '10']

    keep_only_tags = [
        dict(name='div', attrs={'id': '_iprom_inStream'}),
        dict(name='div', attrs={'class': 'entry-content'}),
    ]

    remove_tags = [
        dict(name='div', attrs={'class': 'fb_article_top'}),
        dict(name='div', attrs={'class': 'related'}),
        dict(name='div', attrs={'class': 'fb_article_foot'}),
        dict(name='div', attrs={'class': 'spreading'}),
        dict(name='dl', attrs={'class': 'ad'}),
        dict(name='p', attrs={'class': 'report'}),
        dict(name='div', attrs={'class': 'hfeed comments'}),
        dict(name='dl', attrs={'id': 'entryPanel'}),
        dict(name='dl', attrs={'class': 'infopush ip_wide'}),
        dict(name='div', attrs={'class': 'sidebar'}),
        dict(name='dl', attrs={'class': 'bottom'}),
        dict(name='div', attrs={'id': 'footer'}),
    ]

    feeds = [
        (u'Slovenija', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=13'),
        (u'Svet', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=14'),
        (u'EU', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=116'),
        (u'Poslovni dnevnik', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=5'),
        (u'Kronika', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=15'),
        (u'Kultura', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=17'),
        (u'Zdravje', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=18'),
        (u'Znanost in IT', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=19'),
        (u'(Ne)verjetno', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=20'),
        (u'E-strada', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=21'),
        (u'Svet vozil', u'http://www.dnevnik.si/rss/?articleType=1&articleSection=22'),
    ]



If anybody has an idea how to improve this, please comment!
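
One possible cleanup, offered only as a sketch: all the feed URLs in the recipe share the same base and differ only in the `articleSection` number, so the list can be generated from (title, section id) pairs. The names `BASE`, `SECTIONS` and `build_feeds` are made up for this illustration, only three of the sections are shown, and `urllib.parse` assumes Python 3, so this would need adjusting for an older Python 2 based Calibre.

```python
from urllib.parse import urlencode

BASE = 'http://www.dnevnik.si/rss/'

# (feed title, articleSection id), taken from the feed URLs in the recipe
SECTIONS = [
    ('Slovenija', 13),
    ('Svet', 14),
    ('EU', 116),
]


def build_feeds(sections, article_type=1):
    """Return (title, url) tuples in the shape the recipe's feeds list uses."""
    return [
        (title, BASE + '?' + urlencode({'articleType': article_type,
                                        'articleSection': section_id}))
        for title, section_id in sections
    ]


feeds = build_feeds(SECTIONS)
for title, url in feeds:
    print(title, url)
```

Inside the recipe class, the result could be assigned to `feeds` in place of the hand-written list; whether that is clearer than the explicit list is a matter of taste.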

Last edited by BlonG; 10-27-2010 at 11:59 PM. Reason: Managed to create recipe

All times are GMT -4.