Custom recipes (archive, read-only) - Page 90

kiklop74 · 02-03-2010, 06:35 AM

New recipe for TidBITS:

kiklop74 · 02-03-2010, 06:52 AM

New recipe for Gizmodo:

kiklop74 · 02-03-2010, 07:16 AM

New recipe for News Straits Times from Malaysia:

TBR · 02-03-2010, 08:04 AM

I'm still having trouble getting a recipe for

http://p.yimg.com/bw/rss/nachrichten/bundeswehr.xml

that is cleared of unnecessary clutter; I'm still getting artifacts.
The modified basic news recipe works in principle and removes much of the clutter, but the output still includes, among other things, a "ghost" of an ad:

Quote:

class AdvancedUserRecipe1264591440(BasicNewsRecipe):
    title = u'Bundeswehr'
    oldest_article = 7
    max_articles_per_feed = 100
    remove_tags_after = dict(name='div', attrs={'id':'content'})
    remove_tags_before = dict(name='div', attrs={'id':'content'})
    feeds = [(u'Bundeswehr in AFP und AP', u'http://p.yimg.com/bw/rss/nachrichten/bundeswehr.xml')]

Could anyone jump in with advice?

I want to build a "filtered" recipe that scans several RSS feeds and drops every article that doesn't contain certain keywords, so that only matching news items end up in the created e-book. The result would be an instant press review on a given theme, person, or event. Kovidgoyal has confirmed that this is possible with calibre:

Quote:

Originally Posted by kovidgoyal

If you've seen http://bazaar.launchpad.net/~kovid/c.../feeds/news.py

there's not much more I can tell you. Basically, you can completely customize the news download process by overriding the methods of that class. So if you want to create a composite recipe, you would create a parse_index method that lists all the current articles in your various news sources. Then you would override postprocess_html to check for the required keywords and, if they are absent, return None.

but I'm afraid that this is currently beyond my programming/scripting skills. As this would be a rather extensive recipe, I'm hesitant to simply request it in this forum, but could someone post a recipe with a keyword filter so I can learn from the example?
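As a starting point, here is a minimal sketch of the keyword filter on its own, written as plain Python so it can be tried outside calibre. It assumes the article lists have the shape that parse_index returns in the recipes in this thread (a list of `(section_title, [article_dict, ...])` tuples). The names `matches_keywords`, `filter_feeds`, and the sample data are hypothetical, and the check only looks at titles and descriptions, not at the full article text that the postprocess_html approach would see.

```python
def matches_keywords(text, keywords):
    """Return True if any keyword occurs in text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def filter_feeds(feeds, keywords):
    """Keep only articles whose title or description mentions a keyword.

    `feeds` is assumed to be a list of (section_title, [article_dict, ...])
    tuples, the same shape a parse_index method returns.
    """
    filtered = []
    for section_title, articles in feeds:
        kept = [
            article for article in articles
            if matches_keywords(
                article.get('title', '') + ' ' + article.get('description', ''),
                keywords,
            )
        ]
        if kept:  # drop sections that end up with no matching articles
            filtered.append((section_title, kept))
    return filtered


# Hypothetical sample data, only to show the behaviour.
sample_feeds = [
    ('AFP', [
        {'title': 'Bundeswehr extends mission', 'url': 'http://example.com/1',
         'description': '', 'date': ''},
        {'title': 'Football results', 'url': 'http://example.com/2',
         'description': '', 'date': ''},
    ]),
    ('AP', [
        {'title': 'Weather update', 'url': 'http://example.com/3',
         'description': '', 'date': ''},
    ]),
]

press_review = filter_feeds(sample_feeds, ['bundeswehr'])
```

Inside a recipe, a parse_index method could collect the articles from all sources and finish with something like `return filter_feeds(feeds, ['keyword1', 'keyword2'])`. Whether title-only matching is good enough depends on how descriptive the feed titles are.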

kiklop74 · 02-03-2010, 09:32 AM

Quote:

Originally Posted by TBR

I'm still having trouble getting a recipe for

http://p.yimg.com/bw/rss/nachrichten/bundeswehr.xml

that is cleared of unnecessary clutter; I'm still getting artifacts.
The modified basic news recipe works in principle and removes much of the clutter, but the output still includes, among other things, a "ghost" of an ad:

Could anyone jump in with advice?

This is what you should put in your recipe for complete cleanup:

Code:

    remove_attributes  = ['width','height']
    remove_tags_before = dict(name='h1')
    remove_tags_after  = dict(name='div',attrs={'class':'ynw-article-body mod'})
    remove_tags        = [
                            dict(attrs={'id':['ynw-image-video-inset','ynw-more-news']})
                           ,dict(attrs={'class':['ynw-utility']})
                         ]
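Put together with the original Bundeswehr recipe, the result might look like the sketch below. All attribute values come from the two posts above, with these cleanup rules replacing the original `remove_tags_before`/`remove_tags_after` lines. The `try`/`except` around the import is an addition for illustration only (an assumption, not a calibre convention), so the file can also be loaded outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback so this sketch can be loaded outside calibre;
    # inside calibre the real base class is used.
    BasicNewsRecipe = object


class AdvancedUserRecipe1264591440(BasicNewsRecipe):
    title = u'Bundeswehr'
    oldest_article = 7
    max_articles_per_feed = 100

    # Cleanup rules from the reply above
    remove_attributes = ['width', 'height']
    remove_tags_before = dict(name='h1')
    remove_tags_after = dict(name='div', attrs={'class': 'ynw-article-body mod'})
    remove_tags = [
        dict(attrs={'id': ['ynw-image-video-inset', 'ynw-more-news']}),
        dict(attrs={'class': ['ynw-utility']}),
    ]

    feeds = [(u'Bundeswehr in AFP und AP',
              u'http://p.yimg.com/bw/rss/nachrichten/bundeswehr.xml')]
```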

kiklop74 · 02-03-2010, 10:04 AM

New recipe for Read It Later website:

Denny_ · 02-03-2010, 02:21 PM

In trying to create a custom recipe, I got as far as posting the feeds and getting the print version, but I'm having trouble cleaning up the extra links at the bottom of each article. At the end of the article the HTML file looks like this:

</p>
</div>
<div class="print-logo"></div>
<hr class="calibre3"/>
<div class="print-logo"></div>
<div class="print-logo">
<p class="calibre5"><a href="https://www.neodata.com/ITPS2.cgi?OrderType=Reply+Only&ItemCode=WSTD&amp;iResponse=WSTD.NEW">Subscribe now to The Weekly Standard!</a></p>
<p class="calibre5"><b class="calibre6">Get more from The Weekly Standard:</b> <a href="/feeds">Follow WeeklyStandard.com on RSS</a> and <a href="/newsletter/requestform.asp">sign-up for our free Newsletter.</a></p>
<p class="calibre5"><a href="/tws/advertising/default.asp">Contact our advertising team</a> for advertising and sponsorship on WeeklyStandard.com or in <b class="calibre6">The Weekly Standard.</b></p>
<p class="calibre5">Copyright 2010 Weekly Standard LLC.</p>
</div>
<hr class="calibre3"/>
<div class="print-logo"><strong class="calibre6">Source URL:</strong> <a href="http://www.theweeklystandard.com/blogs/obama-halts-nasas-constellation-program">http://www.theweeklystandard.com/blogs/obama-halts-nasas-constellation-program</a></div>
<div class="print-logo"></div>
<div class="navbar1">
<hr class="calibre3"/>
<p class="calibre7">
This article was downloaded by <b class="calibre6">calibre</b> from <a href="http://www.theweeklystandard.com/blogs/obama-halts-nasas-constellation-program">http://www.theweeklystandard.com/blogs/obama-halts-nasas-constellation-program</a>
</p>
<br class="print-logo"/><br class="print-logo"/>
| <a href="../index.html#article_0">Section menu</a>
|
</div></body>
</html>

Can someone tell me how best to eliminate this?

Thanks,

Denny

kiklop74 · 02-03-2010, 02:31 PM

Just add this to your recipe

Quote:

keep_only_tags = [dict(attrs={'class':['print-title','print-subtitle','print-author','print-date-issue','print-content']})]

gafleh · 02-03-2010, 03:11 PM

Hi everybody! I am new here, but I have been browsing this wonderful site for two days.

May I request help with a recipe for
http://www.islamqa.com/en/rss.xml

Thank you once again

exdream · 02-03-2010, 04:50 PM

Hi

I'm trying to make a recipe for http://szmobil.sueddeutsche.de/ and this is the code so far (with which I get "IndexError: list index out of range", Error Code: 1). Am I on the right track? Can somebody please tell me what is wrong?

...

    def parse_index(self):
        feeds = []
        for title, url in [('Politik', 'http://szmobil.sueddeutsche.de/show.php?section=Politik'),
                           ('Seite Drei', 'http://szmobil.sueddeutsche.de/show.php?section=Seite+drei'),
                           ('Meinungsseite', 'http://szmobil.sueddeutsche.de/show.php?section=Meinungsseite'),
                           ('Panorama', 'http://szmobil.sueddeutsche.de/show.php?section=Panorama'),
                           ('Feuilleton', 'http://szmobil.sueddeutsche.de/show.php?section=Feuilleton'),
                           ('Medien', 'http://szmobil.sueddeutsche.de/show.php?section=Medien'),
                           ('Wissen', 'http://szmobil.sueddeutsche.de/show.php?section=Wissen'),
                           ('Wirtschaft', u'http://szmobil.sueddeutsche.de/show.php?section=Wirtschaft'),
                           ('Sport', u'http://szmobil.sueddeutsche.de/show.php?section=Sport'),
                           ('Muenchen-Bayern', u'http://szmobil.sueddeutsche.de/show.php?section=M%FCnchen%2FBayern'),
                           ]:
            articles = self.nz_parse_section(url)
            if articles:
                feeds.append((title, articles))
        return feeds

    def nz_parse_section(self, url):
        soup = self.index_to_soup(url)
        # div = soup.find(attrs={'class': 'col-300 categoryList'})
        # date = div.find(attrs={'class': 'link-list-heading'})

        current_articles = []
        # for tag in date.findAllNext(attrs = {'class': ['linkList', 'link-list-heading']}):
        #     if tag.get('class') == 'link-list-heading':
        #         break
        for li in soup.findAll('li'):
            a = li.find('a', href=True)
            if a is None:
                continue
            title = self.tag_to_string(a)
            url = a.get('href', False)
            if not url or not title:
                continue
            # if url.startswith('/'):
            #     url = 'http://www.nzherald.co.nz'+url
            self.log('\t\tFound article:', title)
            self.log('\t\t\t', url)
            current_articles.append({'title': title, 'url': url, 'description':'', 'date':''})

        return current_articles
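One thing worth checking in code like this: the commented-out `url.startswith('/')` lines were carried over from another site's recipe, so relative links are currently passed on unchanged. A minimal sketch of making them absolute follows, assuming (not verified here) that the article links on szmobil are relative to the site root. `BASE_URL` and `absolute_url` are hypothetical names, and this is not claimed to be the cause of the IndexError.

```python
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

BASE_URL = 'http://szmobil.sueddeutsche.de/'  # site root taken from the post


def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the site root."""
    return urljoin(base, href)


# Relative links are resolved; absolute links pass through unchanged.
print(absolute_url('show.php?section=Politik'))
print(absolute_url('/show.php?section=Sport'))
print(absolute_url('http://example.com/other'))
```

In the loop above, `url = absolute_url(a.get('href'))` would then replace the commented-out prefixing.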

cix3 · 02-03-2010, 06:54 PM

Fix for a stylesheet issue; add feed for The Book

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class The_New_Republic(BasicNewsRecipe):
    title = 'The New Republic'
    __author__ = 'cix3'
    language = 'en'
    description = 'Intelligent, stimulating and rigorous examination of American politics, foreign policy and culture'
    timefmt = ' [%b %d, %Y]'

    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True

    remove_tags = [
            dict(name='div', attrs={'class':['print-logo', 'print-site_name', 'img-left', 'print-source_url']}),
            dict(name='hr', attrs={'class':'print-hr'}), dict(name='img')
            ]

    feeds = [
        ('Politics', 'http://www.tnr.com/rss/articles/Politics'),
        ('Books and Arts', 'http://www.tnr.com/rss/articles/Books-and-Arts'),
        ('Economy', 'http://www.tnr.com/rss/articles/Economy'),
        ('Environment and Energy', 'http://www.tnr.com/rss/articles/Environment-%2526-Energy'),
        ('Health Care', 'http://www.tnr.com/rss/articles/Health-Care'),
        ('Metro Policy', 'http://www.tnr.com/rss/articles/Metro-Policy'),
        ('World', 'http://www.tnr.com/rss/articles/World'),
        ('Film', 'http://www.tnr.com/rss/articles/Film'),
        ('Books', 'http://www.tnr.com/rss/articles/books'),
        ('The Book', 'http://www.tnr.com/rss/book'),
        ('Jonathan Chait', 'http://www.tnr.com/rss/blogs/Jonathan-Chait'),
        ('The Plank', 'http://www.tnr.com/rss/blogs/The-Plank'),
        ('The Treatment', 'http://www.tnr.com/rss/blogs/The-Treatment'),
        ('The Spine', 'http://www.tnr.com/rss/blogs/The-Spine'),
        ('The Vine', 'http://www.tnr.com/rss/blogs/The-Vine'),
        ('The Avenue', 'http://www.tnr.com/rss/blogs/The-Avenue'),
        ('William Galston', 'http://www.tnr.com/rss/blogs/William-Galston'),
        ('Simon Johnson', 'http://www.tnr.com/rss/blogs/Simon-Johnson'),
        ('Ed Kilgore', 'http://www.tnr.com/rss/blogs/Ed-Kilgore'),
        ('Damon Linker', 'http://www.tnr.com/rss/blogs/Damon-Linker'),
        ('John McWhorter', 'http://www.tnr.com/rss/blogs/John-McWhorter')
            ]

    def print_version(self, url):
        return url.replace('http://www.tnr.com/', 'http://www.tnr.com/print/')

lorenzov · 02-03-2010, 08:38 PM

Hi,
I haven't tried anything yet, but one thing I noticed is the comma after your last feed entry, which might mean the list is expecting another entry and finds a blank instead.

Code:

('Seite Drei', 'http://szmobil.sueddeutsche.de/show.php?section=Seite+drei'),
('Meinungsseite', 'http://szmobil.sueddeutsche.de/show.php?section=Meinungsseite'),
('Panorama', 'http://szmobil.sueddeutsche.de/show.php?section=Panorama'),
('Feuilleton', 'http://szmobil.sueddeutsche.de/show.php?section=Feuilleton'),
('Medien', 'http://szmobil.sueddeutsche.de/show.php?section=Medien'),
('Wissen', 'http://szmobil.sueddeutsche.de/show.php?section=Wissen'),
('Wirtschaft', u'http://szmobil.sueddeutsche.de/show.php?section=Wirtschaft'),
('Sport', u'http://szmobil.sueddeutsche.de/show.php?section=Sport'),
('Muenchen-Bayern', u'http://szmobil.sueddeutsche.de/show.php?section=M%FCnchen%2FBayern'),

It might point you in the right direction for tonight!
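A quick check, for what it's worth: Python allows a trailing comma after the last element of a list literal and does not add an empty entry for it, so the comma by itself is unlikely to explain the IndexError. A minimal sketch with two of the sections from the recipe:

```python
sections = [
    ('Politik', 'http://szmobil.sueddeutsche.de/show.php?section=Politik'),
    ('Sport', 'http://szmobil.sueddeutsche.de/show.php?section=Sport'),  # trailing comma
]

# Unpacking works for every entry; no blank element is created.
for title, url in sections:
    print(title, url)

print(len(sections))
```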

bhandarisaurabh · 02-03-2010, 09:05 PM

May I request a recipe for http://sethgodin.typepad.com/

cix3 · 02-03-2010, 10:43 PM

Very basic recipe. Feel free to enhance in any way...

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class SethGodin(BasicNewsRecipe):
    title = "Seth Godin's Blog"
    __author__ = 'cix3'
    language = 'en'
    description = 'Seth Godin - riffs on marketing, respect, and the ways ideas spread.'
    timefmt = ' [%b %d, %Y]'

    oldest_article = 30
    max_articles_per_feed = 100
    no_stylesheets = True

    remove_tags = [dict(name='script')]
    feeds = [('SethGodin', 'http://feeds.feedburner.com/typepad/sethsmainblog')]

MartynM · 02-04-2010, 07:32 AM

Hi Guys,

I love the Digital Spy site as it is a great source for anything to do with entertainment. It would be great to get it on my Kindle but I don't have a clue. I have looked at the recipe section and it may as well be written in Russian.

The site is www.digitalspy.co.uk. Any help and direction on how to get this as a newspaper download would be appreciated.

REGARDS

Martyn
