03-02-2011, 11:21 AM   #1
chewi
RBC.ru recipe

Hello.

I've written a recipe for RBC.ru:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1286819935(BasicNewsRecipe):
    title = u'RBC.ru'
    __author__ = 'A. Chewi'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    conversion_options = {'linearize_tables': True}
    remove_attributes = ['style']
    language = 'ru'
    timefmt = ' [%a, %d %b, %Y]'

    # Keep only the headline and article-body containers.
    keep_only_tags = [dict(name='h2', attrs={}),
                      dict(name='div', attrs={'class': 'box _ga1_on_'}),
                      dict(name='h1', attrs={'class': 'news_section'}),
                      dict(name='div', attrs={'class': 'news_body dotted_border_bottom'}),
                      dict(name='table', attrs={'class': 'newsBody'}),
                      dict(name='h2', attrs={'class': 'black'})]

    feeds = [(u'Главные новости', u'http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/rbc.ru/mainnews.rss'),  # Top news
             (u'Политика', u'http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/rbc.ru/politics.rss'),  # Politics
             (u'Экономика', u'http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/rbc.ru/economics.rss'),  # Economy
             (u'Общество', u'http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/rbc.ru/society.rss'),  # Society
             (u'Происшествия', u'http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/rbc.ru/incidents.rss'),  # Incidents
             (u'Финансовые новости Quote.rbc.ru', u'http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/quote.ru/mainnews.rss')]  # Financial news from Quote.rbc.ru

    # Drop video players, photo sliders, notes and print/publication links.
    remove_tags = [dict(name='div', attrs={'class': "video-frame"}),
                   dict(name='div', attrs={'class': "photo-container videoContainer videoSWFLinks videoPreviewSlideContainer notes"}),
                   dict(name='div', attrs={'class': "notes"}),
                   dict(name='div', attrs={'class': "publinks"}),
                   dict(name='a', attrs={'class': "print"}),
                   dict(name='div', attrs={'class': "photo-report_new notes newslider"}),
                   dict(name='div', attrs={'class': "videoContainer"}),
                   dict(name='div', attrs={'class': "videoPreviewSlideContainer"}),
                   dict(name='a', attrs={'class': "videoPreviewContainer"}),
                   dict(name='a', attrs={'class': "red"})]

    def preprocess_html(self, soup):
        # Replace text-only links with their plain text.
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    def print_version(self, url):
        # Fetch the print-friendly version of each article.
        return url + '?print=true'

It works well enough, but perhaps the experts have some remarks or suggestions about the code?
Thanks.
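
A side note on print_version in the recipe above: appending '?print=true' directly assumes the article URL never carries a query string of its own. The following is only a minimal sketch of a more defensive variant, assuming Python 3 and the standard library; the helper name add_print_flag is hypothetical and is not part of calibre.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_print_flag(url):
    """Return url with print=true added to its query string (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep whatever query parameters the URL already has.
    query = parse_qsl(parts.query, keep_blank_values=True)
    if ('print', 'true') not in query:
        query.append(('print', 'true'))
    # Rebuild the URL so that '?' and '&' are placed correctly.
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Inside a recipe, print_version could then simply return add_print_flag(url). Whether RBC.ru feed URLs ever include a query string is not something this thread confirms, so the plain concatenation may be perfectly adequate.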
03-14-2011, 08:05 AM   #2
chewi

Here's an updated recipe for RBC.ru (added a cover image, a description, and some other small changes):


# -*- coding: utf-8 -*-

from calibre.web.feeds.news import BasicNewsRecipe

class RBC_ru(BasicNewsRecipe):
    title = u'RBC.ru'
    __author__ = 'A. Chewi'
    # English: "Russian news agency RosBusinessConsulting (RBC): news feeds on politics,
    # economics and finance, analysis, commentary and forecasts, feature articles"
    description = u'Российское информационное агентство «РосБизнесКонсалтинг» (РБК) - ленты новостей политики, экономики и финансов, аналитические материалы, комментарии и прогнозы, тематические статьи'
    needs_subscription = False
    cover_url = 'http://pics.rbc.ru/img/fp_v4/skin/img/logo.gif'
    cover_margins = (80, 160, '#ffffff')
    oldest_article = 10
    max_articles_per_feed = 50
    summary_length = 200
    remove_empty_feeds = True
    no_stylesheets = True
    remove_javascript = True
    use_embedded_content = False
    conversion_options = {'linearize_tables': True}
    language = 'ru'
    timefmt = ' [%a, %d %b, %Y]'

    feeds = [(u'Главные новости', u'http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/rbc.ru/mainnews.rss'),  # Top news
             (u'Политика', u'http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/rbc.ru/politics.rss'),  # Politics
             (u'Экономика', u'http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/rbc.ru/economics.rss'),  # Economy
             (u'Общество', u'http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/rbc.ru/society.rss'),  # Society
             (u'Происшествия', u'http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/rbc.ru/incidents.rss'),  # Incidents
             (u'Финансовые новости Quote.rbc.ru', u'http://static.feed.rbc.ru/rbc/internal/rss.rbc.ru/quote.ru/mainnews.rss')]  # Financial news from Quote.rbc.ru

    # Keep only the headline and article-body containers.
    keep_only_tags = [dict(name='h2', attrs={}),
                      dict(name='div', attrs={'class': 'box _ga1_on_'}),
                      dict(name='h1', attrs={'class': 'news_section'}),
                      dict(name='div', attrs={'class': 'news_body dotted_border_bottom'}),
                      dict(name='table', attrs={'class': 'newsBody'}),
                      dict(name='h2', attrs={'class': 'black'})]

    # Drop video players, photo sliders, notes and print/publication links.
    remove_tags = [dict(name='div', attrs={'class': "video-frame"}),
                   dict(name='div', attrs={'class': "photo-container videoContainer videoSWFLinks videoPreviewSlideContainer notes"}),
                   dict(name='div', attrs={'class': "notes"}),
                   dict(name='div', attrs={'class': "publinks"}),
                   dict(name='a', attrs={'class': "print"}),
                   dict(name='div', attrs={'class': "photo-report_new notes newslider"}),
                   dict(name='div', attrs={'class': "videoContainer"}),
                   dict(name='div', attrs={'class': "videoPreviewSlideContainer"}),
                   dict(name='a', attrs={'class': "videoPreviewContainer"}),
                   dict(name='a', attrs={'class': "red"})]

    def preprocess_html(self, soup):
        # Replace text-only links with their plain text.
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    def print_version(self, url):
        # Fetch the print-friendly version of each article.
        return url + '?print=true'
Attached file: rbc_ru.zip (zip, 1.6 KB)
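
For anyone wondering what preprocess_html does in this recipe: it replaces every link whose content is plain text with that text, and leaves links that wrap other tags (an image, for example) untouched. The sketch below imitates that effect on a raw HTML string with a regular expression so that it runs without calibre. This is a swapped-in technique for illustration only, since the recipe itself works on the parsed soup, and the function name unwrap_text_links is hypothetical.

```python
import re

# An <a ...> tag whose content is plain text only (no nested tags).
# This roughly mirrors the "alink.string is not None" check in the recipe.
TEXT_LINK = re.compile(r'<a\b[^>]*>([^<]+)</a>', re.IGNORECASE)


def unwrap_text_links(html):
    """Replace text-only links with their text; keep all other links as they are."""
    return TEXT_LINK.sub(r'\1', html)


sample = '<p>See <a href="x">this</a> and <a href="y"><img src="i.png"/></a></p>'
result = unwrap_text_links(sample)
```

Here the first link is reduced to its text, while the second one survives because it contains an image rather than plain text. Regular expressions are fragile on real-world HTML, so in an actual recipe the soup-based loop is the safer choice.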