10-28-2010, 02:02 AM   #1
motorro
Junior Member
motorro began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Oct 2010
Location: Russia, Moscow
Device: Kindle v3
Lenta.ru recipe

Hello!

I've created a Lenta.ru recipe.

It filters out the messy Lenta markup, uses the main Lenta feed, and groups the articles by category.
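
For readers who want to see the "divided by category" idea on its own, here is a minimal standalone sketch that does not need calibre. The function name `group_by_category`, the simplified entry dicts (tags as plain strings) and the sample data are all made up for illustration; the recipe itself reads entries from the parsed RSS feed.

```python
def group_by_category(entries, sort_order, default='_default'):
    """Group feed entries into (category, articles) pairs.

    Entries without usable tags go to `default`. Categories named in
    `sort_order` come first, in that order; the rest follow in the
    order they were first seen.
    """
    groups = {}
    order_seen = []

    def bucket(name):
        # Return the article list for a category, creating it on first use
        if name not in groups:
            groups[name] = []
            order_seen.append(name)
        return groups[name]

    for entry in entries:
        title = entry.get('title', '')
        link = entry.get('link', '')
        if not title or not link:
            continue  # skip incomplete entries
        article = {'title': title, 'url': link}
        terms = [t for t in entry.get('tags', []) if t]
        if not terms:
            bucket(default).append(article)
            continue
        for term in terms:
            bucket(term).append(article)

    first = [name for name in sort_order if name in groups]
    rest = [name for name in order_seen if name not in first]
    return [(name, groups[name]) for name in first + rest]


# Hypothetical sample data, not taken from the real feed.
sample = [
    {'title': 'A', 'link': 'http://example.com/a', 'tags': ['World']},
    {'title': 'B', 'link': 'http://example.com/b', 'tags': ['Russia']},
    {'title': 'C', 'link': 'http://example.com/c'},
    {'title': '', 'link': 'http://example.com/d', 'tags': ['World']},
]
result = group_by_category(sample, ['_default', 'Russia'])
print([(name, [a['title'] for a in arts]) for name, arts in result])
```

An article with several tags lands in several categories, which is the same trade-off the recipe makes.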

Please don't judge too harshly - it's my first Calibre/Python experience :O)

Code:
#!/usr/bin/env python

'''
Lenta.ru
'''

import re

from calibre.web.feeds.feedparser import parse
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, NavigableString, Tag

class LentaRURecipe(BasicNewsRecipe):
    title = u'Lenta.ru: Новости'
    __author__ = 'Nikolai Kotchetkov'
    publisher = 'lenta.ru'
    category = 'news, Russia'
    description = u'Ежедневная интернет-газета. Новости со всего мира на русском языке'
    oldest_article = 3
    max_articles_per_feed = 100
    
    masthead_url = u'http://img.lenta.ru/i/logowrambler.gif'
    cover_url = u'http://img.lenta.ru/i/logowrambler.gif'

    #Feeds named here appear first, in this order; all other feeds follow
    sortOrder = [u'_default', u'В России', u'б.СССР', u'В мире']

    encoding = 'cp1251'
    language = 'ru'
    no_stylesheets = True
    remove_javascript = True
    recursions = 0
    
    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        }
    
	
    keep_only_tags = [dict(name='td', attrs={'class':['statya','content']})]

    remove_tags_after = [
        dict(name='p', attrs={'class':'links'}),
        dict(name='div', attrs={'id':'readers-block'}),
    ]

    remove_tags = [
        dict(name='table', attrs={'class':['vrezka','content']}),
        dict(name='div', attrs={'class':'b240'}),
        dict(name='div', attrs={'id':'readers-block'}),
        dict(name='p', attrs={'class':'links'}),
    ]

    feeds = [u'http://lenta.ru/rss/']

    extra_css = (
        'h1 {font-size: 1.2em; margin: 0em 0em 0em 0em;} '
        'h2 {font-size: 1.0em; margin: 0em 0em 0em 0em;} '
        'h3 {font-size: 0.8em; margin: 0em 0em 0em 0em;}'
    )

    def parse_index(self):
        #NotImplementedError tells calibre that no index is available,
        #so it falls back to its regular feed parsing
        try:
            feedData = parse(self.feeds[0])
            if not feedData:
                raise NotImplementedError
            self.log("parse_index: Feed loaded successfully.")
            if 'title' in feedData.feed:
                self.title = feedData.feed.title
                self.log("parse_index: Title updated to: ", self.title)
            if 'image' in feedData.feed:
                self.log("parse_index: Feed has an image.")

            feeds = {}

            def get_virtual_feed_articles(feed):
                #Return the article list of a feed, creating it on first use
                if feed in feeds:
                    return feeds[feed][1]
                self.log("Adding new feed: ", feed)
                articles = []
                feeds[feed] = (feed, articles)
                return articles
            
            #Iterate feed items and distribute articles using tags
            for item in feedData.entries:
                link = item.get('link', '')
                title = item.get('title', '')
                if not link or not title:
                    continue
                article = {
                    'title': title,
                    'url': link,
                    'description': item.get('description', ''),
                    'date': item.get('date', ''),
                    'content': '',
                }
                if 'tags' not in item:
                    get_virtual_feed_articles('_default').append(article)
                    continue
                #Add the article to '_default' at most once,
                #however many of its tags have an empty term
                addedToDefault = False
                for tag in item.tags:
                    term = tag.get('term', '')
                    if not term:
                        if not addedToDefault:
                            get_virtual_feed_articles('_default').append(article)
                            addedToDefault = True
                        continue
                    get_virtual_feed_articles(term).append(article)
                
            #Get feed list
            #Select sorted feeds first of all
            result = []
            for feedName in self.sortOrder:
                if feedName not in feeds:
                    continue
                result.append(feeds[feedName])
                del feeds[feedName]
            result.extend(feeds.values())

            return result

        except Exception as err:
            self.log(err)
            raise NotImplementedError

    def preprocess_html(self, soup):
        return self.adeify_images(soup)

    def postprocess_html(self, soup, first_fetch):
        #self.log('Original: ', soup.prettify())
        
        contents = Tag(soup, 'div')
        
        #Extract tags with given attributes
        extractElements = {'div' : [{'id' : 'readers-block'}]}

        #Remove all elements that were not extracted before
        for tag, attrs in extractElements.items():
            for attr in attrs:
                for pieceOfGarbage in soup.findAll(tag, attr):
                    pieceOfGarbage.extract()
        
        #Find article text using header
        #and add all elements to contents
        element = soup.find(['h1', 'h2'])
        if element:
            element.name = 'h1'
        while element:
            nextElement = element.nextSibling
            element.extract()
            contents.insert(len(contents.contents), element)
            element = nextElement

        #Place article date after header
        dates = soup.findAll(text=re.compile(r'\d{2}\.\d{2}\.\d{4}, \d{2}:\d{2}:\d{2}'))
        for date in dates:
            parent = date.parent
            if parent and isinstance(parent, Tag) and 'div' == parent.name and 'dt' == parent.get('class'):
                #Date div found
                parent.extract()
                parent['style'] = 'font-size: 0.5em; color: gray; font-family: monospace;'
                contents.insert(1, parent)
                break
                        
        #Place article picture after date
        pic = soup.find('img')
        if pic:
            picDiv = Tag(soup, 'div')
            picDiv['style'] = 'width: 100%; text-align: center;'
            pic.extract()
            picDiv.insert(0, pic)
            title = pic.get('title', None)
            if title:
                titleDiv = Tag(soup, 'div')
                titleDiv['style'] = 'font-size: 0.5em;'
                titleDiv.insert(0, NavigableString(title))
                picDiv.insert(1, titleDiv)
            contents.insert(2, picDiv)
                
        body = soup.find('td', {'class':['statya','content']})
        if body:
            body.replaceWith(contents)
        
        #self.log('Result: ', soup.prettify())
        return soup
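
The date lookup in postprocess_html hinges on a regular expression for stamps of the form DD.MM.YYYY, HH:MM:SS. As a rough sketch, the pattern can be tried on its own with only the standard library. The helper name `find_date_stamp` and the sample strings are invented for illustration and are not taken from lenta.ru.

```python
import re

# Raw string, so the backslashes reach the regex engine unchanged
DATE_STAMP = re.compile(r'\d{2}\.\d{2}\.\d{4}, \d{2}:\d{2}:\d{2}')


def find_date_stamp(text):
    """Return the first date stamp found in text, or None."""
    match = DATE_STAMP.search(text)
    return match.group(0) if match else None


# Made-up sample strings
print(find_date_stamp('28.10.2010, 02:02:00 | Example headline'))
print(find_date_stamp('no date here'))
```

The pattern only checks the shape of the stamp, not whether the date is valid, which is enough for locating the date div.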
10-30-2010, 03:25 PM   #2
vega-m
Junior Member
vega-m began at the beginning.
 
Posts: 2
Karma: 10
Join Date: Oct 2010
Device: B&N Nook
Thanks a lot! There is a real shortage of recipes for Russian users of calibre, and this one works perfectly! I hope there are more recipes to come from you :)
10-30-2010, 05:24 PM   #3
desertgrandma
Enjoying the show....
desertgrandma ought to be getting tired of karma fortunes by now.
Posts: 14,270
Karma: 10462841
Join Date: Jun 2008
Location: Arizona
Device: A K1, Kindle Paperwhite, an Ipod, IPad2, Iphone, an Ipad Mini & macAir
Welcome to MobileRead, vega-m and motorro!

Consider using the "introduce yourself" link below... this is a great place to learn and find like-minded members.
11-09-2010, 04:13 AM   #4
motorro
Quote:
Originally Posted by vega-m
Thanks a lot! There is a real shortage of recipes for Russian users of calibre, and this one works perfectly! I hope there are more recipes to come from you :)
Here is one for Vedomosti (Ведомости).
01-16-2011, 05:33 PM   #5
Good Hedgehog
Junior Member
Good Hedgehog began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Oct 2010
Location: Saint Petersburg, Russia
Device: Amazon Kindle 3
For the Kindle 3G:
Lenta http://pda.lenta.ru/
RBC http://pda.rbc.ru/
Newsru http://palm.newsru.com/
RIA http://pda.rian.ru/
PDA sites http://izbranoe.narod.ru/Kindle.html
Tags
lenta.ru, news, russia