03-23-2011, 02:35 PM   #1
ironcat
Junior Member
ironcat began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Mar 2011
Location: Budapest, Hungary
Device: Kindle 3 Wi-Fi
Improved recipe for Hungarian '168 óra'

Here is an improved recipe for '168 óra'. It excludes the irrelevant parts of each article page:
Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe

class hu168ora(BasicNewsRecipe):
    title                 = u'168 óra'
    __author__            = u'István Papp'
    description           = u'A 168 óra friss hírei'
    timefmt               = ' [%Y. %b. %d., %a.]'
    oldest_article        = 7
    language              = 'hu'

    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    encoding              = 'utf8'
    publisher             = u'Telegráf Kiadó'
    category              = u'news, hírek, 168'
    extra_css             = 'body{ font-family: Verdana,Helvetica,Arial,sans-serif } .lead{font-weight: bold} h2{text-align: center; text-transform: uppercase} '
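    # Strip HTML comments before the page is parsed
    preprocess_regexps    = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
    # Drop everything before the article header ('cikk_fejlec') and after the article text ('szoveg')
    remove_tags_before    = dict(id='cikk_fejlec')
    remove_tags_after     = dict(id='szoveg')
    # Remove the toolbar box inside the article
    remove_tags           = [dict(id='box_toolbar')]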
    remove_javascript     = True
    remove_empty_feeds    = True


    feeds = [
              (u'Itthon', u'http://www.168ora.hu/static/rss/cikkek_itthon.xml')
             ,(u'Glóbusz', u'http://www.168ora.hu/static/rss/cikkek_globusz.xml')
             ,(u'Punch', u'http://www.168ora.hu/static/rss/cikkek_punch.xml')
             ,(u'Arte', u'http://www.168ora.hu/static/rss/cikkek_arte.xml')
             ,(u'Buxa', u'http://www.168ora.hu/static/rss/cikkek_buxa.xml')
             ,(u'Sebesség', u'http://www.168ora.hu/static/rss/cikkek_sebesseg.xml')
             ,(u'Tudás', u'http://www.168ora.hu/static/rss/cikkek_tudas.xml')
             ,(u'Sport', u'http://www.168ora.hu/static/rss/cikkek_sport.xml')
             ,(u'Vélemény', u'http://www.168ora.hu/static/rss/cikkek_velemeny.xml')
             ,(u'Dolce Vita', u'http://www.168ora.hu/static/rss/cikkek_dolcevita.xml')
             ,(u'Rádió', u'http://www.168ora.hu/static/rss/radio.xml')
            ]

    def print_version(self, url):
        # Fetch the print view of the article instead of the regular page
        return url + '?print=1'
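
For anyone who wants to try the two text-level pieces of this recipe outside calibre, here is a minimal sketch using only the standard library. It reuses the same regular expression as `preprocess_regexps` and the same suffix as `print_version`; the sample HTML and the article URL are made up for illustration and are not taken from 168ora.hu.

```python
import re

# Same pattern and replacement as the recipe's preprocess_regexps entry.
# re.DOTALL lets '.' match newlines, so multi-line comments are removed too;
# the non-greedy '.*?' stops at the first closing '-->'.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)


def strip_comments(html):
    """Remove every HTML comment from the given markup."""
    return COMMENT_RE.sub(lambda m: '', html)


def print_version(url):
    """Standalone copy of the recipe method: append the print-view query."""
    return url + '?print=1'


if __name__ == '__main__':
    # Hypothetical sample markup, only for demonstration.
    sample = '<p>one</p><!-- ad\nblock --><p>two</p><!-- x --><p>three</p>'
    print(strip_comments(sample))
    # Hypothetical article URL, only for demonstration.
    print(print_version('http://www.168ora.hu/itthon/example-article'))
```

Note that appending `?print=1` assumes the article URL from the feed has no query string of its own; if it did, the suffix would have to start with `&` instead.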
03-24-2011, 06:53 AM   #2
ironcat
Duplicated text

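This new variant removes the duplicated text from the articles. It keeps only the `cikk_fejlec` and `cikk_torzs` elements, additionally removes the element with id `text`, and disables the Rádió feed: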
Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe

class hu168ora(BasicNewsRecipe):
    title                 = u'168 óra'
    __author__            = u'István Papp'
    description           = u'A 168 óra friss hírei'
    timefmt               = ' [%Y. %b. %d., %a.]'
    oldest_article        = 7
    language              = 'hu'

    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    encoding              = 'utf8'
    publisher             = u'Telegráf Kiadó'
    category              = u'news, hírek, 168'
    extra_css             = 'body{ font-family: Verdana,Helvetica,Arial,sans-serif }'
    preprocess_regexps    = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
    keep_only_tags        = [
                              dict(id='cikk_fejlec')
                             ,dict(id='cikk_torzs')
                            ]
#    remove_tags_before    = dict(id='cikk_fejlec')
#    remove_tags_after     = dict(id='szoveg')
    remove_tags           = [
                              dict(id='box_toolbar')
                             ,dict(id='text')
                            ]
    remove_javascript     = True
    remove_empty_feeds    = True


    feeds = [
              (u'Itthon', u'http://www.168ora.hu/static/rss/cikkek_itthon.xml')
             ,(u'Glóbusz', u'http://www.168ora.hu/static/rss/cikkek_globusz.xml')
             ,(u'Punch', u'http://www.168ora.hu/static/rss/cikkek_punch.xml')
             ,(u'Arte', u'http://www.168ora.hu/static/rss/cikkek_arte.xml')
             ,(u'Buxa', u'http://www.168ora.hu/static/rss/cikkek_buxa.xml')
             ,(u'Sebesség', u'http://www.168ora.hu/static/rss/cikkek_sebesseg.xml')
             ,(u'Tudás', u'http://www.168ora.hu/static/rss/cikkek_tudas.xml')
             ,(u'Sport', u'http://www.168ora.hu/static/rss/cikkek_sport.xml')
             ,(u'Vélemény', u'http://www.168ora.hu/static/rss/cikkek_velemeny.xml')
             ,(u'Dolce Vita', u'http://www.168ora.hu/static/rss/cikkek_dolcevita.xml')
#             ,(u'Rádió', u'http://www.168ora.hu/static/rss/radio.xml')
            ]

    def print_version(self, url):
        # Fetch the print view of the article instead of the regular page
        return url + '?print=1'
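
To make the effect of `keep_only_tags` plus `remove_tags` easier to see, here is a rough standalone sketch. It is not calibre's implementation: it mimics the documented behaviour (keep only the listed elements and their children, then delete the listed elements) on a tiny, invented page using `xml.etree.ElementTree`. The markup in `PAGE` is an assumption about how such a page might be laid out, in particular that the `text` element holds the duplicated copy; the ids themselves come from the recipe.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page; the real 168ora.hu markup is not reproduced here.
PAGE = """<html><body>
  <div id="menu">navigation</div>
  <div id="cikk_fejlec"><h1>Title</h1></div>
  <div id="cikk_torzs">
    <div id="box_toolbar">share buttons</div>
    <div id="text">body copy (assumed duplicate)</div>
    <div id="szoveg">body copy</div>
  </div>
  <div id="footer">footer</div>
</body></html>"""

KEEP_IDS = ['cikk_fejlec', 'cikk_torzs']   # ids from keep_only_tags
REMOVE_IDS = ['box_toolbar', 'text']       # ids from remove_tags


def filter_page(xml_text, keep_ids, remove_ids):
    """Return the kept subtrees with the unwanted elements pruned."""
    root = ET.fromstring(xml_text)
    # Step 1: keep only the elements whose id is listed (with their children).
    kept = [el for el in root.iter() if el.get('id') in keep_ids]
    # Step 2: inside each kept subtree, delete elements whose id is listed.
    for subtree in kept:
        for parent in list(subtree.iter()):
            for child in list(parent):
                if child.get('id') in remove_ids:
                    parent.remove(child)
    return kept


def remaining_ids(subtrees):
    """List the ids that survive in each kept subtree, in document order."""
    return [[el.get('id') for el in s.iter() if el.get('id')] for s in subtrees]


if __name__ == '__main__':
    print(remaining_ids(filter_page(PAGE, KEEP_IDS, REMOVE_IDS)))
```

With this toy page only the header and the `szoveg` part of the body survive, which is the kind of result the recipe change is aiming for.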