Old 02-02-2014, 04:16 PM   #1
jennie
Member
jennie began at the beginning.
 
Posts: 14
Karma: 10
Join Date: Jun 2010
Device: kindle 3
New recipe for Kathimerini (Greek newspaper)

Here's a first shot at a recipe for the revised Kathimerini.
It only downloads today's news. Kathimerini usually updates at around 12:00 from Tuesday to Saturday and at 18:00 on Sundays (Athens time).

Without photos (2.5 MB for this Sunday's edition):
Spoiler:
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe

class Kathimerini(BasicNewsRecipe):
    title                  = 'Kathimerini'
    __author__             = 'jenniepet'
    description            = 'News from Greece'
    max_articles_per_feed  = 100
    oldest_article         = 1
    publisher              = 'Kathimerini'
    category               = 'news, GR'
    language               = 'el'
    encoding               = 'utf-8'
    conversion_options     = { 'linearize_tables': True}
    no_stylesheets         = True
    remove_tags_before     = dict(id='site-body')
    remove_tags_after      = [dict(id='social')]
    remove_tags            = [dict(attrs={'class':['post-tools', 'edition edition_PRINT', 'clearing-featured-img']})]

#Categories In order of Appearance: Politics-1-2-3, Greece-1-2, World-1-2, People-Specials
#Greek Economy-1-2, Business, International Economy, Real Estate
#Environment, Science, Technology, Culture-1-2, Travel, Sport
    feeds = [(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=2&type=&edition=PRINT&author=0&fromDate=&toDate='), 
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=2&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'), 
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE 3','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=2&type=&edition=PRINT&author=0&fromDate=&toDate=&page=2'), 
(u'\u0395\u03BB\u03BB\u03AC\u03B4\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=4&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03BB\u03BB\u03AC\u03B4\u03B1 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=4&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u039A\u03CC\u03C3\u03BC\u03BF\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=5&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u039A\u03CC\u03C3\u03BC\u03BF\u03C2 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=5&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u03A0\u03C1\u03CC\u03C3\u03C9\u03C0\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=41&cat=42&cat=43&cat=24&cat=25&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AE \u039F\u03B9\u03BA\u03BF\u03BD\u03BF\u03BC\u03AF\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=17&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AE \u039F\u03B9\u03BA\u03BF\u03BD\u03BF\u03BC\u03AF\u03B1 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=17&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u0395\u03C0\u03B9\u03C7\u03B5\u03B9\u03C1\u03AE\u03C3\u03B5\u03B9\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=18&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0394\u03B9\u03B5\u03B8\u03BD\u03AE\u03C2 \u039F\u03B9\u03BA\u03BF\u03BD\u03BF\u03BC\u03AF\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=19&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'Real Estate','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=21&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A0\u03B5\u03C1\u03B9\u03B2\u03AC\u03BB\u03BB\u03BF\u03BD','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=6&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03C0\u03B9\u03C3\u03C4\u03AE\u03BC\u03B7','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=7&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A4\u03B5\u03C7\u03BD\u03BF\u03BB\u03BF\u03B3\u03AF\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=47&cat=48&cat=49&cat=50&cat=51&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03C3\u03BC\u03CC\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=31&cat=32&cat=33&cat=34&cat=35&cat=36&cat=37&cat=38&cat=39&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03C3\u03BC\u03CC\u03C2 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=31&cat=32&cat=33&cat=34&cat=35&cat=36&cat=37&cat=38&cat=39&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u03A4\u03B1\u03BE\u03AF\u03B4\u03B9\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=11&cat=10&cat=12&cat=14&cat=15&cat=13&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0391\u03B8\u03BB\u03B7\u03C4\u03B9\u03C3\u03BC\u03CC\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=3&type=&edition=PRINT&author=0&fromDate=&toDate=')
]

    def get_cover_url(self):
       import time
       return 'http://s.kathimerini.gr/resources/issue-cover/%s.jpg' %time.strftime('%d-%m-%Y')

With photos (10MB for this Sunday's edition):
Spoiler:
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe

class Kathimerini(BasicNewsRecipe):
    title                  = 'Kathimerini'
    __author__             = 'jenniepet'
    description            = 'News from Greece'
    max_articles_per_feed  = 100
    oldest_article         = 1
    publisher              = 'Kathimerini'
    category               = 'news, GR'
    language               = 'el'
    encoding               = 'utf-8'
    conversion_options     = { 'linearize_tables': True}
    no_stylesheets         = True
    remove_tags_before     = dict(id='site-body')
    remove_tags_after      = [dict(id='social')]
    remove_tags            = [dict(attrs={'class':['post-tools', 'edition edition_PRINT']})]
#to remove images comment the line above and uncomment the line below
#    remove_tags            = [dict(attrs={'class':['post-tools', 'edition edition_PRINT', 'clearing-featured-img']})]

#Categories In order of Appearance: Politics-1-2-3, Greece-1-2, World-1-2, People-Specials
#Greek Economy-1-2, Business, International Economy, Real Estate
#Environment, Science, Technology, Culture-1-2, Travel, Sport
    feeds = [(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=2&type=&edition=PRINT&author=0&fromDate=&toDate='), 
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=2&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'), 
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE 3','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=2&type=&edition=PRINT&author=0&fromDate=&toDate=&page=2'), 
(u'\u0395\u03BB\u03BB\u03AC\u03B4\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=4&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03BB\u03BB\u03AC\u03B4\u03B1 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=4&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u039A\u03CC\u03C3\u03BC\u03BF\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=5&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u039A\u03CC\u03C3\u03BC\u03BF\u03C2 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=5&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u03A0\u03C1\u03CC\u03C3\u03C9\u03C0\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=41&cat=42&cat=43&cat=24&cat=25&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AE \u039F\u03B9\u03BA\u03BF\u03BD\u03BF\u03BC\u03AF\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=17&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AE \u039F\u03B9\u03BA\u03BF\u03BD\u03BF\u03BC\u03AF\u03B1 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=17&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u0395\u03C0\u03B9\u03C7\u03B5\u03B9\u03C1\u03AE\u03C3\u03B5\u03B9\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=18&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0394\u03B9\u03B5\u03B8\u03BD\u03AE\u03C2 \u039F\u03B9\u03BA\u03BF\u03BD\u03BF\u03BC\u03AF\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=19&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'Real Estate','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=21&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A0\u03B5\u03C1\u03B9\u03B2\u03AC\u03BB\u03BB\u03BF\u03BD','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=6&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03C0\u03B9\u03C3\u03C4\u03AE\u03BC\u03B7','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=7&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A4\u03B5\u03C7\u03BD\u03BF\u03BB\u03BF\u03B3\u03AF\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=47&cat=48&cat=49&cat=50&cat=51&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03C3\u03BC\u03CC\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=31&cat=32&cat=33&cat=34&cat=35&cat=36&cat=37&cat=38&cat=39&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03C3\u03BC\u03CC\u03C2 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=31&cat=32&cat=33&cat=34&cat=35&cat=36&cat=37&cat=38&cat=39&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u03A4\u03B1\u03BE\u03AF\u03B4\u03B9\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=11&cat=10&cat=12&cat=14&cat=15&cat=13&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0391\u03B8\u03BB\u03B7\u03C4\u03B9\u03C3\u03BC\u03CC\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=3&type=&edition=PRINT&author=0&fromDate=&toDate=')
]

    def get_cover_url(self):
       import time
       return 'http://s.kathimerini.gr/resources/issue-cover/%s.jpg' %time.strftime('%d-%m-%Y')

A few questions:
I don't know how to add the cover image. Can anyone help? The cover picture for yesterday, February 1st, 2014, was "http://s.kathimerini.gr/resources/issue-cover/01-02-2014.jpg"

I tried using only one RSS feed, but it wouldn't download more than 30 articles. So, I added &page=1 etc. to get the older articles. Is there a better way to do that?
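
One way to make the paging less repetitive is to generate the feed URLs instead of typing each one out. The sketch below is only an illustration: the helper names `feed_url` and `paged_feeds` and the `SECTIONS` table are hypothetical, and it assumes the site keeps accepting the query parameters used in the recipe above (`cat`, `edition=PRINT`, `page`). It does not lift the limit of about 30 articles per feed; it only builds the extra page URLs.

```python
BASE = 'http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p'
TAIL = '&type=&edition=PRINT&author=0&fromDate=&toDate='


def feed_url(cats, page=0):
    """Build the search-RSS URL for one or more category ids and a page index."""
    url = BASE + ''.join('&cat=%d' % c for c in cats) + TAIL
    # The first page carries no page parameter; later pages use &page=1, &page=2, ...
    if page:
        url += '&page=%d' % page
    return url


def paged_feeds(title, cats, pages=1):
    """Return (title, url) tuples in the shape the recipe's `feeds` list uses."""
    result = []
    for page in range(pages):
        # Page 0 keeps the plain title; page 1 becomes "Title 2", and so on
        label = title if page == 0 else '%s %d' % (title, page + 1)
        result.append((label, feed_url(cats, page)))
    return result


# Hypothetical section table: (title, category ids, number of pages to fetch)
SECTIONS = [
    (u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE', [2], 3),
    (u'\u0395\u03BB\u03BB\u03AC\u03B4\u03B1', [4], 2),
    (u'\u0391\u03B8\u03BB\u03B7\u03C4\u03B9\u03C3\u03BC\u03CC\u03C2', [3], 1),
]

feeds = [f for title, cats, pages in SECTIONS for f in paged_feeds(title, cats, pages)]
```

Inside the recipe class, the resulting list would be assigned to `feeds` in place of the hand-written one. The full section table would need every category id from the recipe above.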

Is there a way to include cartoons but not other images? Apparently, the only thing that differentiates them from other images is that cartoon pages include the class "article_SKETCH" in the <body> tag. I removed all images with:
Code:
    remove_tags            = [dict(attrs={'class':['clearing-featured-img']})]

Last edited by jennie; 02-06-2014 at 07:54 AM. Reason: Updated recipe code
Old 02-02-2014, 09:39 PM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Code:
def get_cover_url(self):
    import time
    # Covers are named day-month-year, e.g. 01-02-2014.jpg for February 1st, 2014
    return 'http://s.kathimerini.gr/resources/issue-cover/%s.jpg' % time.strftime('%d-%m-%Y')
You can concatenate rss feeds by re-implementing the parse_feeds() function in your recipe.
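
As a rough illustration of that suggestion, the sketch below folds feeds titled "X 2", "X 3" and so on into the feed titled "X". The helper name `merge_paged_feeds` is hypothetical, and it assumes the objects returned by `parse_feeds()` expose a `title` string and an `articles` list; the stand-in objects at the bottom exist only so the helper can be tried without calibre installed.

```python
import re
from types import SimpleNamespace


def merge_paged_feeds(parsed_feeds):
    """Fold feeds titled 'X 2', 'X 3', ... into the feed titled 'X'."""
    merged = []      # keeps the original section order
    by_title = {}    # base title -> feed that collects the articles
    for feed in parsed_feeds:
        # Strip a trailing page number such as " 2" from the title
        base = re.sub(r'\s+\d+$', '', feed.title)
        if base in by_title:
            by_title[base].articles.extend(feed.articles)
        else:
            feed.title = base
            by_title[base] = feed
            merged.append(feed)
    return merged


# Inside the recipe class this would be wired up roughly as:
#     def parse_feeds(self):
#         return merge_paged_feeds(BasicNewsRecipe.parse_feeds(self))

# Stand-in objects (assumption: real feeds have .title and .articles)
demo = [
    SimpleNamespace(title='World', articles=['a', 'b']),
    SimpleNamespace(title='World 2', articles=['c']),
    SimpleNamespace(title='Sport', articles=['d']),
]
result = merge_paged_feeds(demo)
```

Note that a feed whose title legitimately ends in a number would also be renamed by this pattern, so the regular expression may need adjusting for other recipes.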

Implement preprocess_raw_html() in your recipe and replace clearing-featured-img with 'dont-remove-me' if the page is a comic page. Then remove_tags won't affect it.
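
A minimal sketch of that idea follows. The helper name `keep_cartoon_images` is hypothetical, and the pattern assumes the `<body>` tag carries a double-quoted `class` attribute that includes `article_SKETCH` on cartoon pages, as described in the first post; the real markup may differ.

```python
import re

# Assumption: the body tag looks like <body class="... article_SKETCH ...">
BODY_CLASS = re.compile(r'<body[^>]*\sclass="([^"]*)"', re.IGNORECASE)


def keep_cartoon_images(raw_html):
    """Rename the image class on cartoon pages so remove_tags skips it."""
    match = BODY_CLASS.search(raw_html)
    if match and 'article_SKETCH' in match.group(1).split():
        return raw_html.replace('clearing-featured-img', 'dont-remove-me')
    # Ordinary article: leave the HTML untouched so images are still removed
    return raw_html


# Inside the recipe class this would be wired up roughly as:
#     def preprocess_raw_html(self, raw_html, url):
#         return keep_cartoon_images(raw_html)
```

With this in place, `remove_tags` can keep listing `clearing-featured-img`, since cartoon pages no longer contain that class by the time the tags are removed.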
Old 02-03-2014, 03:16 AM   #3
jennie
Hi Kovid, thanks a lot for your reply.

I still can't get the cover to work. Here's my latest code:
Spoiler:
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe

class Kathimerini(BasicNewsRecipe):
    title                  = 'Kathimerini'
    __author__             = 'jenniepet'
    description            = 'News from Greece'
    max_articles_per_feed  = 100
    oldest_article         = 2
    publisher              = 'Kathimerini'
    category               = 'news, GR'
    language               = 'el'
    encoding               = 'utf-8'
    conversion_options     = { 'linearize_tables': True}
    no_stylesheets         = True
    remove_tags_before     = dict(id='site-body')
    remove_tags_after      = [dict(id='social')]
    remove_tags            = [dict(attrs={'class':['clearing-featured-img', 'post-tools', 'edition edition_PRINT']})]
    feeds = [(u'1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&type=&edition=PRINT&author=0&fromDate=0&toDate=0'), 
(u'2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&type=&edition=PRINT&author=0&fromDate=0&toDate=0&page=1'), 
(u'3','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&type=&edition=PRINT&author=0&fromDate=0&toDate=0&page=2'), 
(u'4','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&type=&edition=PRINT&author=0&fromDate=0&toDate=0&page=3'), 
(u'5','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&type=&edition=PRINT&author=0&fromDate=0&toDate=0&page=4')]

def get_cover_url(self):
     import time
     return 'http://s.kathimerini.gr/resources/issue-cover/02-%s.jpg' %time.strftime('%m-%Y')

I'm using 02 instead of %d for testing purposes, because there is no issue today.

I don't really know how to program in any language, so I'm having trouble using the rest of your advice. I don't think I want to try implementing parse_feeds at this point, but I did try to read up on preprocess_raw_html, with no tangible results yet.
I'd really appreciate it if you could give me the complete fixed code.
I guess what I need to implement is something along the lines of:
Code:
if body contains the  class "article_SKETCH" (among others)
replace 'class':['clearing-featured-img'] with 'class':['do-not-remove']
Old 02-03-2014, 03:21 AM   #4
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.kovidgoyal ought to be getting tired of karma fortunes by now.
 
kovidgoyal's Avatar
 
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I'm afraid I don't have the time to write the code for you. For the cover, you need to use this:

Code:
    def get_cover_url(self):
        import time
        # Covers are named day-month-year, e.g. 01-02-2014.jpg for February 1st, 2014
        return 'http://s.kathimerini.gr/resources/issue-cover/%s.jpg' % time.strftime('%d-%m-%Y')
The indentation of get_cover_url must be correct: it must be at the same level as the other members of the class.
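
To check the date format outside calibre, a small standalone sketch like the one below can help. The function name `cover_url_for` is hypothetical; the URL pattern and the sample date come from the first post, where the cover for February 1st, 2014 is given as `01-02-2014.jpg`.

```python
import datetime

COVER_TEMPLATE = 'http://s.kathimerini.gr/resources/issue-cover/%s.jpg'


def cover_url_for(day):
    """Return the cover URL for a given date, using day-month-year order."""
    return COVER_TEMPLATE % day.strftime('%d-%m-%Y')


print(cover_url_for(datetime.date(2014, 2, 1)))
# http://s.kathimerini.gr/resources/issue-cover/01-02-2014.jpg
```

On days when no issue is published, the generated URL will presumably not exist, so the download may end up without a cover.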
Old 02-03-2014, 04:43 AM   #5
jennie
The problem with the cover was the indentation. Thanks!

I have updated the code in the first post with this and a couple of other tags to remove. I might check the comics situation later.
If anyone else is interested in reading this newspaper and would like to give it a try, feel free.

As it is now, the code still gives a pretty clean result, so you could add it to the repository if there are no changes in, say, a week's time.

Last edited by jennie; 02-03-2014 at 04:51 AM.
Old 02-06-2014, 08:01 AM   #6
jennie
I updated the original recipe. It now downloads news in categories.
This issue persists:
Quote:
I tried using only one RSS feed, but it wouldn't download more than 30 articles. So, I added &page=1 etc. to get the older articles. Is there a better way to do that?
However, it only affects a couple of categories in the Sunday edition, so I'll leave it like that.
Old 07-18-2014, 11:13 AM   #7
papadi
Junior Member
papadi began at the beginning.
 
Posts: 1
Karma: 10
Join Date: Jul 2014
Device: Kindle PaperWhite
Could you please tell me how this works!? I'm new here. What is a recipe? Is it something that will allow me to read my paper on my kindle? What do I need to do?
All times are GMT -4. The time now is 07:59 AM.

