New recipe for Kathimerini (Greek newspaper)

jennie · 02-02-2014, 04:16 PM

Here's a first shot at a recipe for the revised Kathimerini.
It only downloads today's news. Kathimerini usually updates at around 12:00 from Tuesday to Saturday and at 18:00 on Sundays (Athens time).

Without photos (2.5 MB for this Sunday's edition):

Spoiler:

Code:

from calibre.web.feeds.recipes import BasicNewsRecipe

class Kathimerini(BasicNewsRecipe):
    title                  = 'Kathimerini'
    __author__             = 'jenniepet'
    description            = 'News from Greece'
    max_articles_per_feed  = 100
    oldest_article         = 1
    publisher              = 'Kathimerini'
    category               = 'news, GR'
    language               = 'el'
    encoding               = 'utf-8'
    conversion_options     = { 'linearize_tables': True}
    no_stylesheets         = True
    remove_tags_before     = dict(id='site-body')
    remove_tags_after      = [dict(id='social')]
    remove_tags            = [dict(attrs={'class':['post-tools', 'edition edition_PRINT', 'clearing-featured-img']})]

#Categories In order of Appearance: Politics-1-2-3, Greece-1-2, World-1-2, People-Specials
#Greek Economy-1-2, Business, International Economy, Real Estate
#Environment, Science, Technology, Culture-1-2, Travel, Sport
    feeds = [(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=2&type=&edition=PRINT&author=0&fromDate=&toDate='), 
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=2&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'), 
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE 3','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=2&type=&edition=PRINT&author=0&fromDate=&toDate=&page=2'), 
(u'\u0395\u03BB\u03BB\u03AC\u03B4\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=4&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03BB\u03BB\u03AC\u03B4\u03B1 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=4&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u039A\u03CC\u03C3\u03BC\u03BF\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=5&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u039A\u03CC\u03C3\u03BC\u03BF\u03C2 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=5&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u03A0\u03C1\u03CC\u03C3\u03C9\u03C0\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=41&cat=42&cat=43&cat=24&cat=25&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AE \u039F\u03B9\u03BA\u03BF\u03BD\u03BF\u03BC\u03AF\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=17&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AE \u039F\u03B9\u03BA\u03BF\u03BD\u03BF\u03BC\u03AF\u03B1 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=17&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u0395\u03C0\u03B9\u03C7\u03B5\u03B9\u03C1\u03AE\u03C3\u03B5\u03B9\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=18&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0394\u03B9\u03B5\u03B8\u03BD\u03AE\u03C2 \u039F\u03B9\u03BA\u03BF\u03BD\u03BF\u03BC\u03AF\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=19&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'Real Estate','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=21&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A0\u03B5\u03C1\u03B9\u03B2\u03AC\u03BB\u03BB\u03BF\u03BD','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=6&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03C0\u03B9\u03C3\u03C4\u03AE\u03BC\u03B7','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=7&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A4\u03B5\u03C7\u03BD\u03BF\u03BB\u03BF\u03B3\u03AF\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=47&cat=48&cat=49&cat=50&cat=51&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03C3\u03BC\u03CC\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=31&cat=32&cat=33&cat=34&cat=35&cat=36&cat=37&cat=38&cat=39&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03C3\u03BC\u03CC\u03C2 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=31&cat=32&cat=33&cat=34&cat=35&cat=36&cat=37&cat=38&cat=39&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u03A4\u03B1\u03BE\u03AF\u03B4\u03B9\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=11&cat=10&cat=12&cat=14&cat=15&cat=13&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0391\u03B8\u03BB\u03B7\u03C4\u03B9\u03C3\u03BC\u03CC\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=3&type=&edition=PRINT&author=0&fromDate=&toDate=')
]

    def get_cover_url(self):
        import time
        return 'http://s.kathimerini.gr/resources/issue-cover/%s.jpg' % time.strftime('%d-%m-%Y')

With photos (10 MB for this Sunday's edition):

Spoiler:

Code:

from calibre.web.feeds.recipes import BasicNewsRecipe

class Kathimerini(BasicNewsRecipe):
    title                  = 'Kathimerini'
    __author__             = 'jenniepet'
    description            = 'News from Greece'
    max_articles_per_feed  = 100
    oldest_article         = 1
    publisher              = 'Kathimerini'
    category               = 'news, GR'
    language               = 'el'
    encoding               = 'utf-8'
    conversion_options     = { 'linearize_tables': True}
    no_stylesheets         = True
    remove_tags_before     = dict(id='site-body')
    remove_tags_after      = [dict(id='social')]
    remove_tags            = [dict(attrs={'class':['post-tools', 'edition edition_PRINT']})]
#to remove images comment the line above and uncomment the line below
#    remove_tags            = [dict(attrs={'class':['post-tools', 'edition edition_PRINT', 'clearing-featured-img']})]

#Categories In order of Appearance: Politics-1-2-3, Greece-1-2, World-1-2, People-Specials
#Greek Economy-1-2, Business, International Economy, Real Estate
#Environment, Science, Technology, Culture-1-2, Travel, Sport
    feeds = [(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=2&type=&edition=PRINT&author=0&fromDate=&toDate='), 
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=2&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'), 
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE 3','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=2&type=&edition=PRINT&author=0&fromDate=&toDate=&page=2'), 
(u'\u0395\u03BB\u03BB\u03AC\u03B4\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=4&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03BB\u03BB\u03AC\u03B4\u03B1 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=4&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u039A\u03CC\u03C3\u03BC\u03BF\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=5&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u039A\u03CC\u03C3\u03BC\u03BF\u03C2 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=5&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u03A0\u03C1\u03CC\u03C3\u03C9\u03C0\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=41&cat=42&cat=43&cat=24&cat=25&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AE \u039F\u03B9\u03BA\u03BF\u03BD\u03BF\u03BC\u03AF\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=17&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AE \u039F\u03B9\u03BA\u03BF\u03BD\u03BF\u03BC\u03AF\u03B1 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=17&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u0395\u03C0\u03B9\u03C7\u03B5\u03B9\u03C1\u03AE\u03C3\u03B5\u03B9\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=18&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0394\u03B9\u03B5\u03B8\u03BD\u03AE\u03C2 \u039F\u03B9\u03BA\u03BF\u03BD\u03BF\u03BC\u03AF\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=19&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'Real Estate','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=21&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A0\u03B5\u03C1\u03B9\u03B2\u03AC\u03BB\u03BB\u03BF\u03BD','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=6&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0395\u03C0\u03B9\u03C3\u03C4\u03AE\u03BC\u03B7','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=7&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A4\u03B5\u03C7\u03BD\u03BF\u03BB\u03BF\u03B3\u03AF\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=47&cat=48&cat=49&cat=50&cat=51&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03C3\u03BC\u03CC\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=31&cat=32&cat=33&cat=34&cat=35&cat=36&cat=37&cat=38&cat=39&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03C3\u03BC\u03CC\u03C2 2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=31&cat=32&cat=33&cat=34&cat=35&cat=36&cat=37&cat=38&cat=39&type=&edition=PRINT&author=0&fromDate=&toDate=&page=1'),
(u'\u03A4\u03B1\u03BE\u03AF\u03B4\u03B9\u03B1','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=11&cat=10&cat=12&cat=14&cat=15&cat=13&type=&edition=PRINT&author=0&fromDate=&toDate='),
(u'\u0391\u03B8\u03BB\u03B7\u03C4\u03B9\u03C3\u03BC\u03CC\u03C2','http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p&cat=3&type=&edition=PRINT&author=0&fromDate=&toDate=')
]

    def get_cover_url(self):
        import time
        return 'http://s.kathimerini.gr/resources/issue-cover/%s.jpg' % time.strftime('%d-%m-%Y')

A few questions:
I don't know how to add the cover image. Can anyone help? The cover picture for yesterday, February 1st, 2014, was "http://s.kathimerini.gr/resources/issue-cover/01-02-2014.jpg"

I tried using only one RSS feed, but it wouldn't download more than 30 articles. So, I added &page=1 etc. to get the older articles. Is there a better way to do that?

Is there a way to include cartoons but not other images? Apparently, the only thing that differentiates them from other images is that cartoon pages include the class "article_SKETCH" in the <body> tag. I removed all images with:

Code:

    remove_tags            = [dict(attrs={'class':['clearing-featured-img']})]

kovidgoyal · 02-02-2014, 09:39 PM

Code:

def get_cover_url(self):
     import time
     return 'http://s.kathimerini.gr/resources/issue-cover/%s.jpg' %time.strftime('%d-%m-%Y')

You can concatenate rss feeds by re-implementing the parse_feeds() function in your recipe.
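
A lighter alternative to re-implementing parse_feeds(), assuming the site keeps honouring the &page=N parameter, is to generate the paged feed entries instead of typing each one out. The sketch below is only illustrative: BASE and paged_feeds are hypothetical names, not part of calibre, and the number of pages per section is a guess taken from the recipe in the first post.

```python
# Hypothetical helper, not part of calibre: builds the (title, url) tuples
# that the recipe's `feeds` list expects.
BASE = ('http://www.kathimerini.gr/rss?i=news.el.search&q=&t=0&w=&c=&s=p'
        '%s&type=&edition=PRINT&author=0&fromDate=&toDate=')


def paged_feeds(title, cats, pages=1):
    """Return (title, url) tuples for `pages` consecutive RSS pages of one section."""
    # A section may combine several category ids, e.g. &cat=41&cat=42.
    cat_query = ''.join('&cat=%d' % c for c in cats)
    result = []
    for page in range(pages):
        url = BASE % cat_query
        if page:
            # The first page has no page parameter; later pages use &page=1, &page=2, ...
            url += '&page=%d' % page
        name = title if page == 0 else u'%s %d' % (title, page + 1)
        result.append((name, url))
    return result


# Politics over three pages and Greece over two, as in the first post.
feeds = (paged_feeds(u'\u03A0\u03BF\u03BB\u03B9\u03C4\u03B9\u03BA\u03AE', [2], pages=3)
         + paged_feeds(u'\u0395\u03BB\u03BB\u03AC\u03B4\u03B1', [4], pages=2))
```

Inside the recipe, the resulting list would be assigned to the feeds attribute of the class. The sections still show up as separate feeds ("... 2", "... 3"); merging them into one section is what needs parse_feeds().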

Implement preprocess_raw_html() in your recipe and replace clearing-featured-img with 'dont-remove-me' if the page is a comic page. Then remove_tags won't affect it.
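
A minimal sketch of that idea, written as a plain function so it can be tried outside calibre. It assumes, as described in the first post, that cartoon pages carry the class "article_SKETCH" in the <body> tag; keep_cartoon_images and KEEP_CLASS are hypothetical names chosen for illustration.

```python
import re

# Hypothetical marker class; any name that is not listed in remove_tags would do.
KEEP_CLASS = 'dont-remove-me'


def keep_cartoon_images(raw_html):
    """On cartoon pages, rename the image wrapper class so remove_tags skips it."""
    # Assumption: only cartoon pages have "article_SKETCH" in the <body> tag.
    body = re.search(r'<body[^>]*>', raw_html, re.IGNORECASE)
    if body and 'article_SKETCH' in body.group(0):
        return raw_html.replace('clearing-featured-img', KEEP_CLASS)
    # Every other page is returned unchanged, so its images are still removed.
    return raw_html


# Inside the recipe class it would be wired in roughly like this:
#     def preprocess_raw_html(self, raw_html, url):
#         return keep_cartoon_images(raw_html)
```

With this in place, remove_tags can keep 'clearing-featured-img' in its class list: ordinary articles lose their images, while cartoon pages keep theirs under the renamed class.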

jennie · 02-03-2014, 03:16 AM

Hi Kovid, thanks a lot for your reply.

I still can't get the cover to work. Here's my latest code:

Spoiler:

I'm using 02 instead of %d for testing purposes, because there is no issue today.

I don't really know how to program in any language, so I'm having trouble using the rest of your advice. I don't think I want to try implementing parse_feeds at this point, but I did try to read up on preprocess_raw_html, with no tangible results yet.
I'd really appreciate it if you could give me the complete fixed code.
I guess what I need to implement is something along the lines of:

Code:

if body contains the  class "article_SKETCH" (among others)
replace 'class':['clearing-featured-img'] with 'class':['do-not-remove']

kovidgoyal · 02-03-2014, 03:21 AM

I'm afraid I don't have the time to write the code for you. For the cover, you need to use this:

Code:

    def get_cover_url(self):
       import time
       return 'http://s.kathimerini.gr/resources/issue-cover/%s.jpg' %time.strftime('%d-%m-%Y')

The indentation for get_cover_url must be correct: it must be at the same level as the other members of the class.
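
To illustrate the indentation point, here is a stripped-down sketch. The class below deliberately does not inherit from BasicNewsRecipe so that it runs outside calibre; in a real recipe the base class and the remaining attributes from the first post would stay as they are. The date format assumes the day-month-year pattern of the cover URL quoted in the first post (01-02-2014.jpg for February 1st).

```python
import time


class Kathimerini(object):  # in calibre this would subclass BasicNewsRecipe
    title = 'Kathimerini'
    language = 'el'

    # Indented one level, exactly like `title` above, so it becomes a method
    # of the class. Written at column 0 it would be a separate module-level
    # function and the recipe would never pick it up.
    def get_cover_url(self):
        # %d-%m-%Y gives day-month-year, e.g. 01-02-2014 for February 1st, 2014.
        return ('http://s.kathimerini.gr/resources/issue-cover/%s.jpg'
                % time.strftime('%d-%m-%Y'))


print(Kathimerini().get_cover_url())
```

Note that time.strftime uses the local clock of the machine running calibre, so on days without an issue the generated URL may point to a cover that does not exist.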

jennie · 02-03-2014, 04:43 AM

The problem with the cover was the indentation. Thanx!

I have updated the code in the first post with this and a couple of other tags to remove. I might check the comics situation later.
If anyone else is interested in reading this newspaper and would like to give it a try, feel free.

As it is now, the code still gives a pretty clean result, so you could add it to the repository if there are no changes in, say, a week's time.

jennie · 02-06-2014, 08:01 AM

I updated the original recipe. It now downloads news in categories.
This issue persists:

Quote:

I tried using only one RSS feed, but it wouldn't download more than 30 articles. So, I added &page=1 etc. to get the older articles. Is there a better way to do that?

However, it only affects a couple of categories in the Sunday edition, so I'll leave it like that.

papadi · 07-18-2014, 11:13 AM

Could you please tell me how this works!? I'm new here. What is a recipe? Is it something that will allow me to read my paper on my kindle? What do I need to do?

