Recipe for Metro UK


Bogg
11-07-2010, 11:26 AM
Has anyone out there got a recipe for the Metro (UK)?

I had a look at their site (www.metro.co.uk) and they do provide RSS feeds, but when I tried a basic recipe it fetched the articles, spent ages processing the style sheets, and the resulting output had no photos or headlines and contained extraneous links.

Here is the basic recipe...

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1289146844(BasicNewsRecipe):
    title = u'MetroUK'
    oldest_article = 7  # days
    max_articles_per_feed = 40

    feeds = [(u'News', u'http://metro.co.uk/rss/news'), (u'Travel', u'http://metro.co.uk/rss/travel'), (u'Film', u'http://metro.co.uk/rss/metrolife/film'), (u'TV', u'http://www.metro.co.uk/rss/tv/'), (u'Tech & Gadgets', u'http://www.metro.co.uk/rss/tech/'), (u'Weird', u'http://metro.co.uk/rss/weird'), (u'Sport', u'http://www.metro.co.uk/rss/sport')]

Any help with even a basic clean up would be greatly appreciated.

BossHogg.

marbs
11-08-2010, 01:48 AM
Try adding
    keep_only_tags = [
        dict(name='h1'),
        dict(name='h2', attrs={'class':'h2'}),
        dict(name='div', attrs={'class':'art-lft'}),
    ]
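
A keep_only_tags entry that matches nothing tends to leave the articles empty, so it may be worth checking the specs for misspelled or empty attribute names before running a download. A minimal sketch in plain Python (no calibre needed); lint_tag_specs and KNOWN_ATTRS are hypothetical names made up for this illustration, not part of calibre, and the list of known attributes is only an assumption:

```python
# Hypothetical helper, not part of calibre: flags suspicious attribute
# names in keep_only_tags / remove_tags style specs.
KNOWN_ATTRS = {"class", "id", "name", "style", "href", "src"}  # assumed short list


def lint_tag_specs(specs):
    """Return a list of warnings for specs whose attrs look wrong."""
    warnings = []
    for index, spec in enumerate(specs):
        for key in spec.get("attrs", {}):
            if key == "":
                warnings.append("spec %d: empty attribute name" % index)
            elif key not in KNOWN_ATTRS:
                warnings.append("spec %d: unknown attribute %r" % (index, key))
    return warnings


# Example specs, two of them deliberately faulty.
specs = [
    dict(name="h1", attrs={"": ""}),
    dict(name="h2", attrs={"calss": "h2"}),
    dict(name="div", attrs={"class": "art-lft"}),
]
for warning in lint_tag_specs(specs):
    print(warning)
```

A clean list of specs gives no warnings, so the check can simply be run on whatever is about to go into the recipe.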

Bogg
11-08-2010, 02:19 PM
Thanks. I tried that but I think I must have broken something else as it is now only producing a news doc that has links to where the articles came from rather than the text of the article itself.

I think it may be a bit beyond my current skill level, so I will leave it until someone who knows what they're doing can produce a recipe for this popular UK daily.

Thanks for your suggestion though.

scissors
05-24-2011, 12:34 PM
Here's a recipe I've managed to cobble together.
The EPUB output isn't so good, but I set the default output to LRF in Calibre and the result is much better on my PRS300. (It takes 20+ minutes to process on my laptop.)


from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306097511(BasicNewsRecipe):
    title = u'Metro UK'
    oldest_article = 1
    max_articles_per_feed = 100

    # Keep only the headline, sub-heading and article body
    keep_only_tags = [
        dict(name='h1'),
        dict(name='h2', attrs={'class':'h2'}),
        dict(name='div', attrs={'class':'art-lft'})
    ]

    # Strip comment forms, the right-hand column and h3 headings
    remove_tags = [
        dict(name='div', attrs={'class':['metroCommentFormWrap',
            'commentForm', 'metroCommentInnerWrap',
            'art-rgt', 'pluck-app pluck-comm']}),
        dict(name='h3')
    ]

    feeds = [
(u'News', u'http://www.metro.co.uk/rss/news/'), (u'Money', u'http://www.metro.co.uk/rss/money/'), (u'Sport', u'http://www.metro.co.uk/rss/sport/'), (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'), (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'), (u'TV', u'http://www.metro.co.uk/rss/tv/'), (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'), (u'Weird News', u'http://www.metro.co.uk/rss/weird/'), (u'Travel', u'http://www.metro.co.uk/rss/travel/'), (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'), (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'), (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/')]
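
All of the feed URLs above share the prefix http://www.metro.co.uk/rss/, so the list could also be generated from the section paths. A small sketch in plain Python; BASE, SECTIONS and build_feeds are names invented for this example, and the resulting list would still have to be assigned to feeds inside the recipe class:

```python
# Common prefix of every feed URL in the recipe above.
BASE = "http://www.metro.co.uk/rss/"

# (feed title, path under /rss/) pairs copied from the recipe.
SECTIONS = [
    ("News", "news"),
    ("Money", "money"),
    ("Sport", "sport"),
    ("Film", "metrolife/film"),
    ("Music", "metrolife/music"),
    ("TV", "tv"),
    ("Showbiz", "showbiz"),
    ("Weird News", "weird"),
    ("Travel", "travel"),
    ("Lifestyle", "lifestyle"),
    ("Books", "lifestyle/books"),
    ("Food", "lifestyle/restaurants"),
]


def build_feeds(sections, base=BASE):
    """Turn (title, path) pairs into the (title, url) tuples a recipe expects."""
    return [(title, base + path + "/") for title, path in sections]


feeds = build_feeds(SECTIONS)
for title, url in feeds:
    print(title, url)
```

This only saves typing; adding or dropping a section becomes a one-line change.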

scissors
05-30-2011, 01:30 PM
Recipe edited.

CSS thrown out: this cuts processing down to less than 2 minutes, and the EPUB output now works.

=====================================

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306097511(BasicNewsRecipe):
    title = u'Metro UK'

    no_stylesheets = True  # do not download the site's style sheets
    oldest_article = 1
    max_articles_per_feed = 200

    __author__ = 'Dave Asbury'
    simultaneous_downloads = 3

    masthead_url = 'http://e-edition.metro.co.uk/images/metro_logo.gif'

    keep_only_tags = [
        dict(name='h1'),
        dict(name='h2', attrs={'class':'h2'}),
        dict(name='div', attrs={'class':'art-lft'})
    ]
    remove_tags = [
        dict(name='div', attrs={'class':['metroCommentFormWrap',
            'commentForm', 'metroCommentInnerWrap',
            'art-rgt', 'pluck-app pluck-comm', 'news m12 clrd clr-l p5t', 'flt-r']})
    ]

    feeds = [
(u'News', u'http://www.metro.co.uk/rss/news/'), (u'Money', u'http://www.metro.co.uk/rss/money/'), (u'Sport', u'http://www.metro.co.uk/rss/sport/'), (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'), (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'), (u'TV', u'http://www.metro.co.uk/rss/tv/'), (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'), (u'Weird News', u'http://www.metro.co.uk/rss/weird/'), (u'Travel', u'http://www.metro.co.uk/rss/travel/'), (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'), (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'), (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/')]

scissors
05-30-2011, 04:29 PM
Edited again...:smack: More pictures grabbed from feeds...

======================================


from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306097511(BasicNewsRecipe):
    title = u'Metro UK'
    remove_empty_feeds = True
    no_stylesheets = True
    oldest_article = 1
    max_articles_per_feed = 200

    __author__ = 'Dave Asbury'
    simultaneous_downloads = 3

    masthead_url = 'http://e-edition.metro.co.uk/images/metro_logo.gif'

    # The two image classes are what pull in the extra pictures
    keep_only_tags = [
        dict(attrs={'class':['img-cnt figure']}),
        dict(attrs={'class':['art-img']}),
        dict(name='h1'),
        dict(name='h2', attrs={'class':'h2'}),
        dict(name='div', attrs={'class':'art-lft'})
    ]
    remove_tags = [
        dict(name='div', attrs={'class':['metroCommentFormWrap',
            'commentForm', 'metroCommentInnerWrap',
            'art-rgt', 'pluck-app pluck-comm', 'news m12 clrd clr-l p5t', 'flt-r']})
    ]

    feeds = [
(u'News', u'http://www.metro.co.uk/rss/news/'), (u'Money', u'http://www.metro.co.uk/rss/money/'), (u'Sport', u'http://www.metro.co.uk/rss/sport/'), (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'), (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'), (u'TV', u'http://www.metro.co.uk/rss/tv/'), (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'), (u'Weird News', u'http://www.metro.co.uk/rss/weird/'), (u'Travel', u'http://www.metro.co.uk/rss/travel/'), (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'), (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'), (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/')]

hwangeruk
05-30-2011, 07:12 PM
Scissors/David, this is an excellent recipe. Metro seems nice and "clean" and easy to read, just the simple stories.
It also convinced me to DONATE to Kovid's program. Superb chaps.

scissors
06-15-2011, 04:53 PM
Latest edit. Takes out readers' comments and moves the second headings above the images.




from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306097511(BasicNewsRecipe):
    title = u'Metro UK'
    description = 'News as provided by The Metro - UK'

    __author__ = 'Dave Asbury'
    no_stylesheets = True
    oldest_article = 1
    max_articles_per_feed = 25
    remove_empty_feeds = True
    remove_javascript = True

    language = 'en_GB'

    masthead_url = 'http://e-edition.metro.co.uk/images/metro_logo.gif'

    extra_css = 'h2 {font: medium sans-serif;}'

    # Headings are listed before the image classes
    keep_only_tags = [
        dict(name='h1'), dict(name='h2', attrs={'class':'h2'}),
        dict(attrs={'class':['img-cnt figure']}),
        dict(attrs={'class':['art-img']}),
        dict(name='div', attrs={'class':'art-lft'})
    ]
    # The second entry strips the readers' comments
    remove_tags = [
        dict(name='div', attrs={'class':['news m12 clrd clr-b p5t shareBtm', 'commentForm', 'metroCommentInnerWrap',
            'art-rgt', 'pluck-app pluck-comm', 'news m12 clrd clr-l p5t', 'flt-r']}),
        dict(attrs={'class':['metroCommentFormWrap', 'commentText', 'commentsNav', 'avatar', 'submDateAndTime']})
    ]
    feeds = [
(u'News', u'http://www.metro.co.uk/rss/news/'), (u'Money', u'http://www.metro.co.uk/rss/money/'), (u'Sport', u'http://www.metro.co.uk/rss/sport/'), (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'), (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'), (u'TV', u'http://www.metro.co.uk/rss/tv/'), (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'), (u'Weird News', u'http://www.metro.co.uk/rss/weird/'), (u'Travel', u'http://www.metro.co.uk/rss/travel/'), (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'), (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'), (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/')]
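
Moving the headings above the images appears to work because the kept elements come out in the order the keep_only_tags entries are listed, not in page order. The toy model below is only a sketch of that idea in plain Python, assuming that behaviour; it uses flat dicts instead of real HTML, ignores nesting, and matches and keep_only are made-up names rather than calibre functions:

```python
def matches(element, spec):
    """True if a toy element (a dict) satisfies a name/attrs spec."""
    name = spec.get("name")
    if name is not None and element["name"] != name:
        return False
    for key, wanted in spec.get("attrs", {}).items():
        allowed = wanted if isinstance(wanted, list) else [wanted]
        if element.get(key) not in allowed:
            return False
    return True


def keep_only(elements, specs):
    """Collect matching elements spec by spec, so spec order wins."""
    kept = []
    for spec in specs:
        for element in elements:
            if matches(element, spec) and element not in kept:
                kept.append(element)
    return kept


# Toy page in source order: image first, then the headings, then the text.
page = [
    {"name": "div", "class": "art-img"},
    {"name": "h1"},
    {"name": "h2", "class": "h2"},
    {"name": "div", "class": "art-lft"},
]
specs = [
    dict(name="h1"),
    dict(name="h2", attrs={"class": "h2"}),
    dict(attrs={"class": ["art-img"]}),
    dict(name="div", attrs={"class": "art-lft"}),
]
order = [e.get("class", e["name"]) for e in keep_only(page, specs)]
print(order)  # ['h1', 'h2', 'art-img', 'art-lft']
```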

scissors
06-18-2011, 07:46 AM
Bit of extra code added - removes the word "Tweet" that appeared occasionally at the end of articles.




import re
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306097511(BasicNewsRecipe):
    title = u'Metro UK'
    description = 'News as provided by The Metro - UK'

    __author__ = 'Dave Asbury'
    no_stylesheets = True
    oldest_article = 1
    max_articles_per_feed = 25
    remove_empty_feeds = True
    remove_javascript = True

    # Delete the word "Tweet" from the downloaded HTML
    preprocess_regexps = [(re.compile(r'Tweet'), lambda a: '')]

    language = 'en_GB'

    masthead_url = 'http://e-edition.metro.co.uk/images/metro_logo.gif'

    extra_css = 'h2 {font: medium sans-serif;}'
    keep_only_tags = [
        dict(name='h1'), dict(name='h2', attrs={'class':'h2'}),
        dict(attrs={'class':['img-cnt figure']}),
        dict(attrs={'class':['art-img']}),
        dict(name='div', attrs={'class':'art-lft'})
    ]
    remove_tags = [
        dict(name='div', attrs={'class':['news m12 clrd clr-b p5t shareBtm', 'commentForm', 'metroCommentInnerWrap',
            'art-rgt', 'pluck-app pluck-comm', 'news m12 clrd clr-l p5t', 'flt-r']}),
        dict(attrs={'class':['metroCommentFormWrap', 'commentText', 'commentsNav', 'avatar', 'submDateAndTime']})
    ]
    feeds = [
(u'News', u'http://www.metro.co.uk/rss/news/'), (u'Money', u'http://www.metro.co.uk/rss/money/'), (u'Sport', u'http://www.metro.co.uk/rss/sport/'), (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'), (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'), (u'TV', u'http://www.metro.co.uk/rss/tv/'), (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'), (u'Weird News', u'http://www.metro.co.uk/rss/weird/'), (u'Travel', u'http://www.metro.co.uk/rss/travel/'), (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'), (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'), (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/')]
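
Each preprocess_regexps entry is a (compiled pattern, replacement function) pair that gets run over the page HTML. The sketch below mimics that in plain Python to show what the "Tweet" rule does to a sample string. apply_regexps is a stand-in written for this example, not calibre's own code, and the word-boundary variant is only a suggestion for anyone who finds the plain pattern also clips words such as "Tweeted":

```python
import re


def apply_regexps(html, rules):
    """Apply (compiled pattern, replacement function) pairs in order."""
    for pattern, replace in rules:
        html = pattern.sub(replace, html)
    return html


# The rule as used in the recipe: removes every occurrence of "Tweet".
loose = [(re.compile(r"Tweet"), lambda match: "")]
# Suggested variant: only removes "Tweet" when it stands alone as a word.
strict = [(re.compile(r"\bTweet\b"), lambda match: "")]

sample = "<p>She Tweeted the news.</p><p>Tweet</p>"
print(apply_regexps(sample, loose))   # <p>She ed the news.</p><p></p>
print(apply_regexps(sample, strict))  # <p>She Tweeted the news.</p><p></p>
```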

hwangeruk
06-18-2011, 09:01 PM
Inspiring. I'm going to get the manual out and start making my own recipes :)
Thanks!

scissors
10-07-2011, 01:06 PM
Reduced the size of the headlines and made some other minor text formatting changes.
(Looks better on my PRS300.)


import re
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306097511(BasicNewsRecipe):
    title = u'Metro UK'
    description = 'News as provided by The Metro - UK'

    __author__ = 'Dave Asbury'
    cover_url = 'http://profile.ak.fbcdn.net/hprofile-ak-snc4/276636_117118184990145_2132092232_n.jpg'

    no_stylesheets = True
    oldest_article = 1
    max_articles_per_feed = 20
    remove_empty_feeds = True
    remove_javascript = True

    # Both rules go in a single list so that neither assignment is lost
    preprocess_regexps = [
        # Put an empty paragraph in front of each image caption
        (re.compile(r'<span class="img-cap legend">', re.IGNORECASE | re.DOTALL), lambda match: '<p></p><span class="img-cap legend"> '),
        # Remove the stray word "tweet" (any case)
        (re.compile(r'tweet', re.IGNORECASE | re.DOTALL), lambda match: '')
    ]

    language = 'en_GB'

    masthead_url = 'http://e-edition.metro.co.uk/images/metro_logo.gif'

    keep_only_tags = [
        dict(name='h1'), dict(name='h2', attrs={'class':'h2'}),
        dict(attrs={'class':['img-cnt figure']}),
        dict(attrs={'class':['art-img']}),
        dict(name='div', attrs={'class':'art-lft'}),
        dict(name='p')
    ]
    remove_tags = [
        dict(name='div', attrs={'class':['news m12 clrd clr-b p5t shareBtm', 'commentForm', 'metroCommentInnerWrap',
            'art-rgt', 'pluck-app pluck-comm', 'news m12 clrd clr-l p5t', 'flt-r']}),
        dict(attrs={'class':['metroCommentFormWrap', 'commentText', 'commentsNav', 'avatar', 'submDateAndTime']}),
        dict(name='div', attrs={'class':'clrd art-fd fd-gr1-b'})
    ]
    feeds = [
        (u'News', u'http://www.metro.co.uk/rss/news/'),
        (u'Money', u'http://www.metro.co.uk/rss/money/'),
        (u'Sport', u'http://www.metro.co.uk/rss/sport/'),
        (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'),
        (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'),
        (u'TV', u'http://www.metro.co.uk/rss/tv/'),
        (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'),
        (u'Weird News', u'http://www.metro.co.uk/rss/weird/'),
        (u'Travel', u'http://www.metro.co.uk/rss/travel/'),
        (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'),
        (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'),
        (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/')
    ]

    extra_css = '''
        body {font: medium sans-serif;}
        h1 {text-align: center; font-family: Arial,Helvetica,sans-serif; font-size: 20px; font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: bold;}
        h2 {text-align: center; color: #4D4D4D; font-family: Arial,Helvetica,sans-serif; font-size: 15px; font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: bold;}
        span {font-size: 9.5px; font-weight: bold; font-style: italic;}
        p {text-align: justify; font-family: Arial,Helvetica,sans-serif; font-size: 11px; font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal;}
    '''