Old 11-07-2010, 12:26 PM   #1
Bogg
Junior Member
Bogg began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Jan 2009
Location: East Midlands, UK
Device: Sony PRS-505
Recipe for Metro UK

Has anyone out there got a recipe for the Metro (UK)?

I had a look at their site (www.metro.co.uk) and they do provide RSS feeds. When I tried a basic recipe it fetched the articles, but it spent ages processing the style sheets, and the resulting output had no photos or headlines and contained extraneous links.

Here is the basic recipe...

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1289146844(BasicNewsRecipe):
    title = u'MetroUK'
    oldest_article = 7
    max_articles_per_feed = 40

    feeds = [
        (u'News', u'http://metro.co.uk/rss/news'),
        (u'Travel', u'http://metro.co.uk/rss/travel'),
        (u'Film', u'http://metro.co.uk/rss/metrolife/film'),
        (u'TV', u'http://www.metro.co.uk/rss/tv/'),
        (u'Tech & Gadgets', u'http://www.metro.co.uk/rss/tech/'),
        (u'Weird', u'http://metro.co.uk/rss/weird'),
        (u'Sport', u'http://www.metro.co.uk/rss/sport'),
    ]

Any help with even a basic clean up would be greatly appreciated.

BossHogg.
Old 11-08-2010, 02:48 AM   #2
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
Try adding:
Code:
    keep_only_tags = [
        dict(name='h1', attrs={'':''}),
        dict(name='h2', attrs={'class':'h2'}),
        dict(name='div', attrs={'class':'art-lft'}),
    ]
Old 11-08-2010, 03:19 PM   #3
Bogg
Thanks. I tried that but I think I must have broken something else as it is now only producing a news doc that has links to where the articles came from rather than the text of the article itself.

I think it may be a bit beyond my current skill level, so I will leave it until someone who knows what they're doing can produce a recipe for this popular UK daily.

Thanks for your suggestion though.
Old 05-24-2011, 01:34 PM   #4
scissors
Addict
scissors ought to be getting tired of karma fortunes by now.
 
Posts: 206
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
Here's a recipe I've managed to codge together.
The EPUB output isn't so good, but I set the default output to LRF in Calibre and the result is much better on my PRS300. (It takes over 20 minutes to process on my laptop.)


Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306097511(BasicNewsRecipe):
    title = u'Metro UK'
    oldest_article = 1
    max_articles_per_feed = 100

    keep_only_tags = [
        dict(name='h1', attrs={'':''}),
        dict(name='h2', attrs={'class':'h2'}),
        dict(name='div', attrs={'class':'art-lft'}),
    ]

    remove_tags = [
        dict(name='div', attrs={'class': [
            'metroCommentFormWrap', 'commentForm', 'metroCommentInnerWrap',
            'art-rgt', 'pluck-app pluck-comm',
        ]}),
        dict(name='h3', attrs={'':''}),
    ]

    feeds = [
        (u'News', u'http://www.metro.co.uk/rss/news/'),
        (u'Money', u'http://www.metro.co.uk/rss/money/'),
        (u'Sport', u'http://www.metro.co.uk/rss/sport/'),
        (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'),
        (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'),
        (u'TV', u'http://www.metro.co.uk/rss/tv/'),
        (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'),
        (u'Weird News', u'http://www.metro.co.uk/rss/weird/'),
        (u'Travel', u'http://www.metro.co.uk/rss/travel/'),
        (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'),
        (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'),
        (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/'),
    ]

Last edited by scissors; 05-24-2011 at 03:36 PM.
Old 05-30-2011, 02:30 PM   #5
scissors
Recipe edited.

CSS thrown out; this cuts processing down to less than 2 minutes, and the EPUB output works.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306097511(BasicNewsRecipe):
    title = u'Metro UK'

    no_stylesheets = True
    oldest_article = 1
    max_articles_per_feed = 200

    __author__ = 'Dave Asbury'
    simultaneous_downloads = 3

    masthead_url = 'http://e-edition.metro.co.uk/images/metro_logo.gif'

    keep_only_tags = [
        dict(name='h1', attrs={'':''}),
        dict(name='h2', attrs={'class':'h2'}),
        dict(name='div', attrs={'class':'art-lft'}),
    ]
    remove_tags = [
        dict(name='div', attrs={'class': [
            'metroCommentFormWrap', 'commentForm', 'metroCommentInnerWrap',
            'art-rgt', 'pluck-app pluck-comm', 'news m12 clrd clr-l p5t', 'flt-r',
        ]}),
    ]

    feeds = [
        (u'News', u'http://www.metro.co.uk/rss/news/'),
        (u'Money', u'http://www.metro.co.uk/rss/money/'),
        (u'Sport', u'http://www.metro.co.uk/rss/sport/'),
        (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'),
        (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'),
        (u'TV', u'http://www.metro.co.uk/rss/tv/'),
        (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'),
        (u'Weird News', u'http://www.metro.co.uk/rss/weird/'),
        (u'Travel', u'http://www.metro.co.uk/rss/travel/'),
        (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'),
        (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'),
        (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/'),
    ]
Old 05-30-2011, 05:29 PM   #6
scissors
Edited again. More pictures are now grabbed from the feeds.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306097511(BasicNewsRecipe):
    title = u'Metro UK'
    remove_empty_feeds = True
    no_stylesheets = True
    oldest_article = 1
    max_articles_per_feed = 200

    __author__ = 'Dave Asbury'
    simultaneous_downloads = 3

    masthead_url = 'http://e-edition.metro.co.uk/images/metro_logo.gif'

    keep_only_tags = [
        dict(attrs={'class':['img-cnt figure']}),
        dict(attrs={'class':['art-img']}),
        dict(name='h1'),
        dict(name='h2', attrs={'class':'h2'}),
        dict(name='div', attrs={'class':'art-lft'}),
    ]
    remove_tags = [
        dict(name='div', attrs={'class': [
            'metroCommentFormWrap', 'commentForm', 'metroCommentInnerWrap',
            'art-rgt', 'pluck-app pluck-comm', 'news m12 clrd clr-l p5t', 'flt-r',
        ]}),
    ]

    feeds = [
        (u'News', u'http://www.metro.co.uk/rss/news/'),
        (u'Money', u'http://www.metro.co.uk/rss/money/'),
        (u'Sport', u'http://www.metro.co.uk/rss/sport/'),
        (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'),
        (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'),
        (u'TV', u'http://www.metro.co.uk/rss/tv/'),
        (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'),
        (u'Weird News', u'http://www.metro.co.uk/rss/weird/'),
        (u'Travel', u'http://www.metro.co.uk/rss/travel/'),
        (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'),
        (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'),
        (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/'),
    ]

Last edited by scissors; 06-19-2011 at 02:58 PM. Reason: remove_empty_feeds = True added
Old 05-30-2011, 08:12 PM   #7
hwangeruk
Junior Member
hwangeruk began at the beginning.
 
Posts: 6
Karma: 10
Join Date: May 2011
Device: kindle 3
Scissors/David, this is an excellent recipe. Metro seems nice and "clean" and easy to read, just the simple stories.
It also convinced me to DONATE to Kovid's program. Superb chaps.
Old 06-15-2011, 05:53 PM   #8
scissors
Latest edit. Takes out readers' comments and moves the second headings above the images.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306097511(BasicNewsRecipe):
    title = u'Metro UK'
    description = 'News as provided by The Metro - UK'

    __author__ = 'Dave Asbury'
    no_stylesheets = True
    oldest_article = 1
    max_articles_per_feed = 25
    remove_empty_feeds = True
    remove_javascript = True

    language = 'en_GB'

    masthead_url = 'http://e-edition.metro.co.uk/images/metro_logo.gif'

    # The CSS font shorthand expects the size before the family.
    extra_css = 'h2 {font: medium sans-serif;}'

    keep_only_tags = [
        dict(name='h1'),
        dict(name='h2', attrs={'class':'h2'}),
        dict(attrs={'class':['img-cnt figure']}),
        dict(attrs={'class':['art-img']}),
        dict(name='div', attrs={'class':'art-lft'}),
    ]
    remove_tags = [
        dict(name='div', attrs={'class': [
            'news m12 clrd clr-b p5t shareBtm', 'commentForm', 'metroCommentInnerWrap',
            'art-rgt', 'pluck-app pluck-comm', 'news m12 clrd clr-l p5t', 'flt-r',
        ]}),
        dict(attrs={'class': [
            'metroCommentFormWrap', 'commentText', 'commentsNav', 'avatar', 'submDateAndTime',
        ]}),
    ]

    feeds = [
        (u'News', u'http://www.metro.co.uk/rss/news/'),
        (u'Money', u'http://www.metro.co.uk/rss/money/'),
        (u'Sport', u'http://www.metro.co.uk/rss/sport/'),
        (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'),
        (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'),
        (u'TV', u'http://www.metro.co.uk/rss/tv/'),
        (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'),
        (u'Weird News', u'http://www.metro.co.uk/rss/weird/'),
        (u'Travel', u'http://www.metro.co.uk/rss/travel/'),
        (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'),
        (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'),
        (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/'),
    ]

Last edited by scissors; 06-19-2011 at 02:56 PM.
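
A note on why reordering keep_only_tags can move headings above images: as far as I understand it, the kept content is collected spec by spec, so the order of the list can decide the order of the output. The sketch below is only a simplified stand-in for that idea and is not calibre's own code. It uses the standard library's xml.etree.ElementTree on an invented sample page, and the helpers matches() and keep_only() are hypothetical names made up for this illustration.

```python
import xml.etree.ElementTree as ET


def matches(elem, spec):
    """Return True if elem satisfies a {'name': ..., 'attrs': {...}} spec."""
    name = spec.get('name')
    if name is not None and elem.tag != name:
        return False
    for key, wanted in spec.get('attrs', {}).items():
        # A spec value may be one string or a list of acceptable strings.
        allowed = wanted if isinstance(wanted, list) else [wanted]
        if elem.get(key) not in allowed:
            return False
    return True


def keep_only(body, specs):
    """Build a new <body> holding the matches for each spec, in spec order."""
    new_body = ET.Element('body')
    for spec in specs:
        for elem in body.iter():
            if elem is not body and matches(elem, spec):
                new_body.append(elem)
    return new_body


# Invented sample page: the image comes before the headings in the source.
page = ET.fromstring(
    '<body>'
    '<div class="nav">menu</div>'
    '<div class="art-img">photo</div>'
    '<h1>Headline</h1>'
    '<h2 class="h2">Subheading</h2>'
    '<div class="art-lft">Story text</div>'
    '</body>'
)

# Headings are listed first, so they are collected first.
specs = [
    dict(name='h1'),
    dict(name='h2', attrs={'class': 'h2'}),
    dict(attrs={'class': ['art-img']}),
    dict(name='div', attrs={'class': 'art-lft'}),
]

kept = keep_only(page, specs)
order = [(e.tag, e.get('class')) for e in kept]
print(order)
```

In this sketch the navigation div is dropped and the headings come out ahead of the image, even though the image appears first in the sample page. The real matching rules in calibre may differ in detail, so treat this as an illustration of the ordering idea only.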
Old 06-18-2011, 08:46 AM   #9
scissors
Minor Fix (but still nowhere near perfect)

Bit of extra code added - removes the word "Tweet" that appeared occasionally at the end of articles.

Code:
import re
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306097511(BasicNewsRecipe):
    title = u'Metro UK'
    description = 'News as provided by The Metro - UK'

    __author__ = 'Dave Asbury'
    no_stylesheets = True
    oldest_article = 1
    max_articles_per_feed = 25
    remove_empty_feeds = True
    remove_javascript = True

    # Remove the word "Tweet" that sometimes appears at the end of articles.
    preprocess_regexps = [(re.compile(r'Tweet'), lambda a: '')]

    language = 'en_GB'

    masthead_url = 'http://e-edition.metro.co.uk/images/metro_logo.gif'

    # The CSS font shorthand expects the size before the family.
    extra_css = 'h2 {font: medium sans-serif;}'

    keep_only_tags = [
        dict(name='h1'),
        dict(name='h2', attrs={'class':'h2'}),
        dict(attrs={'class':['img-cnt figure']}),
        dict(attrs={'class':['art-img']}),
        dict(name='div', attrs={'class':'art-lft'}),
    ]
    remove_tags = [
        dict(name='div', attrs={'class': [
            'news m12 clrd clr-b p5t shareBtm', 'commentForm', 'metroCommentInnerWrap',
            'art-rgt', 'pluck-app pluck-comm', 'news m12 clrd clr-l p5t', 'flt-r',
        ]}),
        dict(attrs={'class': [
            'metroCommentFormWrap', 'commentText', 'commentsNav', 'avatar', 'submDateAndTime',
        ]}),
    ]

    feeds = [
        (u'News', u'http://www.metro.co.uk/rss/news/'),
        (u'Money', u'http://www.metro.co.uk/rss/money/'),
        (u'Sport', u'http://www.metro.co.uk/rss/sport/'),
        (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'),
        (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'),
        (u'TV', u'http://www.metro.co.uk/rss/tv/'),
        (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'),
        (u'Weird News', u'http://www.metro.co.uk/rss/weird/'),
        (u'Travel', u'http://www.metro.co.uk/rss/travel/'),
        (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'),
        (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'),
        (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/'),
    ]

Last edited by scissors; 06-19-2011 at 02:56 PM.
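
For anyone wondering what a preprocess_regexps entry does to the page, here is a minimal sketch using only Python's re module. It assumes each entry is a (compiled pattern, function) pair that is applied to the raw HTML with pattern.sub(); the helper apply_regexps() and the sample HTML are invented for illustration and are not part of calibre.

```python
import re

# Same shape as the recipe's setting: a list of (pattern, function) pairs.
preprocess_regexps = [(re.compile(r'Tweet'), lambda match: '')]


def apply_regexps(html, rules):
    """Apply each (pattern, function) pair to the HTML in turn."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# Invented sample: a stray "Tweet" left behind by a share button.
sample = '<div class="art-lft"><p>Story text.</p>Tweet</div>'
cleaned = apply_regexps(sample, preprocess_regexps)
print(cleaned)
```

One thing to keep in mind with a pattern this short is that it matches anywhere in the page, so the word would also be stripped from the article text itself if a story happened to mention it.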
Old 06-18-2011, 10:01 PM   #10
hwangeruk
Inspiring. I'm going to get the manual out and start making my own recipes.
Thanks!
Old 10-07-2011, 02:06 PM   #11
scissors
Reduced the size of the headlines and some other minor text formatting.
(looks better on my prs300)

Code:
import re
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1306097511(BasicNewsRecipe):
    title = u'Metro UK'
    description = 'News as provided by The Metro - UK'

    __author__ = 'Dave Asbury'
    cover_url = 'http://profile.ak.fbcdn.net/hprofile-ak-snc4/276636_117118184990145_2132092232_n.jpg'

    no_stylesheets = True
    oldest_article = 1
    max_articles_per_feed = 20
    remove_empty_feeds = True
    remove_javascript = True

    # Both substitutions live in one list; assigning preprocess_regexps
    # twice would keep only the second list.
    preprocess_regexps = [
        # Start a new paragraph before each image caption.
        (re.compile(r'<span class="img-cap legend">', re.IGNORECASE | re.DOTALL),
         lambda match: '<p></p><span class="img-cap legend"> '),
        # Remove the stray word "Tweet" left at the end of some articles.
        (re.compile(r'tweet', re.IGNORECASE | re.DOTALL),
         lambda match: ''),
    ]

    language = 'en_GB'

    masthead_url = 'http://e-edition.metro.co.uk/images/metro_logo.gif'

    keep_only_tags = [
        dict(name='h1'),
        dict(name='h2', attrs={'class':'h2'}),
        dict(attrs={'class':['img-cnt figure']}),
        dict(attrs={'class':['art-img']}),
        dict(name='div', attrs={'class':'art-lft'}),
        dict(name='p'),
    ]
    remove_tags = [
        dict(name='div', attrs={'class': [
            'news m12 clrd clr-b p5t shareBtm', 'commentForm', 'metroCommentInnerWrap',
            'art-rgt', 'pluck-app pluck-comm', 'news m12 clrd clr-l p5t', 'flt-r',
        ]}),
        dict(attrs={'class': [
            'metroCommentFormWrap', 'commentText', 'commentsNav', 'avatar', 'submDateAndTime',
        ]}),
        dict(name='div', attrs={'class':'clrd art-fd fd-gr1-b'}),
    ]

    feeds = [
        (u'News', u'http://www.metro.co.uk/rss/news/'),
        (u'Money', u'http://www.metro.co.uk/rss/money/'),
        (u'Sport', u'http://www.metro.co.uk/rss/sport/'),
        (u'Film', u'http://www.metro.co.uk/rss/metrolife/film/'),
        (u'Music', u'http://www.metro.co.uk/rss/metrolife/music/'),
        (u'TV', u'http://www.metro.co.uk/rss/tv/'),
        (u'Showbiz', u'http://www.metro.co.uk/rss/showbiz/'),
        (u'Weird News', u'http://www.metro.co.uk/rss/weird/'),
        (u'Travel', u'http://www.metro.co.uk/rss/travel/'),
        (u'Lifestyle', u'http://www.metro.co.uk/rss/lifestyle/'),
        (u'Books', u'http://www.metro.co.uk/rss/lifestyle/books/'),
        (u'Food', u'http://www.metro.co.uk/rss/lifestyle/restaurants/'),
    ]

    extra_css = '''
        body {font: medium sans-serif;}
        h1 {text-align: center; font-family: Arial,Helvetica,sans-serif; font-size: 20px; font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: bold;}
        h2 {text-align: center; color: #4D4D4D; font-family: Arial,Helvetica,sans-serif; font-size: 15px; font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: bold;}
        span {font-size: 9.5px; font-weight: bold; font-style: italic;}
        p {text-align: justify; font-family: Arial,Helvetica,sans-serif; font-size: 11px; font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal;}
    '''

Last edited by scissors; 10-07-2011 at 03:15 PM. Reason: cover url added
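
A general Python point that matters for recipes with several regex rules: a class attribute assigned twice keeps only the last value, so rules meant to work together belong in a single list. The sketch below shows this with plain Python and the re module. The class names TwoAssignments and OneList, the attribute name rules, the helper apply_rules(), and the sample HTML are all invented for illustration.

```python
import re

CAPTION = re.compile(r'<span class="img-cap legend">', re.IGNORECASE)
TWEET = re.compile(r'tweet', re.IGNORECASE)


class TwoAssignments:
    # The second assignment replaces the first, so only one rule survives.
    rules = [(CAPTION, lambda match: '<p></p><span class="img-cap legend"> ')]
    rules = [(TWEET, lambda match: '')]


class OneList:
    # Both rules sit in one list, so both are applied.
    rules = [
        (CAPTION, lambda match: '<p></p><span class="img-cap legend"> '),
        (TWEET, lambda match: ''),
    ]


def apply_rules(html, rules):
    """Apply each (pattern, function) pair to the HTML in turn."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


sample = '<span class="img-cap legend">Caption</span><p>Text</p>Tweet'
print(len(TwoAssignments.rules), apply_rules(sample, TwoAssignments.rules))
print(len(OneList.rules), apply_rules(sample, OneList.rules))
```

With the single list, the caption gets its own paragraph break and the stray word is removed; with two assignments, only the second rule has any effect.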