Old 01-19-2014, 09:03 AM   #3
scissors
Addict
scissors ought to be getting tired of karma fortunes by now.
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
Hi Kovid

Thanks for that.

However, I rewrote the recipe as it was getting messy.
This is the new one, which seems a lot faster.

I have one question about the code for auto cleanup.
In the recipe I wanted the photos and the writer info to be kept rather than cleaned up.

I used the following:

auto_cleanup_keep = '//section[@class="photo"]'
#auto_cleanup_keep = '//div[@class="publish-info"]'
auto_cleanup = True

The second line is commented out because when I add it the photos disappear. Can auto_cleanup_keep only be set once per recipe?
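
One possible explanation, sketched in plain Python and not checked against calibre itself: a second assignment to the same class attribute replaces the first, so only the last auto_cleanup_keep line would take effect. Since the setting is an XPath expression, joining the two paths with the XPath union operator (|) in a single string may keep both elements. The class names below are hypothetical and only illustrate the idea.

```python
# Hypothetical stand-ins for a recipe class; only the attribute matters here.
class KeepOnlyLast(object):
    auto_cleanup_keep = '//section[@class="photo"]'
    # This second assignment overwrites the one above.
    auto_cleanup_keep = '//div[@class="publish-info"]'


class KeepBoth(object):
    # Assumption: one XPath expression, with both paths joined by "|".
    auto_cleanup_keep = '//section[@class="photo"]|//div[@class="publish-info"]'


print(KeepOnlyLast.auto_cleanup_keep)
print(KeepBoth.auto_cleanup_keep)
```

If that is the cause, the photos disappear simply because the photo path is no longer the value of auto_cleanup_keep once the second line is uncommented.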

Kind Regards
Dave


Express, new recipe
Spoiler:
Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre import browser
class AdvancedUserRecipe1390132023(BasicNewsRecipe):
    title          = u'Daily Express'
    __author__ = 'Dave Asbury'
    # 19.1.14 written due to website changes
    oldest_article = 1
    max_articles_per_feed = 10
    compress_news_images = True
    compress_news_images_max_size = 30
    ignore_duplicate_articles = {'title', 'url'}
    masthead_url = 'http://cdn.images.dailyexpress.co.uk/img/page/express_logo.png'
    auto_cleanup_keep = '//section[@class="photo"]'
    #auto_cleanup_keep = '//div[@class="publish-info"]' 
    auto_cleanup = True
    no_stylesheets        = False
    preprocess_regexps = [
        (re.compile(r'\| [\w].+?\| [\w].+?\| Daily Express', re.IGNORECASE | re.DOTALL), lambda match: ''),
    ]
    feeds          = [
        (u'UK News', u'http://www.express.co.uk/posts/rss/1/uk'),
        (u'World News', u'http://www.express.co.uk/posts/rss/78/world'),
        (u'Finance', u'http://www.express.co.uk/posts/rss/21/finance'),
        (u'Sport', u'http://www.express.co.uk/posts/rss/65/sport'),
        (u'Entertainment', u'http://www.express.co.uk/posts/rss/18/entertainment'),
        (u'Lifestyle', u'http://www.express.co.uk/posts/rss/8/life&style'),
        (u'Fun', u'http://www.express.co.uk/posts/rss/110/fun'),
    ]

    def get_cover_url(self):
        print('============Cover =================')
        soup = self.index_to_soup('http://www.express.co.uk/ourpaper/')
        # First tag whose src attribute points at the covers directory
        cov = soup.find(attrs={'src': re.compile('http://cdn.images.express.co.uk/img/covers/')})
        cov = str(cov)
        print('^^^^^^^ %s' % cov)
        # Pull the URL out of the tag's text form
        cov2 = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', cov)
        # Take the first match directly instead of slicing str(list)
        cov = cov2[0] if cov2 else ''
        print('&&&&&&&& %s ***' % cov)
        br = browser()
        br.set_handle_redirect(False)
        try:
            # Use the cover only if it can be opened without a redirect
            br.open_novisit(cov)
            cover_url = cov
        except Exception:
            # Otherwise fall back to the static header image
            cover_url = 'http://cdn.images.express.co.uk/img/static/ourpaper/header-back-issue-papers.jpg'

        return cover_url


    extra_css = '''
                    #h1{font-weight:bold;font-size:175%;}
                    h2{display: block;margin-left: auto;margin-right: auto;width:100%;font-weight:bold;font-size:175%;}
                    #p{font-size:14px;}
                    #body{font-size:14px;}
                    .photo-caption {display: block;margin-left: auto;margin-right: auto;width:100%;font-size:40%;}
                    .publish-info {font-size:50%;}
                    .photo img {display: block;margin-left: auto;margin-right: auto;width:100%;}
      '''
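
As a quick check of what the preprocess_regexps pattern in the recipe removes, here is a small standalone sketch. The sample title is made up for illustration and is not taken from the Express site.

```python
import re

# Same pattern and flags as in the recipe's preprocess_regexps
title_suffix = re.compile(r'\| [\w].+?\| [\w].+?\| Daily Express',
                          re.IGNORECASE | re.DOTALL)

# Hypothetical page title with a trailing "| section | subsection | Daily Express" suffix
sample = 'Some headline | UK | News | Daily Express'

# The recipe replaces each match with an empty string
cleaned = title_suffix.sub('', sample)
print(repr(cleaned))
```

The pattern needs at least two characters after each "| " before the next bar, so a title with only one section before "Daily Express" would be left untouched.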