FT Recipe 10/27

rainrdx · 10-27-2012, 09:44 AM

This is a cumulative update that includes a number of bug fixes, aesthetic changes, etc. that I have made over the past few weeks. Hopefully it helps.

Code:

__license__   = 'GPL v3'
__copyright__ = '2010-2011, Darko Miletic <darko.miletic at gmail.com>'
'''
www.ft.com/uk-edition
'''

import datetime
from calibre.ptempfile import PersistentTemporaryFile
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class FinancialTimes(BasicNewsRecipe):
    title                 = 'Financial Times (UK)'
    __author__            = 'Darko Miletic'
    description           = "The Financial Times (FT) is one of the world's leading business news and information organisations, recognised internationally for its authority, integrity and accuracy."
    publisher             = 'The Financial Times Ltd.'
    category              = 'news, finances, politics, UK, World'
    oldest_article        = 2
    language              = 'en_GB'
    max_articles_per_feed = 250
    no_stylesheets        = True
    use_embedded_content  = False
    needs_subscription    = True
    encoding              = 'utf8'
    publication_type      = 'newspaper'
    articles_are_obfuscated = True
    temp_files              = []
    masthead_url          = 'http://im.media.ft.com/m/img/masthead_main.jpg'
    LOGIN                 = 'https://registration.ft.com/registration/barrier/login'
    LOGIN2                = 'http://media.ft.com/h/subs3.html'
    INDEX                 = 'http://www.ft.com/uk-edition'
    PREFIX                = 'http://www.ft.com'

    conversion_options = {
                          'comment'          : description
                        , 'tags'             : category
                        , 'publisher'        : publisher
                        , 'language'         : language
                        , 'linearize_tables' : True
                        }

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        br.open(self.INDEX)
        br.open(self.LOGIN)
        br.select_form(name='loginForm')
        br['username'] = self.username
        br['password'] = self.password
        br.submit()
        return br

    keep_only_tags = [
                        dict(name='div', attrs={'class':['fullstory fullstoryHeader', 'ft-story-header']})
                       ,dict(name='div', attrs={'class':'standfirst'})
                       ,dict(name='div', attrs={'id'   :'storyContent'})
                       ,dict(name='div', attrs={'class':['ft-story-body','index-detail']})
                     ]
    remove_tags = [
                      dict(name='div', attrs={'id':'floating-con'})
                     ,dict(name=['meta','iframe','base','object','embed','link'])
                     ,dict(attrs={'class':['storyTools','story-package','screen-copy','story-package separator','expandable-image']})
                  ]
    remove_attributes = ['width','height','lang']

    extra_css = """
                body{font-family: Georgia,Times,"Times New Roman",serif}
                h2{font-size:large}
                .ft-story-header{font-size: x-small}
                .container{font-size:x-small;}
                h3{font-size:x-small;color:#003399;}
                .copyright{font-size: x-small}
                img{margin-top: 0.8em; display: block}
                .lastUpdated{font-family: Arial,Helvetica,sans-serif; font-size: x-small}
                .byline,.ft-story-body,.ft-story-header{font-family: Arial,Helvetica,sans-serif}
                """

    def get_artlinks(self, elem):
        articles = []
        count = 0
        for item in elem.findAll('a',href=True):
            count = count + 1
            if self.test and count > 2:
               return articles
            rawlink = item['href']
            if rawlink.startswith('http://'):
               url = rawlink
            else:
               url   = self.PREFIX + rawlink
            try:
                urlverified = self.browser.open_novisit(url).geturl() # resolve redirect.
            except Exception:
                # skip links that cannot be opened
                continue
            title = self.tag_to_string(item)
            date = strftime(self.timefmt)
            articles.append({
                              'title'      :title
                             ,'date'       :date
                             ,'url'        :urlverified
                             ,'description':''
                            })
        return articles

    def parse_index(self):
        feeds = []
        soup = self.index_to_soup(self.INDEX)
        dates = self.tag_to_string(soup.find('div', attrs={'class':'btm-links'}).find('div'))
        self.timefmt = ' [%s]'%dates
        wide = soup.find('div',attrs={'class':'wide'})
        if not wide:
           return feeds
        strest = wide.findAll('h3', attrs={'class':'section'})
        if not strest:
           return feeds
        st = wide.findAll('h4',attrs={'class':'section-no-arrow'})
        if st:
           st.extend(strest)
        count = 0
        for item in st:
            count = count + 1
            if self.test and count > 2:
               return feeds
            ftitle   = self.tag_to_string(item)
            self.report_progress(0, _('Fetching feed')+' %s...'%(ftitle))
            if item.parent.ul is not None:
                feedarts = self.get_artlinks(item.parent.ul)
                # only add the section when it actually has an article list
                feeds.append((ftitle,feedarts))
        return feeds

    def preprocess_html(self, soup):
        items = ['promo-box','promo-title',
                 'promo-headline','promo-image',
                 'promo-intro','promo-link','subhead']
        for item in items:
            for it in soup.findAll(item):
                it.name = 'div'
                it.attrs = []
        for item in soup.findAll(style=True):
            del item['style']
        for item in soup.findAll('a'):
            limg = item.find('img')
            if item.string is not None:
               text = item.string
               item.replaceWith(text)
            else:
               if limg:
                  item.name = 'div'
                  item.attrs = []
               else:
                   text = self.tag_to_string(item)
                   item.replaceWith(text)
        for item in soup.findAll('img'):
            if not item.has_key('alt'):
               item['alt'] = 'image'
        return soup

    def get_cover_url(self):
        cdate = datetime.date.today()
        if cdate.isoweekday() == 7:
           cdate -= datetime.timedelta(days=1)
        return cdate.strftime('http://specials.ft.com/vtf_pdf/%d%m%y_FRONT1_LON.pdf')

    def get_obfuscated_article(self, url):
        html = None
        # try the download up to 10 times before giving up
        for attempt in range(10):
            try:
                response = self.browser.open(url)
                html = response.read()
                break
            except Exception:
                print("Retrying download...")
        if html is None:
            raise Exception('Failed to download %s after 10 attempts' % url)
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

    def cleanup(self):
        self.browser.open('https://registration.ft.com/registration/login/logout?location=')

kiklop74 · 10-28-2012, 08:17 AM

Kovid, do not integrate this. It uses an older version of the FT recipe as a base. I will provide an updated version in the tracker.

kovidgoyal · 10-28-2012, 08:24 AM

Quote:

Originally Posted by kiklop74

Kovid, do not integrate this. It uses an older version of the FT recipe as a base. I will provide an updated version in the tracker.

Too late, but I'll update it again when you post your version.

kiklop74 · 10-29-2012, 01:28 PM

The only real changes here were these:

Quote:

try:
    urlverified = self.browser.open_novisit(url).geturl() # resolve redirect.
except:
    continue
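
For anyone following along, that first change boils down to the pattern below. This is only a standalone sketch, not the recipe code: resolve_article_urls, FakeResponse and fake_open are made-up names for illustration, and fake_open stands in for self.browser.open_novisit so the snippet can run without calibre or a network connection.

```python
def resolve_article_urls(raw_links, open_url, prefix='http://www.ft.com'):
    """Return the final URL for each link, skipping links that fail to open.

    open_url is a hypothetical stand-in for self.browser.open_novisit:
    any callable that returns an object with a geturl() method.
    """
    resolved = []
    for raw in raw_links:
        # relative links get the site prefix, as in get_artlinks
        url = raw if raw.startswith('http://') else prefix + raw
        try:
            resolved.append(open_url(url).geturl())  # resolve redirect
        except Exception:
            continue  # drop the link instead of aborting the whole feed
    return resolved


class FakeResponse(object):
    """Minimal response object exposing only geturl()."""
    def __init__(self, final_url):
        self._final_url = final_url

    def geturl(self):
        return self._final_url


def fake_open(url):
    """Pretend opener: fails on 'broken' URLs, 'redirects' the rest."""
    if 'broken' in url:
        raise IOError('cannot open %s' % url)
    return FakeResponse(url.replace('/redirect/', '/cms/'))


links = ['/redirect/a.html', '/broken.html', 'http://www.ft.com/redirect/b.html']
result = resolve_article_urls(links, fake_open)
print(result)
```

The point is that one dead link no longer stops the whole section from downloading; it is just left out of the article list.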

And here

Quote:

def cleanup(self):
    self.browser.open('https://registration.ft.com/registration/login/logout?location=')

So in future please contact me so that we can avoid these conflicts.

kovidgoyal · 10-29-2012, 01:50 PM

Do you really want to be contacted every time someone makes an update to one of your recipes? That tends to happen rather a lot. I usually review the changes and if they look ok I merge them.

kiklop74 · 10-29-2012, 03:08 PM

There is no firm law or anything for this. I guess we can leave that to the personal judgement of the users doing the improvements. I don't have a problem with the actual update, the problem is when supposedly "new" code reintroduces old bugs.

kovidgoyal · 10-29-2012, 10:27 PM

That can happen once in a while, but I think it is worth it; the alternative would mean a lot of work for both of us and much slower recipe updates.
