Old 04-17-2015, 01:18 PM   #1
zachlapidus
Junior Member
 
Posts: 2
Karma: 10
Join Date: Apr 2015
Device: Kindle Keyboard
Fixed the Wired Magazine Recipe (not daily)...kind of

Hi, first time poster. Thanks to Kovid, the community, and everyone for all this amazing work!

I've used calibre since I got my Kindle, and it's been amazing. One of my absolute favorites from around 2011, when I started, was the Wired Magazine feed -- at that time it was primarily long, detailed articles from the print edition.

I recently started using calibre again and was disappointed to see that the Wired recipe is currently broken, and appears not to have worked for quite some time. The Wired Daily Edition recipe is working, but it pulls a daily digest of the latest posts, which are mostly short news stories with the occasional longer article.

I am not really a Python programmer, but after reading a little of the API documentation I made a hacky modification to the Wired Daily script that pulls only articles with the "Magazine" tag from pages 1 and 2 here: http://wired.com/category/magazine/page/1. I'm sure someone more experienced can make a better version, but I don't think it's bad for a first go-round.
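For anyone else hacking on a recipe for the first time: the key thing I had to figure out is the shape of the data that parse_index() must return to calibre -- a list of (section title, article list) tuples, where each article is a dict with at least a title and a URL. A minimal sketch with a made-up article entry (the URL and title below are hypothetical, just to show the shape):

```python
# Minimal illustration of the structure calibre expects back from
# a recipe's parse_index(): a list of (section_title, articles) tuples.
articles = [
    {
        'title': 'Example magazine feature',            # hypothetical
        'date': 'April 2015',
        'url': 'http://www.wired.com/2015/04/example/',  # hypothetical
        'description': '',
    },
]
totalfeeds = [('Articles', articles)]

# Every entry needs at least a non-empty title and URL,
# or calibre has nothing to fetch.
for section, items in totalfeeds:
    for item in items:
        assert item['title'] and item['url']
```

Everything in the script below is just about filling that structure from the magazine category pages.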

Hope this is okay to post here.

Code:
__license__   = 'GPL v3'
__copyright__ = '2014, Darko Miletic <darko.miletic at gmail.com>'
'''
www.wired.com
'''

from calibre.web.feeds.news import BasicNewsRecipe
from calibre import strftime

class WiredDailyNews(BasicNewsRecipe):
    title                 = 'Wired Magazine, Monthly Edition'
    __author__            = 'Darko Miletic, update by Zach Lapidus'
    description           = ('Wired is a full-color monthly American magazine, published in both print '
                             'and online editions, that reports on how emerging technologies affect culture, '
                             'the economy and politics.')
    publisher             = 'Conde Nast'
    category              = 'news, IT, computers, technology'
    oldest_article        = 2
    max_articles_per_feed = 200
    no_stylesheets        = True
    encoding              = 'utf-8'
    use_embedded_content  = False
    language              = 'en'
    ignore_duplicate_articles = {'url'}
    remove_empty_feeds    = True
    publication_type      = 'newsportal'
    extra_css             = """
                            .entry-header{
                                          text-transform: uppercase;
                                          vertical-align: baseline;
                                          display: inline;
                                         }
                            ul li{display: inline}
                            """

    remove_tags = [
        dict(name=['meta','link']),
        dict(name='div', attrs={'class':'podcast_storyboard'}),
        dict(id=['sharing', 'social', 'article-tags', 'sidebar']),
                  ]
    keep_only_tags=[
        dict(attrs={'data-js':['post', 'postHeader']}),
    ]
    
    def parse_page(self, url, articles, checker):
        # Collect magazine articles from one category listing page,
        # skipping "read more" links and URLs already seen.
        soup = self.index_to_soup(url)
        majorf = soup.find('main')
        if majorf is None:
            return
        for a in majorf.findAll('a', href=True):
            href = a['href']
            if not (href.startswith('http://www.wired.com/') and href.endswith('/')):
                continue
            titleloc = a.find('h2')
            if titleloc is None:
                continue
            title = self.tag_to_string(titleloc)
            dateloc = a.find('time')
            date = self.tag_to_string(dateloc) if dateloc else ''
            if title and title.lower() != 'read more' and href not in checker:
                checker.append(href)
                articles.append({
                    'title': title,
                    'date': date,
                    'url': href,
                    'description': '',
                })

    def parse_index(self):
        articles = []
        checker = []
        # The magazine category listing is paginated; the first two
        # pages cover roughly the current issue's worth of articles.
        for page in (1, 2):
            self.parse_page(
                'http://www.wired.com/category/magazine/page/%d' % page,
                articles, checker)
        return [('Articles', articles)]

    def get_article_url(self, article):
        return article.get('guid',  None)