03-17-2011, 05:03 PM | #16 |
Member
Posts: 17
Karma: 10
Join Date: May 2010
Device: Kindle
|
Thanks Spedinfargo!
For some reason, when I tried yours the articles came out blank. I wrote my own, not-so-robust version that depends on the current issue being the 11th cover on the page. Not a great solution, but it works for now (and keep in mind I'm new at this). Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
#from calibre.ebooks.BeautifulSoup import BeautifulSoup
from urllib import quote

class SportsIllustratedRecipe(BasicNewsRecipe) :
    __author__ = 'kwetal'
    __copyright__ = 'kwetal'
    __license__ = 'GPL v3'
    language = 'en'
    description = 'Sports Illustrated'
    version = 3

    title = u'Sports Illustrated v2'
    no_stylesheets = True
    remove_javascript = True
    use_embedded_content = False

    INDEX = 'http://sportsillustrated.cnn.com/'
    INDEX2 = 'http://sportsillustrated.cnn.com/vault/cover/home/index.htm'

    def parse_index(self):
        answer = []
        soup = self.index_to_soup(self.INDEX2)
        # print soup

        # Find the link to the current issue on the front page. SI Cover
        cover = soup.findAll('img', attrs = {'alt' : 'Read All Articles'})
        currentIssue = 'http://sportsillustrated.cnn.com/' + cover[10].parent['href']
        if currentIssue:
            index = self.index_to_soup(currentIssue)
            self.log('\tLooking for current issue in: ' + currentIssue)

            nav = index.find('div', attrs = {'class': 'siv_trav_top'})
            if nav:
                img = nav.find('img', attrs = {'src': 'http://i.cdn.turner.com/sivault/.element/img/1.0/btn_next_v2.jpg'})
                if img:
                    parent = img.parent

            list = index.find('div', attrs = {'class' : 'siv_artList'})
            if list:
                articles = []
                for headline in list.findAll('div', attrs = {'class' : 'headline'}):
                    title = self.tag_to_string(headline.a) + '\n' + self.tag_to_string(headline.findNextSibling('div', attrs = {'class' : 'info'}))
                    url = self.INDEX + headline.a['href']
                    description = self.tag_to_string(headline.findNextSibling('a').div)
                    article = {'title' : title, 'date' : u'', 'url' : url, 'description' : description}
                    articles.append(article)

                # See if we can find a meaningful title
                feedTitle = 'Current Issue'
                hasTitle = index.find('div', attrs = {'class' : 'siv_imageText_head'})
                if hasTitle :
                    feedTitle = self.tag_to_string(hasTitle.h1)

                answer.append([feedTitle, articles])

        return answer

    def print_version(self, url) :
        # This is the url and the parameters that work to get the print version.
        printUrl = 'http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'
        printUrl += '&fb=Y&partnerID=2356&url=' + quote(url)
        return printUrl

        # However the original javascript also uses the following parameters, but they can be left out:
        # title : can be some random string
        # random : some random number, but I think the number of digits is important
        # expire : no idea what value to use
        # All this comes from the Javascript function that redirects to the print version.
        # It's called PT() and is defined in the file 48.js

    '''def preprocess_html(self, soup):
        header = soup.find('div', attrs = {'class' : 'siv_artheader'})
        homeMadeSoup = BeautifulSoup('<html><head></head><body></body></html>')
        body = homeMadeSoup.body

        # Find the date, title and byline
        temp = header.find('td', attrs = {'class' : 'title'})
        if temp :
            date = temp.find('div', attrs = {'class' : 'date'})
            if date:
                body.append(date)
            if temp.h1:
                body.append(temp.h1)
            if temp.h2 :
                body.append(temp.h2)
            byline = temp.find('div', attrs = {'class' : 'byline'})
            if byline:
                body.append(byline)

        # Find the content
        for para in soup.findAll('div', attrs = {'class' : 'siv_artpara'}) :
            body.append(para)

        return homeMadeSoup
    '''

Last edited by kovidgoyal; 03-18-2011 at 12:20 PM. |
03-18-2011, 12:17 PM | #17 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Can you repost with Code tags around your code?
Yeah, my blank articles come from the Clickability problem I mentioned earlier. It doesn't make sense that mine fails, though, because neither of us changed the print_version function... I want to try yours out so I can narrow down the problem... |
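For anyone debugging this kind of blank-article problem, the print URL can be built (and then fetched in a browser) outside calibre, to separate the Clickability step from the index parsing. A minimal sketch — the article URL here is hypothetical, but the query parameters are the same ones the recipe's print_version uses:

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2, as the recipe uses

# Hypothetical article URL, just to show the shape of the print URL
article_url = 'http://sportsillustrated.cnn.com/vault/article/magazine/MAG1000001/index.htm'

# Same base URL and parameters as the recipe's print_version
print_url = ('http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'
             '&fb=Y&partnerID=2356&url=' + quote(article_url))
print(print_url)
```

If opening print_url in a browser also gives a blank page, the problem is on the Clickability side rather than in the recipe's index parsing.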
03-18-2011, 01:22 PM | #18 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Well, it's nice to know that after all my years of hacking around with stuff like this I can still have a huge "oh crap" moment. I got INDEX and INDEX2 mixed around - I didn't realize that I still needed the original INDEX to generate the URLs of the individual articles... oops.
This will work better - and I got rid of the five-article cap I had put in for testing. Assuming no other breaking bugs (ha!), I think it makes sense to loop through all of the issues in that "latest" row... just in case a situation comes up like yesterday, where they put the cover on that page before putting a TOC in the actual issue itself. This way you should be guaranteed a full issue when the recipe runs. Thanks for posting one that worked, so I could find that stupid bug instead of giving up and blaming it on Clickability! |
03-18-2011, 01:22 PM | #19 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
#from calibre.ebooks.BeautifulSoup import BeautifulSoup
from urllib import quote
import re

class SportsIllustratedRecipe(BasicNewsRecipe) :
    __author__ = 'kwetal'
    __copyright__ = 'kwetal'
    __license__ = 'GPL v3'
    language = 'en'
    description = 'Sports Illustrated'
    version = 3

    title = u'Sports Illustrated'
    no_stylesheets = True
    remove_javascript = True
    use_embedded_content = False

    INDEX = 'http://sportsillustrated.cnn.com/'
    INDEX2 = 'http://sportsillustrated.cnn.com/vault/cover/home/index.htm'

    def parse_index(self):
        answer = []
        soup = self.index_to_soup(self.INDEX2)

        # Loop through all of the "latest" covers until we find one that actually has articles
        for item in soup.findAll('div', attrs={'id': re.compile("ecomthumb_latest_*")}):
            regex = re.compile('ecomthumb_latest_(\d*)')
            result = regex.search(str(item))
            current_issue_number = str(result.group(1))
            current_issue_link = 'http://sportsillustrated.cnn.com/vault/cover/toc/' + current_issue_number + '/index.htm'
            self.log('Checking this link for a TOC: ', current_issue_link)

            index = self.index_to_soup(current_issue_link)
            if index:
                if index.find('div', 'siv_noArticleMessage'):
                    self.log('No TOC for this one. Skipping...')
                else:
                    self.log('Found a TOC... Using this link')
                    break

        # Find all articles.
        list = index.find('div', attrs = {'class' : 'siv_artList'})
        if list:
            self.log ('found siv_artList')
            articles = []

            # Get all the articles ready for calibre.
            counter = 0
            for headline in list.findAll('div', attrs = {'class' : 'headline'}):
                counter = counter + 1
                title = self.tag_to_string(headline.a) + '\n' + self.tag_to_string(headline.findNextSibling('div', attrs = {'class' : 'info'}))
                url = self.INDEX + headline.a['href']
                description = self.tag_to_string(headline.findNextSibling('a').div)
                article = {'title' : title, 'date' : u'', 'url' : url, 'description' : description}
                articles.append(article)
                #if counter > 5:
                #    break

            # See if we can find a meaningful title
            feedTitle = 'Current Issue'
            hasTitle = index.find('div', attrs = {'class' : 'siv_imageText_head'})
            if hasTitle :
                feedTitle = self.tag_to_string(hasTitle.h1)

            answer.append([feedTitle, articles])

        return answer

    def print_version(self, url) :
        # This is the url and the parameters that work to get the print version.
        printUrl = 'http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'
        printUrl += '&fb=Y&partnerID=2356&url=' + quote(url)
        return printUrl

        # However the original javascript also uses the following parameters, but they can be left out:
        # title : can be some random string
        # random : some random number, but I think the number of digits is important
        # expire : no idea what value to use
        # All this comes from the Javascript function that redirects to the print version.
        # It's called PT() and is defined in the file 48.js

    '''def preprocess_html(self, soup):
        header = soup.find('div', attrs = {'class' : 'siv_artheader'})
        homeMadeSoup = BeautifulSoup('<html><head></head><body></body></html>')
        body = homeMadeSoup.body

        # Find the date, title and byline
        temp = header.find('td', attrs = {'class' : 'title'})
        if temp :
            date = temp.find('div', attrs = {'class' : 'date'})
            if date:
                body.append(date)
            if temp.h1:
                body.append(temp.h1)
            if temp.h2 :
                body.append(temp.h2)
            byline = temp.find('div', attrs = {'class' : 'byline'})
            if byline:
                body.append(byline)

        # Find the content
        for para in soup.findAll('div', attrs = {'class' : 'siv_artpara'}) :
            body.append(para)

        return homeMadeSoup
    ''' |
03-18-2011, 01:40 PM | #20 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
It is important for people to use code tags, but when they don't, a tip: the indents are still there, just suppressed in the display here. You can see them by quoting the message, as though you were going to reply. The indents will reappear, and you can copy them off for your recipe, then exit the reply without submitting it.
Last edited by Starson17; 03-18-2011 at 03:16 PM. |
03-18-2011, 02:44 PM | #21 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Great tip - that's even better than the "view source" trick I ended up using...
|
03-18-2011, 02:44 PM | #22 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Another quick update for testing. Added a cover image and got rid of extra junk in the articles.
Code:
See next post. Last edited by spedinfargo; 03-18-2011 at 02:48 PM. |
03-18-2011, 02:49 PM | #23 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Removed the 2-article limit (testing).
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
#from calibre.ebooks.BeautifulSoup import BeautifulSoup
from urllib import quote
import re

class SportsIllustratedRecipe(BasicNewsRecipe) :
    __author__ = 'kwetal'
    __copyright__ = 'kwetal'
    __license__ = 'GPL v3'
    language = 'en'
    description = 'Sports Illustrated'
    version = 4

    title = u'Sports Illustrated'
    no_stylesheets = True
    remove_javascript = True
    use_embedded_content = False

    preprocess_regexps = [
        (re.compile(r'<body.*<!--Article Goes Here-->', re.DOTALL|re.IGNORECASE), lambda match: '<body>'),
        (re.compile(r'<!--Article End-->.*</body>', re.DOTALL|re.IGNORECASE), lambda match: '</body>'),
    ]

    INDEX = 'http://sportsillustrated.cnn.com/'
    INDEX2 = 'http://sportsillustrated.cnn.com/vault/cover/home/index.htm'

    def parse_index(self):
        answer = []
        soup = self.index_to_soup(self.INDEX2)

        # Loop through all of the "latest" covers until we find one that actually has articles
        for item in soup.findAll('div', attrs={'id': re.compile("ecomthumb_latest_*")}):
            regex = re.compile('ecomthumb_latest_(\d*)')
            result = regex.search(str(item))
            current_issue_number = str(result.group(1))
            current_issue_link = 'http://sportsillustrated.cnn.com/vault/cover/toc/' + current_issue_number + '/index.htm'
            self.log('Checking this link for a TOC: ', current_issue_link)

            index = self.index_to_soup(current_issue_link)
            if index:
                if index.find('div', 'siv_noArticleMessage'):
                    self.log('No TOC for this one. Skipping...')
                else:
                    self.log('Found a TOC... Using this link')
                    regex = re.compile('(http://i.cdn.turner.com/sivault/si_online/covers/images.*jpg)')
                    result = regex.search(str(index))
                    if result:
                        self.log('Found Image: ', result.group(1))
                        self.cover_url = result.group(1).replace('mid', 'large')
                    break

        # Find all articles.
        list = index.find('div', attrs = {'class' : 'siv_artList'})
        if list:
            self.log ('found siv_artList')
            articles = []

            # Get all the articles ready for calibre.
            counter = 0
            for headline in list.findAll('div', attrs = {'class' : 'headline'}):
                counter = counter + 1
                title = self.tag_to_string(headline.a) + '\n' + self.tag_to_string(headline.findNextSibling('div', attrs = {'class' : 'info'}))
                url = self.INDEX + headline.a['href']
                description = self.tag_to_string(headline.findNextSibling('a').div)
                article = {'title' : title, 'date' : u'', 'url' : url, 'description' : description}
                articles.append(article)
                #uncomment for test
                #if counter > 2:
                #    break

            # See if we can find a meaningful title
            feedTitle = 'Current Issue'
            hasTitle = index.find('div', attrs = {'class' : 'siv_imageText_head'})
            if hasTitle :
                feedTitle = self.tag_to_string(hasTitle.h1)

            answer.append([feedTitle, articles])

        return answer

    def print_version(self, url) :
        # This is the url and the parameters that work to get the print version.
        printUrl = 'http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'
        printUrl += '&fb=Y&partnerID=2356&url=' + quote(url)
        return printUrl

        # However the original javascript also uses the following parameters, but they can be left out:
        # title : can be some random string
        # random : some random number, but I think the number of digits is important
        # expire : no idea what value to use
        # All this comes from the Javascript function that redirects to the print version.
        # It's called PT() and is defined in the file 48.js

    '''def preprocess_html(self, soup):
        header = soup.find('div', attrs = {'class' : 'siv_artheader'})
        homeMadeSoup = BeautifulSoup('<html><head></head><body></body></html>')
        body = homeMadeSoup.body

        # Find the date, title and byline
        temp = header.find('td', attrs = {'class' : 'title'})
        if temp :
            date = temp.find('div', attrs = {'class' : 'date'})
            if date:
                body.append(date)
            if temp.h1:
                body.append(temp.h1)
            if temp.h2 :
                body.append(temp.h2)
            byline = temp.find('div', attrs = {'class' : 'byline'})
            if byline:
                body.append(byline)

        # Find the content
        for para in soup.findAll('div', attrs = {'class' : 'siv_artpara'}) :
            body.append(para)

        return homeMadeSoup
    ''' |
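The two preprocess_regexps substitutions in that recipe (the "got rid of extra junk" part) can be sanity-checked standalone: they cut everything between <body> and the <!--Article Goes Here--> marker, and everything between <!--Article End--> and </body>. A quick check against a made-up print-page skeleton (the HTML string here is hypothetical, just to exercise the patterns):

```python
import re

# The same two substitutions used in the recipe's preprocess_regexps
preprocess_regexps = [
    (re.compile(r'<body.*<!--Article Goes Here-->', re.DOTALL | re.IGNORECASE),
     lambda match: '<body>'),
    (re.compile(r'<!--Article End-->.*</body>', re.DOTALL | re.IGNORECASE),
     lambda match: '</body>'),
]

# Hypothetical print-page skeleton, just to exercise the regexps
html = ('<html><body><div id="nav">junk</div>'
        '<!--Article Goes Here--><p>The article text.</p>'
        '<!--Article End--><div id="footer">more junk</div></body></html>')

# Apply each pattern in order, as calibre does before parsing the page
for pat, repl in preprocess_regexps:
    html = pat.sub(repl, html)
print(html)  # <html><body><p>The article text.</p></body></html>
```

Note the patterns are greedy, which is fine here because each marker appears once per page; if a page ever contained two article blocks, only the span between the first "Goes Here" and the last "End" would survive.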
04-08-2011, 04:21 AM | #24 |
Member
Posts: 17
Karma: 10
Join Date: Sep 2010
Device: Kindle
|
Great that it's working again.
Can I customise it to return more than 100 articles? |
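A likely cause of the 100-article ceiling (I haven't verified it against this exact recipe) is calibre's max_articles_per_feed recipe option, which defaults to 100. Raising it in the recipe class should lift the cap. A sketch, with a stand-in base class so the fragment runs outside calibre:

```python
# Stand-in for calibre.web.feeds.recipes.BasicNewsRecipe, so this
# fragment is runnable here; the real class defines the same option.
class BasicNewsRecipe(object):
    max_articles_per_feed = 100  # calibre's default, hence the 100-article cap

class SportsIllustratedRecipe(BasicNewsRecipe):
    # Raise the per-feed cap; in the real recipe this one line is all you add
    max_articles_per_feed = 500

print(SportsIllustratedRecipe.max_articles_per_feed)  # 500
```

In the actual recipe you would just add the max_articles_per_feed line to the existing class body, next to no_stylesheets and friends.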
11-15-2013, 11:13 AM | #25 | |
Guru
Posts: 735
Karma: 35936
Join Date: Apr 2011
Location: Shrewsury, MA
Device: Lenovo Android Tablet
|
Quote:
I use the recipe created by kwetal. It stopped working a month or more ago - it downloads successfully, except that it is stuck on the September 9, 2013 issue.
|
11-25-2013, 01:10 PM | #26 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Yep - I noticed that it is frozen in time again. I'll try and get a few minutes over Thanksgiving weekend to play around with it again... been a while ;-)
|
11-29-2013, 08:31 AM | #27 |
Guru
Posts: 735
Karma: 35936
Join Date: Apr 2011
Location: Shrewsury, MA
Device: Lenovo Android Tablet
|
|
12-06-2013, 03:42 PM | #28 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Funny - I didn't do anything... something just started working again on the SI site I guess... ?
|
12-06-2013, 03:46 PM | #29 |
Guru
Posts: 735
Karma: 35936
Join Date: Apr 2011
Location: Shrewsury, MA
Device: Lenovo Android Tablet
|
|
01-29-2014, 10:23 AM | #30 | |
Guru
Posts: 735
Karma: 35936
Join Date: Apr 2011
Location: Shrewsury, MA
Device: Lenovo Android Tablet
|
Quote:
|
|
|