MobileRead Forums > E-Book Software > Calibre > Recipes
Old 03-17-2011, 06:03 PM   #16
jsl21
Thanks Spedinfargo!

For some reason when I tried yours, the articles were blank.

I did my own, not-so-robust version that depends on the current issue being the 11th cover on the page. Not a great solution but works for now (and keep in mind I'm new at this)

Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
#from calibre.ebooks.BeautifulSoup import BeautifulSoup
from urllib import quote

class SportsIllustratedRecipe(BasicNewsRecipe) :
    __author__  = 'kwetal'
    __copyright__ = 'kwetal'
    __license__ = 'GPL v3'
    language = 'en'
    description = 'Sports Illustrated'
    version = 3
    title          = u'Sports Illustrated v2'

    no_stylesheets = True
    remove_javascript = True
    use_embedded_content   = False

    INDEX = 'http://sportsillustrated.cnn.com/'
    INDEX2 = 'http://sportsillustrated.cnn.com/vault/cover/home/index.htm'

    def parse_index(self):
        answer = []
        soup = self.index_to_soup(self.INDEX2)
        # print soup
        # Find the link to the current issue on the front page. SI Cover
        cover = soup.findAll('img', attrs = {'alt' : 'Read All Articles'})
        currentIssue = 'http://sportsillustrated.cnn.com/' + cover[10].parent['href']
        if currentIssue:
            index = self.index_to_soup(currentIssue)
            self.log('\tLooking for current issue in: ' + currentIssue)
            nav = index.find('div', attrs = {'class': 'siv_trav_top'})
            if nav:
                img = nav.find('img', attrs = {'src': 'http://i.cdn.turner.com/sivault/.element/img/1.0/btn_next_v2.jpg'})
                if img:
                    parent = img.parent  # unused; left over from navigation probing
            artList = index.find('div', attrs = {'class' : 'siv_artList'})
            if artList:
                articles = []

                for headline in artList.findAll('div', attrs = {'class' : 'headline'}):
                    title = self.tag_to_string(headline.a) + '\n' + self.tag_to_string(headline.findNextSibling('div', attrs = {'class' : 'info'}))
                    url = self.INDEX + headline.a['href']
                    description = self.tag_to_string(headline.findNextSibling('a').div)
                    article = {'title' : title, 'date' : u'', 'url' : url, 'description' : description}

                    articles.append(article)

                # See if we can find a meaningful title
                feedTitle = 'Current Issue'
                hasTitle = index.find('div', attrs = {'class' : 'siv_imageText_head'})
                if hasTitle:
                    feedTitle = self.tag_to_string(hasTitle.h1)

                answer.append([feedTitle, articles])

        return answer


    def print_version(self, url) :
        # This is the url and the parameters that work to get the print version.
        printUrl = 'http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'
        printUrl += '&fb=Y&partnerID=2356&url=' + quote(url)

        return printUrl

        # However the original javascript also uses the following parameters, but they can be left out:
        #   title : can be some random string
        #   random : some random number, but I think the number of digits is important
        #   expire : no idea what value to use
        # All this comes from the Javascript function that redirects to the print version. It's called PT() and is defined in the file 48.js

    '''def preprocess_html(self, soup):
        header = soup.find('div', attrs = {'class' : 'siv_artheader'})
        homeMadeSoup = BeautifulSoup('<html><head></head><body></body></html>')
        body = homeMadeSoup.body

        # Find the date, title and byline
        temp = header.find('td', attrs = {'class' : 'title'})
        if temp :
            date = temp.find('div', attrs = {'class' : 'date'})
            if date:
                body.append(date)
            if temp.h1:
                body.append(temp.h1)
            if temp.h2 :
                body.append(temp.h2)
            byline = temp.find('div', attrs = {'class' : 'byline'})
            if byline:
                body.append(byline)

        # Find the content
        for para in soup.findAll('div', attrs = {'class' : 'siv_artpara'}) :
            body.append(para)

        return homeMadeSoup
        '''
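One note on the print_version method above: the recipe is Python 2 code, and `quote` has since moved to `urllib.parse` in Python 3. A minimal sketch of the same print-URL construction under Python 3 (the article URL is invented for illustration):

```python
# Python 3 version of the recipe's print_version URL construction.
# quote moved from urllib to urllib.parse; everything else is unchanged.
from urllib.parse import quote

def print_version(url):
    printUrl = 'http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'
    printUrl += '&fb=Y&partnerID=2356&url=' + quote(url)
    return printUrl

# Hypothetical article URL, just to show the quoting:
print(print_version('http://sportsillustrated.cnn.com/vault/article/magazine/MAG1000001/index.htm'))
```

`quote` leaves `/` unescaped by default but encodes the `:`, which is what the Clickability endpoint appears to expect here.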

Old 03-18-2011, 01:17 PM   #17
spedinfargo
Can you repost with Code tags around your code?

Yeah, my blank articles stem from the Clickability problem I was mentioning. It doesn't make sense that mine fails, though, because neither you nor I changed the print_version function... I want to try yours out so I can narrow down the problem...
Old 03-18-2011, 02:22 PM   #18
spedinfargo
Well, it's nice to know that after all my years of hacking around with stuff like this I can still have a huge "oh crap" moment. I mixed up INDEX and INDEX2 - or rather, I didn't realize that I still needed the original INDEX to generate the URLs of the specific articles... oops.

This will work better - and I got rid of the max of 5 articles that I put in there for testing. Assuming no other breaking bugs (ha!), I think it makes sense to loop through all of the issues in that "latest" row, just in case a situation comes up like yesterday's, where they put the cover on that page before they put a TOC in the actual issue itself. This way you should be guaranteed to get a full issue when the recipe runs.

Thanks for posting one that worked so I could find that stupid bug instead of just giving up and blaming it on clickability!
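The probe-until-TOC idea described above comes down to pulling each cover thumbnail's issue number out of its div id and building the vault TOC URL from it, which is what the full recipe in the next post does. A standalone sketch of that extraction (the sample markup here is invented for illustration):

```python
import re

# Invented sample of a cover-thumbnail div from the vault "latest" row.
sample_div = '<div id="ecomthumb_latest_20110321" class="thumb">...</div>'

# Pull the numeric issue id out of the div id and build the TOC URL from it.
match = re.search(r'ecomthumb_latest_(\d+)', sample_div)
issue_number = match.group(1)
toc_link = ('http://sportsillustrated.cnn.com/vault/cover/toc/'
            + issue_number + '/index.htm')
print(toc_link)
```

If fetching that link turns up a div with class siv_noArticleMessage, the recipe skips to the next cover in the row.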
Old 03-18-2011, 02:22 PM   #19
spedinfargo
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
#from calibre.ebooks.BeautifulSoup import BeautifulSoup
from urllib import quote
import re

class SportsIllustratedRecipe(BasicNewsRecipe) :
    __author__  = 'kwetal'
    __copyright__ = 'kwetal'
    __license__ = 'GPL v3'
    language = 'en'
    description = 'Sports Illustrated'
    version = 3
    title          = u'Sports Illustrated'

    no_stylesheets = True
    remove_javascript = True
    use_embedded_content   = False

    INDEX = 'http://sportsillustrated.cnn.com/'
    INDEX2 = 'http://sportsillustrated.cnn.com/vault/cover/home/index.htm'

    def parse_index(self):
        answer = []
        soup = self.index_to_soup(self.INDEX2)

        #Loop through all of the "latest" covers until we find one that actually has articles
        for item in soup.findAll('div', attrs={'id': re.compile(r'ecomthumb_latest_')}):
            regex = re.compile(r'ecomthumb_latest_(\d+)')
            result = regex.search(str(item))
            current_issue_number = str(result.group(1))
            current_issue_link = 'http://sportsillustrated.cnn.com/vault/cover/toc/' + current_issue_number + '/index.htm'
            self.log('Checking this link for a TOC:  ', current_issue_link)

            index = self.index_to_soup(current_issue_link)
            if index:
                if index.find('div', 'siv_noArticleMessage'):
                    self.log('No TOC for this one.  Skipping...')
                else:
                    self.log('Found a TOC...  Using this link')
                    break

        # Find all articles.
        artList = index.find('div', attrs = {'class' : 'siv_artList'})
        if artList:
            self.log('found siv_artList')
            articles = []
            # Get all the articles ready for calibre.
            counter = 0
            for headline in artList.findAll('div', attrs = {'class' : 'headline'}):
                counter = counter + 1
                title = self.tag_to_string(headline.a) + '\n' + self.tag_to_string(headline.findNextSibling('div', attrs = {'class' : 'info'}))
                url = self.INDEX + headline.a['href']
                description = self.tag_to_string(headline.findNextSibling('a').div)
                article = {'title' : title, 'date' : u'', 'url' : url, 'description' : description}
                articles.append(article)
                # Uncomment to limit articles while testing:
                #if counter > 5:
                #    break

            # See if we can find a meaningful title
            feedTitle = 'Current Issue'
            hasTitle = index.find('div', attrs = {'class' : 'siv_imageText_head'})
            if hasTitle :
                feedTitle = self.tag_to_string(hasTitle.h1)

            answer.append([feedTitle, articles])

        return answer


    def print_version(self, url) :
        # This is the url and the parameters that work to get the print version.
        printUrl = 'http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'
        printUrl += '&fb=Y&partnerID=2356&url=' + quote(url)
        return printUrl

        # However the original javascript also uses the following parameters, but they can be left out:
        #   title : can be some random string
        #   random : some random number, but I think the number of digits is important
        #   expire : no idea what value to use
        # All this comes from the Javascript function that redirects to the print version. It's called PT() and is defined in the file 48.js

    '''def preprocess_html(self, soup):
        header = soup.find('div', attrs = {'class' : 'siv_artheader'})
        homeMadeSoup = BeautifulSoup('<html><head></head><body></body></html>')
        body = homeMadeSoup.body

        # Find the date, title and byline
        temp = header.find('td', attrs = {'class' : 'title'})
        if temp :
            date = temp.find('div', attrs = {'class' : 'date'})
            if date:
                body.append(date)
            if temp.h1:
                body.append(temp.h1)
            if temp.h2 :
                body.append(temp.h2)
            byline = temp.find('div', attrs = {'class' : 'byline'})
            if byline:
                body.append(byline)

        # Find the content
        for para in soup.findAll('div', attrs = {'class' : 'siv_artpara'}) :
            body.append(para)

        return homeMadeSoup
        '''
Old 03-18-2011, 02:40 PM   #20
Starson17
Quote:
Originally Posted by spedinfargo View Post
Can you repost with Code tags around your code?
It is important for people to use code tags, but when they don't, a tip: the indents are still there, just suppressed in the display here. You can see them by quoting the message, as though you were going to reply. The indents will reappear, and you can copy them off for your recipe, then exit the reply without submitting it.

Old 03-18-2011, 03:44 PM   #21
spedinfargo
Great tip - that's even better than the "view source" trick I ended up using...
Old 03-18-2011, 03:44 PM   #22
spedinfargo
Another quick update for testing. Added a cover image and got rid of extra junk in the articles.

Code:
See next post.

Old 03-18-2011, 03:49 PM   #23
spedinfargo
Removed the 2-article limit (testing).

Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
#from calibre.ebooks.BeautifulSoup import BeautifulSoup
from urllib import quote
import re

class SportsIllustratedRecipe(BasicNewsRecipe) :
    __author__  = 'kwetal'
    __copyright__ = 'kwetal'
    __license__ = 'GPL v3'
    language = 'en'
    description = 'Sports Illustrated'
    version = 4
    title          = u'Sports Illustrated'

    no_stylesheets = True
    remove_javascript = True
    use_embedded_content   = False

    preprocess_regexps = [
       (re.compile(r'<body.*<!--Article Goes Here-->', re.DOTALL|re.IGNORECASE),
        lambda match: '<body>'),

       (re.compile(r'<!--Article End-->.*</body>', re.DOTALL|re.IGNORECASE),
        lambda match: '</body>'),
       
    ]

    INDEX = 'http://sportsillustrated.cnn.com/'
    INDEX2 = 'http://sportsillustrated.cnn.com/vault/cover/home/index.htm'


    def parse_index(self):
        answer = []
        soup = self.index_to_soup(self.INDEX2)

        #Loop through all of the "latest" covers until we find one that actually has articles
        for item in soup.findAll('div', attrs={'id': re.compile(r'ecomthumb_latest_')}):
            regex = re.compile(r'ecomthumb_latest_(\d+)')
            result = regex.search(str(item))
            current_issue_number = str(result.group(1))
            current_issue_link = 'http://sportsillustrated.cnn.com/vault/cover/toc/' + current_issue_number + '/index.htm'
            self.log('Checking this link for a TOC:  ', current_issue_link)

            index = self.index_to_soup(current_issue_link)
            if index:
                if index.find('div', 'siv_noArticleMessage'):
                    self.log('No TOC for this one.  Skipping...')
                else:
                    self.log('Found a TOC...  Using this link')
                    regex = re.compile('(http://i.cdn.turner.com/sivault/si_online/covers/images.*jpg)')
                    result = regex.search(str(index))
                    if result:
                        self.log('Found Image: ', result.group(1))
                        self.cover_url = result.group(1).replace('mid', 'large')

                    break

        # Find all articles.
        artList = index.find('div', attrs = {'class' : 'siv_artList'})
        if artList:
            self.log('found siv_artList')
            articles = []
            # Get all the articles ready for calibre.
            counter = 0
            for headline in artList.findAll('div', attrs = {'class' : 'headline'}):
                counter = counter + 1
                title = self.tag_to_string(headline.a) + '\n' + self.tag_to_string(headline.findNextSibling('div', attrs = {'class' : 'info'}))
                url = self.INDEX + headline.a['href']
                description = self.tag_to_string(headline.findNextSibling('a').div)
                article = {'title' : title, 'date' : u'', 'url' : url, 'description' : description}
                articles.append(article)
                # Uncomment to limit articles while testing:
                #if counter > 2:
                #    break

            # See if we can find a meaningful title
            feedTitle = 'Current Issue'
            hasTitle = index.find('div', attrs = {'class' : 'siv_imageText_head'})
            if hasTitle :
                feedTitle = self.tag_to_string(hasTitle.h1)

            answer.append([feedTitle, articles])

        return answer


    def print_version(self, url) :
        # This is the url and the parameters that work to get the print version.
        printUrl = 'http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'
        printUrl += '&fb=Y&partnerID=2356&url=' + quote(url)
        return printUrl

        # However the original javascript also uses the following parameters, but they can be left out:
        #   title : can be some random string
        #   random : some random number, but I think the number of digits is important
        #   expire : no idea what value to use
        # All this comes from the Javascript function that redirects to the print version. It's called PT() and is defined in the file 48.js

    '''def preprocess_html(self, soup):
        header = soup.find('div', attrs = {'class' : 'siv_artheader'})
        homeMadeSoup = BeautifulSoup('<html><head></head><body></body></html>')
        body = homeMadeSoup.body

        # Find the date, title and byline
        temp = header.find('td', attrs = {'class' : 'title'})
        if temp :
            date = temp.find('div', attrs = {'class' : 'date'})
            if date:
                body.append(date)
            if temp.h1:
                body.append(temp.h1)
            if temp.h2 :
                body.append(temp.h2)
            byline = temp.find('div', attrs = {'class' : 'byline'})
            if byline:
                body.append(byline)

        # Find the content
        for para in soup.findAll('div', attrs = {'class' : 'siv_artpara'}) :
            body.append(para)

        return homeMadeSoup
        '''
Old 04-08-2011, 05:21 AM   #24
BillD
Great that it's working again.

Can I customise it to return more than 100 articles?
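If the 100-article limit here is calibre's per-feed cap, the usual knob is BasicNewsRecipe's max_articles_per_feed attribute, which defaults to 100. A sketch of overriding it (with a stub base class standing in for calibre's, so it runs outside calibre):

```python
# Stub standing in for calibre.web.feeds.recipes.BasicNewsRecipe, which
# defines max_articles_per_feed with a default of 100.
class BasicNewsRecipe(object):
    max_articles_per_feed = 100

class SportsIllustratedRecipe(BasicNewsRecipe):
    title = u'Sports Illustrated'
    max_articles_per_feed = 200  # raise the per-feed cap for this recipe

print(SportsIllustratedRecipe.max_articles_per_feed)
```

Whether this is the limit actually being hit depends on the feed; it only caps the number of articles kept per feed when the recipe runs.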
Old 11-15-2013, 12:13 PM   #25
NSILMike
Quote:
Originally Posted by Starson17 View Post
Yes.
Do something like:
Code:
INDEX2 = 'http://sportsillustrated.cnn.com/vault/cover/home/index.htm'
followed by changing
Code:
soup = self.index_to_soup(self.INDEX)
to
Code:
soup = self.index_to_soup(self.INDEX2)
in parse_index
Then change
Code:
        cover = soup.find('div', attrs = {'alt' : 'Read All Articles', 'style' : 'vertical-align:bottom;'})
        if cover:
            currentIssue = cover.parent['href']
to whatever is needed to produce the currentIssue.
I think the old problem may have cropped up again? Or a new one?
I use the recipe created by kwetal. It stopped working a month or more ago - it downloads successfully, except that it is stuck on the September 9, 2013 issue.
Old 11-25-2013, 02:10 PM   #26
spedinfargo
Yep - I noticed that it is frozen in time again. I'll try and get a few minutes over Thanksgiving weekend to play around with it again... been a while ;-)
Old 11-29-2013, 09:31 AM   #27
NSILMike
Quote:
Originally Posted by spedinfargo View Post
Yep - I noticed that it is frozen in time again. I'll try and get a few minutes over Thanksgiving weekend to play around with it again... been a while ;-)
Fixed! Many thanks, and hope your Thanksgiving was good.
Old 12-06-2013, 04:42 PM   #28
spedinfargo
Funny - I didn't do anything... something just started working again on the SI site I guess... ?
Old 12-06-2013, 04:46 PM   #29
NSILMike
Quote:
Originally Posted by spedinfargo View Post
Funny - I didn't do anything... something just started working again on the SI site I guess... ?
Shhh..... I won't tell anyone... you can take credit!
Old 01-29-2014, 11:23 AM   #30
NSILMike
Quote:
Originally Posted by NSILMike View Post
I think the old problem may have cropped up again? Or a new one?
I use the recipe created by kwetal. It stopped working a month or more ago - it downloads successfully, except that it is stuck on the September 9, 2013 issue.
It worked for a while, now it is stuck at December 23rd...