01-15-2011, 02:50 PM | #1 |
Member
Posts: 17
Karma: 10
Join Date: Dec 2010
Device: Kindle
|
Sports Illustrated
The wonderful "Sports Illustrated" started failing for me 2 weeks ago.
It worked brilliantly before that, but now I only get 2 blank pages each time. It's been the same for the last few versions of Calibre. Is it working for anyone? Many thanks in advance for any help. |
01-15-2011, 06:54 PM | #2 |
creator of calibre
Posts: 43,778
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
It's likely that the website changed for the new year so the recipe will have to be modified. While I do not have the time to fix other people's recipes, I had a quick look and committed a partial fix. You should get something if you try it now.
Last edited by kovidgoyal; 01-15-2011 at 07:00 PM. |
01-16-2011, 12:58 PM | #3 |
Member
Posts: 17
Karma: 10
Join Date: Dec 2010
Device: Kindle
|
Thanks a million for that.
I donated a few weeks ago - great support - well worth a more regular donation. |
02-18-2011, 07:41 PM | #4 |
Member
Posts: 16
Karma: 10
Join Date: Sep 2010
Device: Kindle
|
SI fetch still not working for me with 0.7.46 ... any solution out there?
Thanks. |
02-19-2011, 07:54 PM | #5 | |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Quote:
Crappy deal - someone put in a LOT of work on this recipe... All in the name of progress I guess... |
|
02-20-2011, 09:42 PM | #6 |
Member
Posts: 16
Karma: 10
Join Date: Sep 2010
Device: Kindle
|
@sped ... thanks for info
|
03-10-2011, 09:46 AM | #7 |
Member
Posts: 17
Karma: 10
Join Date: May 2010
Device: Kindle
|
Workaround for Sports Illustrated
It turns out that the old infrastructure is still on the si.com website; it is just difficult to navigate there from the front page.
http://sportsillustrated.cnn.com/vau...1541/index.htm If you alter the recipe so that currentIssue='http://sportsillustrated.cnn.com/vault/cover/toc/11541/index.htm' you will get this issue. I just haven't been able to figure out how to fix the recipe to always get the latest issue. I assume next week I will just need to change the 11541 to 11542 manually. |
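A minimal sketch of the manual workaround described above, assuming the TOC address always has the shape shown in the post. The helper name `toc_url` and the `TOC_TEMPLATE` constant are made up for illustration, and whether the issue number really goes up by exactly one each week is the poster's guess, not something the thread confirms.

```python
# Hypothetical helper: builds the table-of-contents URL for a given issue number.
# The URL pattern is copied from the post; nothing else about the site is assumed.
TOC_TEMPLATE = 'http://sportsillustrated.cnn.com/vault/cover/toc/%d/index.htm'


def toc_url(issue_number):
    """Return the TOC page address for one issue of the magazine."""
    return TOC_TEMPLATE % issue_number


# The value to paste into the recipe as currentIssue for this week's issue.
current_issue = toc_url(11541)
print(current_issue)
```

Changing the argument is then the only weekly edit needed until the recipe can find the number by itself.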
03-10-2011, 10:22 AM | #8 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
03-11-2011, 10:45 AM | #9 | |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Quote:
http://sportsillustrated.cnn.com/vau...home/index.htm
It should always be the first link that looks like this:
<div id="ecomthumb_latest_11541"></div>
Is it possible to do a "two-step" process like this? |
|
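One way to pull the issue number out of markup like the `div` quoted above is a plain regular expression. This is only a sketch: the function name `latest_issue_numbers` is invented here, and the second `div` in the sample (11540) is a made-up value to show ordering, not something taken from the site.

```python
import re

# Matches ids of the form ecomthumb_latest_<digits> and captures the digits.
LATEST_ID = re.compile(r'ecomthumb_latest_(\d+)')


def latest_issue_numbers(html):
    """Return the issue numbers found in the page, in page order, without repeats."""
    seen = []
    for number in LATEST_ID.findall(html):
        if number not in seen:
            seen.append(number)
    return seen


# Sample markup: the first div is from the post, the second is hypothetical.
sample = ('<div id="ecomthumb_latest_11541"></div>'
          '<div id="ecomthumb_latest_11540"></div>')
numbers = latest_issue_numbers(sample)
print(numbers)
```

If the first match is always the newest issue, as suggested above, `numbers[0]` would be the one to use.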
03-11-2011, 11:24 AM | #10 |
creator of calibre
Posts: 43,778
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Yes it is, you can have as many steps as you like in parse_index.
|
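The "two-step" idea can be sketched outside calibre as follows. This is an illustration under assumptions, not the recipe's actual code: `fetch` is a stand-in for whatever downloads a page (inside a recipe that role is played by `self.index_to_soup`), `find_current_toc` is a hypothetical name, and the `siv_noArticleMessage` check is done as a plain substring test for brevity.

```python
import re

TOC_TEMPLATE = 'http://sportsillustrated.cnn.com/vault/cover/toc/%s/index.htm'


def find_current_toc(fetch, covers_url):
    """Step 1: read the covers page. Step 2: probe each candidate TOC page.

    `fetch` is any callable that takes a URL and returns the page HTML as text.
    Returns the first TOC URL that does not show the "no articles" marker,
    or None if every candidate is empty.
    """
    covers_html = fetch(covers_url)
    for number in re.findall(r'ecomthumb_latest_(\d+)', covers_html):
        toc_url = TOC_TEMPLATE % number
        if 'siv_noArticleMessage' not in fetch(toc_url):
            return toc_url
    return None


# Fake pages so the sketch runs without network access (all content hypothetical).
fake_pages = {
    'covers': '<div id="ecomthumb_latest_11542"></div>'
              '<div id="ecomthumb_latest_11541"></div>',
    TOC_TEMPLATE % '11542': '<div class="siv_noArticleMessage"></div>',
    TOC_TEMPLATE % '11541': '<div class="siv_artList"></div>',
}
chosen = find_current_toc(fake_pages.get, 'covers')
print(chosen)
```

Passing the downloader in as an argument keeps the lookup logic testable without touching the live site.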
03-11-2011, 11:28 AM | #11 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Do something like:
Code:
INDEX2 = 'http://sportsillustrated.cnn.com/vault/cover/home/index.htm'
Change
Code:
soup = self.index_to_soup(self.INDEX)
to
Code:
soup = self.index_to_soup(self.INDEX2)
Then change
Code:
cover = soup.find('div', attrs = {'alt' : 'Read All Articles', 'style' : 'vertical-align:bottom;'})
if cover:
    currentIssue = cover.parent['href']
Last edited by Starson17; 03-11-2011 at 11:42 AM. |
|
03-14-2011, 11:20 AM | #12 |
Member
Posts: 17
Karma: 10
Join Date: May 2010
Device: Kindle
|
I saw that index page with all of the covers, including the row that says Latest, but I can't figure out how to identify the most recent issue: it is identified by '11541', and that number will change every week.
Does anyone know how to change the script to point to the first cover in the third row of that page, assuming that will always be the location of the most recent issue? |
03-14-2011, 11:33 AM | #13 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
03-16-2011, 02:45 PM | #14 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
OK, I fixed the "getting the correct TOC page" issue. Interestingly enough, I was doing this right when SI was rolling out 6 different versions of the same issue for the NCAA tourney so it was kind of weird to test.
PROBLEM: The print_version is broken now. I think Clickability is doing some things to make it more difficult to pull down from their site. This might be what I've been seeing with other recipes as well. I'm going to start a new thread for that issue, but here's what I have so far. |
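For anyone who wants to debug the print version by hand, the address the recipe's `print_version` builds can be reproduced outside calibre. This is a sketch assuming the Clickability parameters used in the recipe (`fb=Y`, `partnerID=2356`) are still the relevant ones; the function name `print_url` and the article path in the example are placeholders, not real pages.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2, as used by the recipe in this thread

PRINT_BASE = 'http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'


def print_url(article_url):
    """Build the print-view address for an article, mirroring the recipe's logic."""
    return PRINT_BASE + '&fb=Y&partnerID=2356&url=' + quote(article_url)


# Placeholder article address, only to show how the URL gets escaped.
example = print_url('http://sportsillustrated.cnn.com/vault/article/example/index.htm')
print(example)
```

Opening the printed address in a browser is a quick way to see whether the print service itself is refusing the request or the recipe is passing it a bad article URL.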
03-16-2011, 02:46 PM | #15 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Updated for new logic for pulling current issue URL:
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
#from calibre.ebooks.BeautifulSoup import BeautifulSoup
from urllib import quote
import re

class SportsIllustratedRecipe(BasicNewsRecipe) :
    __author__ = 'kwetal'
    __copyright__ = 'kwetal'
    __license__ = 'GPL v3'
    language = 'en'
    description = 'Sports Illustrated'
    version = 3
    title = u'Sports Illustrated'

    no_stylesheets = True
    remove_javascript = True
    use_embedded_content = False

    INDEX = 'http://sportsillustrated.cnn.com/vault/cover/home/index.htm'

    def parse_index(self):
        answer = []
        soup = self.index_to_soup(self.INDEX)

        #Loop through all of the "latest" covers until we find one that actually has articles
        for item in soup.findAll('div', attrs={'id': re.compile("ecomthumb_latest_*")}):
            regex = re.compile('ecomthumb_latest_(\d*)')
            result = regex.search(str(item))
            current_issue_number = str(result.group(1))
            current_issue_link = 'http://sportsillustrated.cnn.com/vault/cover/toc/' + current_issue_number + '/index.htm'
            self.log('Checking this link for a TOC: ', current_issue_link)

            index = self.index_to_soup(current_issue_link)
            if index:
                if index.find('div', 'siv_noArticleMessage'):
                    self.log('No TOC for this one. Skipping...')
                else:
                    self.log('Found a TOC... Using this link')
                    break

        # Find all articles.
        list = index.find('div', attrs = {'class' : 'siv_artList'})
        if list:
            self.log ('found siv_artList')
            articles = []
            # Get all the articles ready for calibre.
            counter = 0
            for headline in list.findAll('div', attrs = {'class' : 'headline'}):
                counter = counter + 1
                title = self.tag_to_string(headline.a) + '\n' + self.tag_to_string(headline.findNextSibling('div', attrs = {'class' : 'info'}))
                url = self.INDEX + headline.a['href']
                description = self.tag_to_string(headline.findNextSibling('a').div)
                article = {'title' : title, 'date' : u'', 'url' : url, 'description' : description}
                articles.append(article)
                if counter > 5:
                    break

            # See if we can find a meaningful title
            feedTitle = 'Current Issue'
            hasTitle = index.find('div', attrs = {'class' : 'siv_imageText_head'})
            if hasTitle :
                feedTitle = self.tag_to_string(hasTitle.h1)

            answer.append([feedTitle, articles])

        return answer

    def print_version(self, url) :
        # This is the url and the parameters that work to get the print version.
        printUrl = 'http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'
        printUrl += '&fb=Y&partnerID=2356&url=' + quote(url)
        self.log('PrintURL: ' , printUrl)
        return printUrl

        # However the original javascript also uses the following parameters, but they can be left out:
        #   title : can be some random string
        #   random : some random number, but I think the number of digits is important
        #   expire : no idea what value to use
        # All this comes from the Javascript function that redirects to the print version.
        # It's called PT() and is defined in the file 48.js

    '''def preprocess_html(self, soup):
        header = soup.find('div', attrs = {'class' : 'siv_artheader'})
        homeMadeSoup = BeautifulSoup('<html><head></head><body></body></html>')
        body = homeMadeSoup.body

        # Find the date, title and byline
        temp = header.find('td', attrs = {'class' : 'title'})
        if temp :
            date = temp.find('div', attrs = {'class' : 'date'})
            if date:
                body.append(date)
            if temp.h1:
                body.append(temp.h1)
            if temp.h2 :
                body.append(temp.h2)
            byline = temp.find('div', attrs = {'class' : 'byline'})
            if byline:
                body.append(byline)

        # Find the content
        for para in soup.findAll('div', attrs = {'class' : 'siv_artpara'}) :
            body.append(para)

        return homeMadeSoup
    '''
|
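One thing that may be worth checking in the recipe above: `url = self.INDEX + headline.a['href']` now prepends the full covers-page address, because `INDEX` was changed to point at `.../vault/cover/home/index.htm`. The thread does not show what the `href` values look like, so this is only a guess, but if they are root-relative or absolute, plain concatenation would produce a malformed address. A hedged sketch of a more forgiving way to build the article URL, using the standard library's `urljoin` (the helper name `article_url` and the example paths are hypothetical):

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used by the recipe in this thread

INDEX = 'http://sportsillustrated.cnn.com/vault/cover/home/index.htm'


def article_url(href):
    """Resolve an href found on the TOC page against the site address.

    urljoin leaves absolute URLs untouched and resolves root-relative
    paths against the host, so either style of href gives a usable link.
    """
    return urljoin(INDEX, href)


# Both hrefs below are placeholders, not real article paths.
root_relative = article_url('/vault/article/example/index.htm')
absolute = article_url('http://sportsillustrated.cnn.com/vault/article/example/index.htm')
print(root_relative)
print(absolute)
```

If the article URLs passed to `print_version` turn out to be malformed, that could be one contributor to the print version failing, separate from anything Clickability changed.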