03-17-2011, 05:03 PM | #16 |
Member
Posts: 17
Karma: 10
Join Date: May 2010
Device: Kindle
|
Thanks Spedinfargo!
For some reason, when I tried yours the articles came out blank. I wrote my own, not-so-robust version that depends on the current issue being the 11th cover on the page. Not a great solution, but it works for now (and keep in mind I'm new at this). Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
#from calibre.ebooks.BeautifulSoup import BeautifulSoup
from urllib import quote

class SportsIllustratedRecipe(BasicNewsRecipe) :
    __author__ = 'kwetal'
    __copyright__ = 'kwetal'
    __license__ = 'GPL v3'
    language = 'en'
    description = 'Sports Illustrated'
    version = 3

    title = u'Sports Illustrated v2'
    no_stylesheets = True
    remove_javascript = True
    use_embedded_content = False

    INDEX = 'http://sportsillustrated.cnn.com/'
    INDEX2 = 'http://sportsillustrated.cnn.com/vault/cover/home/index.htm'

    def parse_index(self):
        answer = []
        soup = self.index_to_soup(self.INDEX2)
        # print soup

        # Find the link to the current issue on the front page. SI Cover
        cover = soup.findAll('img', attrs = {'alt' : 'Read All Articles'})
        currentIssue = 'http://sportsillustrated.cnn.com/' + cover[10].parent['href']
        if currentIssue:
            index = self.index_to_soup(currentIssue)
            self.log('\tLooking for current issue in: ' + currentIssue)

            nav = index.find('div', attrs = {'class': 'siv_trav_top'})
            if nav:
                img = nav.find('img', attrs = {'src': 'http://i.cdn.turner.com/sivault/.element/img/1.0/btn_next_v2.jpg'})
                if img:
                    parent = img.parent

            list = index.find('div', attrs = {'class' : 'siv_artList'})
            if list:
                articles = []
                for headline in list.findAll('div', attrs = {'class' : 'headline'}):
                    title = self.tag_to_string(headline.a) + '\n' + self.tag_to_string(headline.findNextSibling('div', attrs = {'class' : 'info'}))
                    url = self.INDEX + headline.a['href']
                    description = self.tag_to_string(headline.findNextSibling('a').div)
                    article = {'title' : title, 'date' : u'', 'url' : url, 'description' : description}
                    articles.append(article)

                # See if we can find a meaningful title
                feedTitle = 'Current Issue'
                hasTitle = index.find('div', attrs = {'class' : 'siv_imageText_head'})
                if hasTitle :
                    feedTitle = self.tag_to_string(hasTitle.h1)

                answer.append([feedTitle, articles])

        return answer

    def print_version(self, url) :
        # This is the url and the parameters that work to get the print version.
        printUrl = 'http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'
        printUrl += '&fb=Y&partnerID=2356&url=' + quote(url)
        return printUrl

        # However the original javascript also uses the following parameters, but they can be left out:
        # title : can be some random string
        # random : some random number, but I think the number of digits is important
        # expire : no idea what value to use
        # All this comes from the Javascript function that redirects to the print version.
        # It's called PT() and is defined in the file 48.js

    '''def preprocess_html(self, soup):
        header = soup.find('div', attrs = {'class' : 'siv_artheader'})
        homeMadeSoup = BeautifulSoup('<html><head></head><body></body></html>')
        body = homeMadeSoup.body

        # Find the date, title and byline
        temp = header.find('td', attrs = {'class' : 'title'})
        if temp :
            date = temp.find('div', attrs = {'class' : 'date'})
            if date:
                body.append(date)
            if temp.h1:
                body.append(temp.h1)
            if temp.h2 :
                body.append(temp.h2)
            byline = temp.find('div', attrs = {'class' : 'byline'})
            if byline:
                body.append(byline)

        # Find the content
        for para in soup.findAll('div', attrs = {'class' : 'siv_artpara'}) :
            body.append(para)

        return homeMadeSoup
    '''

Last edited by kovidgoyal; 03-18-2011 at 12:20 PM. |
03-18-2011, 12:17 PM | #17 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Can you repost with Code tags around your code?
Yeah, my blank articles come from the Clickability problem I mentioned earlier. It doesn't make sense that mine fails, though, because neither of us changed the print_version function... I want to try yours out so I can narrow down the problem... |
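For anyone debugging this kind of blank-article problem, the print URL can be built (and then fetched in a browser) outside calibre, to separate the Clickability step from the index parsing. A minimal sketch — the article URL here is hypothetical, but the query parameters are the same ones the recipe's print_version uses:

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2, as the recipe uses

# Hypothetical article URL, just to show the shape of the print URL
article_url = 'http://sportsillustrated.cnn.com/vault/article/magazine/MAG1000001/index.htm'

# Same base URL and parameters as the recipe's print_version
print_url = ('http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'
             '&fb=Y&partnerID=2356&url=' + quote(article_url))
print(print_url)
```

If opening print_url in a browser also gives a blank page, the problem is on the Clickability side rather than in the recipe's index parsing.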
03-18-2011, 01:22 PM | #18 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Well, it's nice to know that after all my years of hacking around with stuff like this I can still have a huge "oh crap" moment. I got INDEX and INDEX2 mixed around - I didn't realize that I still needed the original INDEX to generate the URLs of the individual articles... oops.
This will work better - and I got rid of the five-article cap I had put in for testing. Assuming no other breaking bugs (ha!), I think it makes sense to loop through all of the issues in that "latest" row... just in case a situation comes up like yesterday, where they put the cover on that page before putting a TOC in the actual issue itself. This way you should be guaranteed a full issue when the recipe runs. Thanks for posting one that worked, so I could find that stupid bug instead of giving up and blaming it on Clickability! |
03-18-2011, 01:22 PM | #19 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
#from calibre.ebooks.BeautifulSoup import BeautifulSoup
from urllib import quote
import re

class SportsIllustratedRecipe(BasicNewsRecipe) :
    __author__ = 'kwetal'
    __copyright__ = 'kwetal'
    __license__ = 'GPL v3'
    language = 'en'
    description = 'Sports Illustrated'
    version = 3

    title = u'Sports Illustrated'
    no_stylesheets = True
    remove_javascript = True
    use_embedded_content = False

    INDEX = 'http://sportsillustrated.cnn.com/'
    INDEX2 = 'http://sportsillustrated.cnn.com/vault/cover/home/index.htm'

    def parse_index(self):
        answer = []
        soup = self.index_to_soup(self.INDEX2)

        # Loop through all of the "latest" covers until we find one that actually has articles
        for item in soup.findAll('div', attrs={'id': re.compile("ecomthumb_latest_*")}):
            regex = re.compile('ecomthumb_latest_(\d*)')
            result = regex.search(str(item))
            current_issue_number = str(result.group(1))
            current_issue_link = 'http://sportsillustrated.cnn.com/vault/cover/toc/' + current_issue_number + '/index.htm'
            self.log('Checking this link for a TOC: ', current_issue_link)

            index = self.index_to_soup(current_issue_link)
            if index:
                if index.find('div', 'siv_noArticleMessage'):
                    self.log('No TOC for this one. Skipping...')
                else:
                    self.log('Found a TOC... Using this link')
                    break

        # Find all articles.
        list = index.find('div', attrs = {'class' : 'siv_artList'})
        if list:
            self.log ('found siv_artList')
            articles = []

            # Get all the articles ready for calibre.
            counter = 0
            for headline in list.findAll('div', attrs = {'class' : 'headline'}):
                counter = counter + 1
                title = self.tag_to_string(headline.a) + '\n' + self.tag_to_string(headline.findNextSibling('div', attrs = {'class' : 'info'}))
                url = self.INDEX + headline.a['href']
                description = self.tag_to_string(headline.findNextSibling('a').div)
                article = {'title' : title, 'date' : u'', 'url' : url, 'description' : description}
                articles.append(article)
                #if counter > 5:
                #    break

            # See if we can find a meaningful title
            feedTitle = 'Current Issue'
            hasTitle = index.find('div', attrs = {'class' : 'siv_imageText_head'})
            if hasTitle :
                feedTitle = self.tag_to_string(hasTitle.h1)

            answer.append([feedTitle, articles])

        return answer

    def print_version(self, url) :
        # This is the url and the parameters that work to get the print version.
        printUrl = 'http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'
        printUrl += '&fb=Y&partnerID=2356&url=' + quote(url)
        return printUrl

        # However the original javascript also uses the following parameters, but they can be left out:
        # title : can be some random string
        # random : some random number, but I think the number of digits is important
        # expire : no idea what value to use
        # All this comes from the Javascript function that redirects to the print version.
        # It's called PT() and is defined in the file 48.js

    '''def preprocess_html(self, soup):
        header = soup.find('div', attrs = {'class' : 'siv_artheader'})
        homeMadeSoup = BeautifulSoup('<html><head></head><body></body></html>')
        body = homeMadeSoup.body

        # Find the date, title and byline
        temp = header.find('td', attrs = {'class' : 'title'})
        if temp :
            date = temp.find('div', attrs = {'class' : 'date'})
            if date:
                body.append(date)
            if temp.h1:
                body.append(temp.h1)
            if temp.h2 :
                body.append(temp.h2)
            byline = temp.find('div', attrs = {'class' : 'byline'})
            if byline:
                body.append(byline)

        # Find the content
        for para in soup.findAll('div', attrs = {'class' : 'siv_artpara'}) :
            body.append(para)

        return homeMadeSoup
    ''' |
03-18-2011, 01:40 PM | #20 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
It is important for people to use code tags, but when they don't, a tip: the indents are still there, just suppressed in the display here. You can see them by quoting the message, as though you were going to reply. The indents will reappear, and you can copy them off for your recipe, then exit the reply without submitting it.
Last edited by Starson17; 03-18-2011 at 03:16 PM. |
03-18-2011, 02:44 PM | #21 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Great tip - that's even better than the "view source" trick I ended up using...
|
03-18-2011, 02:44 PM | #22 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Another quick update for testing. Added a cover image and got rid of extra junk in the articles.
Code:
See next post. Last edited by spedinfargo; 03-18-2011 at 02:48 PM. |
03-18-2011, 02:49 PM | #23 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Removed the 2-article limit (testing).
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
#from calibre.ebooks.BeautifulSoup import BeautifulSoup
from urllib import quote
import re

class SportsIllustratedRecipe(BasicNewsRecipe) :
    __author__ = 'kwetal'
    __copyright__ = 'kwetal'
    __license__ = 'GPL v3'
    language = 'en'
    description = 'Sports Illustrated'
    version = 4

    title = u'Sports Illustrated'
    no_stylesheets = True
    remove_javascript = True
    use_embedded_content = False

    preprocess_regexps = [
        (re.compile(r'<body.*<!--Article Goes Here-->', re.DOTALL|re.IGNORECASE), lambda match: '<body>'),
        (re.compile(r'<!--Article End-->.*</body>', re.DOTALL|re.IGNORECASE), lambda match: '</body>'),
    ]

    INDEX = 'http://sportsillustrated.cnn.com/'
    INDEX2 = 'http://sportsillustrated.cnn.com/vault/cover/home/index.htm'

    def parse_index(self):
        answer = []
        soup = self.index_to_soup(self.INDEX2)

        # Loop through all of the "latest" covers until we find one that actually has articles
        for item in soup.findAll('div', attrs={'id': re.compile("ecomthumb_latest_*")}):
            regex = re.compile('ecomthumb_latest_(\d*)')
            result = regex.search(str(item))
            current_issue_number = str(result.group(1))
            current_issue_link = 'http://sportsillustrated.cnn.com/vault/cover/toc/' + current_issue_number + '/index.htm'
            self.log('Checking this link for a TOC: ', current_issue_link)

            index = self.index_to_soup(current_issue_link)
            if index:
                if index.find('div', 'siv_noArticleMessage'):
                    self.log('No TOC for this one. Skipping...')
                else:
                    self.log('Found a TOC... Using this link')
                    regex = re.compile('(http://i.cdn.turner.com/sivault/si_online/covers/images.*jpg)')
                    result = regex.search(str(index))
                    if result:
                        self.log('Found Image: ', result.group(1))
                        self.cover_url = result.group(1).replace('mid', 'large')
                    break

        # Find all articles.
        list = index.find('div', attrs = {'class' : 'siv_artList'})
        if list:
            self.log ('found siv_artList')
            articles = []

            # Get all the articles ready for calibre.
            counter = 0
            for headline in list.findAll('div', attrs = {'class' : 'headline'}):
                counter = counter + 1
                title = self.tag_to_string(headline.a) + '\n' + self.tag_to_string(headline.findNextSibling('div', attrs = {'class' : 'info'}))
                url = self.INDEX + headline.a['href']
                description = self.tag_to_string(headline.findNextSibling('a').div)
                article = {'title' : title, 'date' : u'', 'url' : url, 'description' : description}
                articles.append(article)
                #uncomment for test
                #if counter > 2:
                #    break

            # See if we can find a meaningful title
            feedTitle = 'Current Issue'
            hasTitle = index.find('div', attrs = {'class' : 'siv_imageText_head'})
            if hasTitle :
                feedTitle = self.tag_to_string(hasTitle.h1)

            answer.append([feedTitle, articles])

        return answer

    def print_version(self, url) :
        # This is the url and the parameters that work to get the print version.
        printUrl = 'http://si.printthis.clickability.com/pt/printThis?clickMap=printThis'
        printUrl += '&fb=Y&partnerID=2356&url=' + quote(url)
        return printUrl

        # However the original javascript also uses the following parameters, but they can be left out:
        # title : can be some random string
        # random : some random number, but I think the number of digits is important
        # expire : no idea what value to use
        # All this comes from the Javascript function that redirects to the print version.
        # It's called PT() and is defined in the file 48.js

    '''def preprocess_html(self, soup):
        header = soup.find('div', attrs = {'class' : 'siv_artheader'})
        homeMadeSoup = BeautifulSoup('<html><head></head><body></body></html>')
        body = homeMadeSoup.body

        # Find the date, title and byline
        temp = header.find('td', attrs = {'class' : 'title'})
        if temp :
            date = temp.find('div', attrs = {'class' : 'date'})
            if date:
                body.append(date)
            if temp.h1:
                body.append(temp.h1)
            if temp.h2 :
                body.append(temp.h2)
            byline = temp.find('div', attrs = {'class' : 'byline'})
            if byline:
                body.append(byline)

        # Find the content
        for para in soup.findAll('div', attrs = {'class' : 'siv_artpara'}) :
            body.append(para)

        return homeMadeSoup
    ''' |
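The two preprocess_regexps substitutions in that recipe (the "got rid of extra junk" part) can be sanity-checked standalone: they cut everything between <body> and the <!--Article Goes Here--> marker, and everything between <!--Article End--> and </body>. A quick check against a made-up print-page skeleton (the HTML string here is hypothetical, just to exercise the patterns):

```python
import re

# The same two substitutions used in the recipe's preprocess_regexps
preprocess_regexps = [
    (re.compile(r'<body.*<!--Article Goes Here-->', re.DOTALL | re.IGNORECASE),
     lambda match: '<body>'),
    (re.compile(r'<!--Article End-->.*</body>', re.DOTALL | re.IGNORECASE),
     lambda match: '</body>'),
]

# Hypothetical print-page skeleton, just to exercise the regexps
html = ('<html><body><div id="nav">junk</div>'
        '<!--Article Goes Here--><p>The article text.</p>'
        '<!--Article End--><div id="footer">more junk</div></body></html>')

# Apply each pattern in order, as calibre does before parsing the page
for pat, repl in preprocess_regexps:
    html = pat.sub(repl, html)
print(html)  # <html><body><p>The article text.</p></body></html>
```

Note the patterns are greedy, which is fine here because each marker appears once per page; if a page ever contained two article blocks, only the span between the first "Goes Here" and the last "End" would survive.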
04-08-2011, 04:21 AM | #24 |
Member
Posts: 17
Karma: 10
Join Date: Sep 2010
Device: Kindle
|
Great that it's working again.
Can I customise it to return more than 100 articles? |
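A likely cause of the 100-article ceiling (I haven't verified it against this exact recipe) is calibre's max_articles_per_feed recipe option, which defaults to 100. Raising it in the recipe class should lift the cap. A sketch, with a stand-in base class so the fragment runs outside calibre:

```python
# Stand-in for calibre.web.feeds.recipes.BasicNewsRecipe, so this
# fragment is runnable here; the real class defines the same option.
class BasicNewsRecipe(object):
    max_articles_per_feed = 100  # calibre's default, hence the 100-article cap

class SportsIllustratedRecipe(BasicNewsRecipe):
    # Raise the per-feed cap; in the real recipe this one line is all you add
    max_articles_per_feed = 500

print(SportsIllustratedRecipe.max_articles_per_feed)  # 500
```

In the actual recipe you would just add the max_articles_per_feed line to the existing class body, next to no_stylesheets and friends.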
11-15-2013, 11:13 AM | #25 | |
Guru
Posts: 735
Karma: 35936
Join Date: Apr 2011
Location: Shrewsury, MA
Device: Lenovo Android Tablet
|
Quote:
I use the recipe created by kwetal. It stopped working a month or more ago - it downloads successfully, except that it is stuck on the September 9, 2013 issue.
|
11-25-2013, 01:10 PM | #26 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Yep - I noticed that it is frozen in time again. I'll try and get a few minutes over Thanksgiving weekend to play around with it again... been a while ;-)
|
11-29-2013, 08:31 AM | #27 |
Guru
Posts: 735
Karma: 35936
Join Date: Apr 2011
Location: Shrewsury, MA
Device: Lenovo Android Tablet
|
|
12-06-2013, 03:42 PM | #28 |
Groupie
Posts: 155
Karma: 106422
Join Date: Nov 2010
Device: none
|
Funny - I didn't do anything... something just started working again on the SI site I guess... ?
|
12-06-2013, 03:46 PM | #29 |
Guru
Posts: 735
Karma: 35936
Join Date: Apr 2011
Location: Shrewsury, MA
Device: Lenovo Android Tablet
|
|
01-29-2014, 10:23 AM | #30 | |
Guru
Posts: 735
Karma: 35936
Join Date: Apr 2011
Location: Shrewsury, MA
Device: Lenovo Android Tablet
|
Quote:
|
|
|