11-01-2008, 10:17 AM | #1 |
Zealot
Posts: 100
Karma: 18
Join Date: Oct 2006
Location: N.J.
Device: Sony Readers PRS-500 exchanged by Sony for PRS-600, PRS-505,IPAD3,mini
|
Calibre-NY Times problem
Please take a look at the 11/1/08 edition of the NY Times. It downloads OK onto the reader and shows the titles of all the newspaper's articles. However, when you open a title to read the article, it is blank. Thanks for any assistance. |
11-01-2008, 11:16 AM | #2 |
creator of calibre
Posts: 44,356
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
What is your output format? EPUB or LRF?
|
11-01-2008, 03:55 PM | #3 |
Zealot
Posts: 100
Karma: 18
Join Date: Oct 2006
Location: N.J.
Device: Sony Readers PRS-500 exchanged by Sony for PRS-600, PRS-505,IPAD3,mini
|
I am using LRF.
|
11-01-2008, 04:44 PM | #4 |
creator of calibre
Posts: 44,356
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The Sunday NY Times doesn't work with LRF. Try using EPUB.
|
11-01-2008, 10:12 PM | #5 |
Member
Posts: 19
Karma: 10
Join Date: Oct 2008
Device: Sony PRS-505
|
I have the same problem with today's (Saturday) NY Times. A good portion of the articles show up with only an ad when viewed using the ebook viewer in Calibre. Some of the articles are fine, but most are not. These affected articles show up blank on the reader itself.
|
11-01-2008, 10:23 PM | #6 |
Zealot
Posts: 100
Karma: 18
Join Date: Oct 2006
Location: N.J.
Device: Sony Readers PRS-500 exchanged by Sony for PRS-600, PRS-505,IPAD3,mini
|
I have a Sony PRS-500 reader, and it is my understanding that EPUB will not work on the 500.
If that is true, am I going to be unable to read the NY Times anymore? |
11-01-2008, 10:30 PM | #7 |
creator of calibre
Posts: 44,356
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I just had a look. Looks like the format of the website has changed. Will be fixed in the next release.
|
11-01-2008, 10:55 PM | #8 |
creator of calibre
Posts: 44,356
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Here's the fixed recipe. The pesky nytimes was trying hard to insert more ads into the reader's experience.
Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class NYTimes(BasicNewsRecipe):

    title = 'The New York Times'
    __author__ = 'Kovid Goyal'
    description = 'Daily news from the New York Times'
    timefmt = ' [%a, %d %b, %Y]'
    needs_subscription = True
    remove_tags_before = dict(name='h1')
    remove_tags_after = dict(id='footer')
    remove_tags = [dict(attrs={'class':['articleTools', 'post-tools', 'side_tool']}),
                   dict(id=['footer', 'navigation', 'archive', 'side_search', 'blog_sidebar', 'side_tool', 'side_index']),
                   dict(name=['script', 'noscript'])]
    encoding = 'cp1252'
    no_stylesheets = True
    extra_css = 'h1 {font: sans-serif large;}\n.byline {font:monospace;}'

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://www.nytimes.com/auth/login')
            br.select_form(name='login')
            br['USERID'] = self.username
            br['PASSWORD'] = self.password
            br.submit()
        return br

    def parse_index(self):
        soup = self.index_to_soup('http://www.nytimes.com/pages/todayspaper/index.html')

        def feed_title(div):
            return ''.join(div.findAll(text=True, recursive=False)).strip()

        articles = {}
        key = None
        ans = []
        for div in soup.findAll(True,
                attrs={'class':['section-headline', 'story', 'story headline']}):

            if div['class'] == 'section-headline':
                key = string.capwords(feed_title(div))
                articles[key] = []
                ans.append(key)

            elif div['class'] in ['story', 'story headline']:
                a = div.find('a', href=True)
                if not a:
                    continue
                url = re.sub(r'\?.*', '', a['href'])
                url += '?pagewanted=print'
                title = self.tag_to_string(a, use_alt=True).strip()
                description = ''
                pubdate = strftime('%a, %d %b')
                summary = div.find(True, attrs={'class':'summary'})
                if summary:
                    description = self.tag_to_string(summary, use_alt=False)

                feed = key if key is not None else 'Uncategorized'
                if not articles.has_key(feed):
                    articles[feed] = []
                if not 'podcasts' in url:
                    articles[feed].append(
                        dict(title=title, url=url, date=pubdate,
                             description=description, content=''))
        ans = self.sort_index_by(ans, {'The Front Page':-1, 'Dining In, Dining Out':1, 'Obituaries':2})
        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans

    def preprocess_html(self, soup):
        refresh = soup.find('meta', {'http-equiv':'refresh'})
        if refresh is None:
            return soup
        content = refresh.get('content').partition('=')[2]
        raw = self.browser.open('http://www.nytimes.com'+content).read()
        return BeautifulSoup(raw.decode('cp1252', 'replace'))
|
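For anyone puzzling over the preprocess_html step in the recipe above: it appears to follow a meta-refresh redirect by taking everything after the first '=' in the tag's content attribute and appending it to the site root. Here is a minimal standalone sketch of just that string handling, in plain Python 3 rather than calibre code. The function name refresh_target and the sample content value are made up for illustration; only the partition('=') logic and the 'http://www.nytimes.com' prefix come from the recipe.

```python
def refresh_target(content, base='http://www.nytimes.com'):
    """Build the URL a meta-refresh 'content' string points at.

    Same idea as the recipe's content.partition('=')[2]: everything after
    the first '=' is treated as a site-relative path and appended to base.
    """
    path = content.partition('=')[2]
    return base + path


# Hypothetical content value, for illustration only.
print(refresh_target('0;url=/2008/11/01/example.html'))
```

If the content string has no '=' at all, partition returns an empty tail and the function falls back to the bare site root, which is worth knowing if the redirect format ever changes again.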
11-02-2008, 01:19 AM | #9 |
Groupie
Posts: 176
Karma: 406
Join Date: Jan 2008
Device: Amazon Kindle 2, Amazon Kindle, Sony PRS-505
|
Kovid, is it possible to use the regular page instead of the print-friendly version to grab the news articles? I understand that it was easier to parse the latter, but the print version doesn't have any of the nice news photos.
Normally pictures on the PRS aren't necessarily a top priority. But believe it or not, I now actually use Calibre to read the NY Times on my computer. It sounds crazy, but the advantage over the web version is that I can cruise linearly through the day's stories. It certainly beats a lot of mouse clicks. Plus, absolutely no ads. |
11-02-2008, 01:32 AM | #10 |
creator of calibre
Posts: 44,356
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
It's certainly possible, but one would have to write a lot of junk-HTML-stripping code. I lack the desire to do that since I don't read the nytimes. But patches are welcome.
And yeah, reading a calibre-produced ebook beats reading on the web any day. I do that on my tablet all the time, since I often have that and not my reader. |
11-02-2008, 11:57 PM | #11 |
Groupie
Posts: 176
Karma: 406
Join Date: Jan 2008
Device: Amazon Kindle 2, Amazon Kindle, Sony PRS-505
|
OK. I did a quick hack of Kovid's script. Disclaimer: I know nothing about Python. This is strictly a mimic/mod of his script. Also, I found a bug along the way. Although this profile should theoretically work, I have to manually edit out the "imported css" in the HTML files between the feeds2disk step and the html2epub step. Otherwise html2epub keeps reporting CSS selector errors and consumes as much as 2 GB of memory before it hangs.
Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class NYTimes(BasicNewsRecipe):

    title = 'NY Times'
    __author__ = 'Kovid Goyal'
    description = 'Daily news from the New York Times'
    timefmt = ' [%a, %d %b, %Y]'
    needs_subscription = True
    remove_tags_before = dict(id='article')
    remove_tags_after = dict(id='article')
    remove_tags = [dict(attrs={'class':['articleTools', 'post-tools', 'side_tool', 'nextArticleLink clearfix']}),
                   dict(id=['footer', 'toolsRight', 'articleInline', 'navigation', 'archive', 'side_search', 'blog_sidebar', 'side_tool', 'side_index']),
                   dict(name=['script', 'noscript'])]
    encoding = 'cp1252'
    no_stylesheets = True
    extra_css = 'h1 {font: sans-serif large;}\n.byline {font:monospace;}'

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://www.nytimes.com/auth/login')
            br.select_form(name='login')
            br['USERID'] = self.username
            br['PASSWORD'] = self.password
            br.submit()
        return br

    def parse_index(self):
        soup = self.index_to_soup('http://www.nytimes.com/pages/todayspaper/index.html')

        def feed_title(div):
            return ''.join(div.findAll(text=True, recursive=False)).strip()

        articles = {}
        key = None
        ans = []
        for div in soup.findAll(True,
                attrs={'class':['section-headline', 'story', 'story headline']}):

            if div['class'] == 'section-headline':
                key = string.capwords(feed_title(div))
                articles[key] = []
                ans.append(key)

            elif div['class'] in ['story', 'story headline']:
                a = div.find('a', href=True)
                if not a:
                    continue
                url = re.sub(r'\?.*', '', a['href'])
                url += '?pagewanted=all'
                title = self.tag_to_string(a, use_alt=True).strip()
                description = ''
                pubdate = strftime('%a, %d %b')
                summary = div.find(True, attrs={'class':'summary'})
                if summary:
                    description = self.tag_to_string(summary, use_alt=False)

                feed = key if key is not None else 'Uncategorized'
                if not articles.has_key(feed):
                    articles[feed] = []
                if not 'podcasts' in url:
                    articles[feed].append(
                        dict(title=title, url=url, date=pubdate,
                             description=description, content=''))
        ans = self.sort_index_by(ans, {'The Front Page':-1, 'Dining In, Dining Out':1, 'Obituaries':2})
        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans

    def preprocess_html(self, soup):
        refresh = soup.find('meta', {'http-equiv':'refresh'})
        if refresh is None:
            return soup
        content = refresh.get('content').partition('=')[2]
        raw = self.browser.open('http://www.nytimes.com'+content).read()
        return BeautifulSoup(raw.decode('cp1252', 'replace'))
|
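The two recipes in this thread fetch different page variants only through the query string they put on each article link: the first appends '?pagewanted=print', the modified one '?pagewanted=all'. Below is a minimal standalone sketch of that rewrite in plain Python 3. The helper name article_url, its mode parameter, and the sample link are hypothetical; the re.sub pattern and the two pagewanted values are taken from the recipes.

```python
import re


def article_url(href, mode='print'):
    """Drop any existing query string from href and request one page variant.

    mode='print' gives the print-friendly page used by the first recipe;
    mode='all' gives the regular single-page view used by the modified one.
    """
    base = re.sub(r'\?.*', '', href)  # strip '?' and everything after it
    return base + '?pagewanted=' + mode


# Hypothetical article link, for illustration only.
link = 'http://www.nytimes.com/2008/11/01/us/example.html?ref=todayspaper'
print(article_url(link))
print(article_url(link, mode='all'))
```

Because the only difference is one string, switching back to the print version if the regular page proves too noisy is a one-word change.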
11-03-2008, 12:05 AM | #12 |
creator of calibre
Posts: 44,356
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Change
Code:
dict(name=['script', 'noscript'])
to
Code:
dict(name=['script', 'noscript', 'style'])
|
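To see why adding 'style' helps with the imported-CSS trouble, here is a rough standalone sketch of what dropping those elements does to a page. This is not how calibre applies remove_tags (the recipe imports BeautifulSoup for its HTML handling); it is a plain Python 3 regex stand-in, and the function name strip_tags and the sample HTML are made up for illustration.

```python
import re


def strip_tags(html, names=('script', 'noscript', 'style')):
    """Remove whole elements (tags plus their contents) with the given names.

    A regex approximation for illustration only; it does not handle
    nested elements of the same name or malformed markup.
    """
    pattern = re.compile(
        r'<(%s)\b[^>]*>.*?</\1\s*>' % '|'.join(map(re.escape, names)),
        re.IGNORECASE | re.DOTALL,
    )
    return pattern.sub('', html)


# Hypothetical page fragment with an imported stylesheet.
page = ('<html><head><style type="text/css">@import url(x.css);</style></head>'
        '<body><p>Hi</p><script>var a=1;</script></body></html>')
print(strip_tags(page))
```

With 'style' in the list, the @import rule never reaches the conversion step, which is presumably why the manual editing described above stops being necessary.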
03-17-2009, 09:37 AM | #13 |
Addict
Posts: 385
Karma: 1010052
Join Date: Apr 2008
Device: (previous: Kindle 2, Kindle Fire) Kindle 4 WiFi, K3K, KPW
|
This is an old thread, but I searched and can't seem to find a more recent one about a problem I've had since I upgraded to .5: every time I download the NYT, it outputs a 0.1 MB file that doesn't have any content. That's for the non-subscription version. The subscription version outputs a 12 MB+ file and crashes my PRS-500 every time, so I can't use that. Did the recipe change for the NYT, or is it just no longer available?
|
03-17-2009, 09:51 AM | #14 | |
Zealot
Posts: 100
Karma: 18
Join Date: Oct 2006
Location: N.J.
Device: Sony Readers PRS-500 exchanged by Sony for PRS-600, PRS-505,IPAD3,mini
|
Quote:
I would reinstall the .50 version and then download the subscription NYT again and see if that fixes the problem. |
|
03-17-2009, 12:12 PM | #15 |
Addict
Posts: 385
Karma: 1010052
Join Date: Apr 2008
Device: (previous: Kindle 2, Kindle Fire) Kindle 4 WiFi, K3K, KPW
|
Thanks. I'll do that when I get home and see if it works. I was getting so frustrated not being able to get the NY Times that I almost broke down and bought a Kindle.
I still might. But you may have removed my excuse. |
|