Calibre-NY Times problem

moosejons_dad · 11-01-2008, 10:17 AM

Please take a look at the 11/1/08 edition of the NY Times. It downloads
OK onto the reader and shows the titles of all the newspaper's articles.
However, when you open a title to read the article, the page is blank.
Thanks for any assistance.

kovidgoyal · 11-01-2008, 11:16 AM

What is your output format? EPUB or LRF?

moosejons_dad · 11-01-2008, 03:55 PM

I am using LRF.

kovidgoyal · 11-01-2008, 04:44 PM

The Sunday NYTimes doesn't work with LRF; try using EPUB.

Acey · 11-01-2008, 10:12 PM

I have the same problem with today's (Saturday) NY Times. A good portion of the articles show up with only an ad when viewed using the ebook viewer in Calibre. Some of the articles are fine, but most are not. These affected articles show up blank on the reader itself.

moosejons_dad · 11-01-2008, 10:23 PM

I have a Sony PRS-500 reader, and it is my understanding that EPUB will not work on the 500.
If that is true, am I going to be unable to read the NY Times anymore?

kovidgoyal · 11-01-2008, 10:30 PM

I just had a look. Looks like the format of the website has changed. Will be fixed in the next release.

kovidgoyal · 11-01-2008, 10:55 PM

Here's the fixed recipe. The pesky NYTimes was trying hard to insert more ads into the reader's experience.

Code:

import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class NYTimes(BasicNewsRecipe):
    
    title       = 'The New York Times'
    __author__  = 'Kovid Goyal'
    description = 'Daily news from the New York Times'
    timefmt = ' [%a, %d %b, %Y]'
    needs_subscription = True
    
    remove_tags_before = dict(name='h1')
    remove_tags_after  = dict(id='footer')
    remove_tags = [dict(attrs={'class':['articleTools', 'post-tools', 'side_tool']}), 
                   dict(id=['footer', 'navigation', 'archive', 'side_search', 'blog_sidebar', 'side_tool', 'side_index']), 
                   dict(name=['script', 'noscript'])]
    encoding = 'cp1252'
    no_stylesheets = True
    extra_css = 'h1 {font: sans-serif large;}\n.byline {font:monospace;}'
    
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://www.nytimes.com/auth/login')
            br.select_form(name='login')
            br['USERID']   = self.username
            br['PASSWORD'] = self.password
            br.submit()
        return br
    
    def parse_index(self):
        soup = self.index_to_soup('http://www.nytimes.com/pages/todayspaper/index.html')
        
        def feed_title(div):
            return ''.join(div.findAll(text=True, recursive=False)).strip()
        
        articles = {}
        key = None
        ans = []
        for div in soup.findAll(True, 
            attrs={'class':['section-headline', 'story', 'story headline']}):
            
            if div['class'] == 'section-headline':
                key = string.capwords(feed_title(div))
                articles[key] = []
                ans.append(key)
            
            elif div['class'] in ['story', 'story headline']:
                a = div.find('a', href=True)
                if not a:
                    continue
                url = re.sub(r'\?.*', '', a['href'])
                url += '?pagewanted=print'
                title = self.tag_to_string(a, use_alt=True).strip()
                description = ''
                pubdate = strftime('%a, %d %b')
                summary = div.find(True, attrs={'class':'summary'})
                if summary:
                    description = self.tag_to_string(summary, use_alt=False)
                
                feed = key if key is not None else 'Uncategorized'
                if not articles.has_key(feed):
                    articles[feed] = []
                if not 'podcasts' in url:
                    articles[feed].append(
                                  dict(title=title, url=url, date=pubdate, 
                                       description=description,
                                       content=''))
        ans = self.sort_index_by(ans, {'The Front Page':-1, 'Dining In, Dining Out':1, 'Obituaries':2})
        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans
    
    def preprocess_html(self, soup):
        refresh = soup.find('meta', {'http-equiv':'refresh'})
        if refresh is None:
            return soup
        content = refresh.get('content').partition('=')[2]
        raw = self.browser.open('http://www.nytimes.com'+content).read()
        return BeautifulSoup(raw.decode('cp1252', 'replace'))
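
The redirect handling in preprocess_html above can be tried outside calibre. What follows is only a minimal sketch using the Python 3 standard library, not the recipe's own code: the RefreshFinder class, the refresh_target function, and the sample page are all made up for illustration. It assumes the interstitial page carries a meta refresh tag whose content looks like "0;url=/path", and it uses the same partition('=') trick as the recipe to pull out the target.

```python
from html.parser import HTMLParser


class RefreshFinder(HTMLParser):
    """Remember the content attribute of the first <meta http-equiv="refresh">."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_refresh = (attrs.get('http-equiv') or '').lower() == 'refresh'
        if tag == 'meta' and is_refresh and self.content is None:
            self.content = attrs.get('content')


def refresh_target(html, base='http://www.nytimes.com'):
    """Return the absolute URL a meta-refresh page points to, or None."""
    finder = RefreshFinder()
    finder.feed(html)
    if finder.content is None:
        return None  # ordinary article page, nothing to follow
    # Same idea as the recipe: keep everything after the first '='.
    return base + finder.content.partition('=')[2]


# Hypothetical interstitial page, invented for this example.
page = ('<html><head><meta http-equiv="refresh" '
        'content="0;url=/2008/11/01/example.html"></head></html>')
print(refresh_target(page))
print(refresh_target('<html><body><h1>Article</h1></body></html>'))
```

In the recipe the resulting URL is fetched with self.browser and re-parsed; the sketch stops at computing the URL so that it runs without calibre or network access.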

lovebeta · 11-02-2008, 01:19 AM

Kovid, is it possible to use the regular page instead of the print-friendly version to grab the news articles? I understand that the latter is easier to parse; however, the print version doesn't have any of the nice news photos.

Normally pictures on the PRS aren't necessarily a top priority. But believe it or not, I now actually use Calibre to read the NYTimes on my computer. It sounds crazy, but the advantage over the web version is that I can cruise linearly through the day's stories. It certainly beats a lot of mouse clicks. Plus absolutely no ads.

kovidgoyal · 11-02-2008, 01:32 AM

It's certainly possible, but one would have to write a lot of junk-HTML-stripping code. I lack the desire to do that since I don't read the NYTimes. But patches are welcome.

And yeah, reading a calibre-produced ebook beats reading on the web any day. I do that on my tablet all the time, since I often have that and not my reader.

lovebeta · 11-02-2008, 11:57 PM

OK. I did a quick hack of Kovid's script. Disclaimer: I know nothing about Python; this is strictly a mimic/mod of his script. I also found a bug along the way: although this profile should theoretically work, I have to manually edit out the "imported css" in the HTML files between the feeds2disk step and the html2epub step. Otherwise html2epub keeps reporting CSS selector errors and consumes as much as 2 GB of memory before it hangs.

Code:

import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class NYTimes(BasicNewsRecipe):
    
    title       = 'NY Times'
    __author__  = 'Kovid Goyal'
    description = 'Daily news from the New York Times'
    timefmt = ' [%a, %d %b, %Y]'
    needs_subscription = True

    remove_tags_before = dict(id='article')
    remove_tags_after  = dict(id='article')
    remove_tags = [dict(attrs={'class':['articleTools', 'post-tools', 'side_tool', 'nextArticleLink clearfix']}), 
                   dict(id=['footer', 'toolsRight', 'articleInline', 'navigation', 'archive', 'side_search', 'blog_sidebar', 'side_tool', 'side_index']), 
                   dict(name=['script', 'noscript'])]
    encoding = 'cp1252'
    no_stylesheets = True
    extra_css = 'h1 {font: sans-serif large;}\n.byline {font:monospace;}'
    
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://www.nytimes.com/auth/login')
            br.select_form(name='login')
            br['USERID']   = self.username
            br['PASSWORD'] = self.password
            br.submit()
        return br
    
    def parse_index(self):
        soup = self.index_to_soup('http://www.nytimes.com/pages/todayspaper/index.html')
        
        def feed_title(div):
            return ''.join(div.findAll(text=True, recursive=False)).strip()
        
        articles = {}
        key = None
        ans = []
        for div in soup.findAll(True, 
            attrs={'class':['section-headline', 'story', 'story headline']}):
            
            if div['class'] == 'section-headline':
                key = string.capwords(feed_title(div))
                articles[key] = []
                ans.append(key)
            
            elif div['class'] in ['story', 'story headline']:
                a = div.find('a', href=True)
                if not a:
                    continue
                url = re.sub(r'\?.*', '', a['href'])
                url += '?pagewanted=all'
                title = self.tag_to_string(a, use_alt=True).strip()
                description = ''
                pubdate = strftime('%a, %d %b')
                summary = div.find(True, attrs={'class':'summary'})
                if summary:
                    description = self.tag_to_string(summary, use_alt=False)
                
                feed = key if key is not None else 'Uncategorized'
                if not articles.has_key(feed):
                    articles[feed] = []
                if not 'podcasts' in url:
                    articles[feed].append(
                                  dict(title=title, url=url, date=pubdate, 
                                       description=description,
                                       content=''))
        ans = self.sort_index_by(ans, {'The Front Page':-1, 'Dining In, Dining Out':1, 'Obituaries':2})
        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans
    
    def preprocess_html(self, soup):
        refresh = soup.find('meta', {'http-equiv':'refresh'})
        if refresh is None:
            return soup
        content = refresh.get('content').partition('=')[2]
        raw = self.browser.open('http://www.nytimes.com'+content).read()
        return BeautifulSoup(raw.decode('cp1252', 'replace'))
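
The only change to parse_index in the modified recipe is the query string appended to each article URL. Here is a small standalone sketch of that rewrite, assuming plain Python 3; the article_url helper and the sample link are hypothetical and not part of either recipe. As far as I understand it, pagewanted=print requests the print-friendly page and pagewanted=all requests the regular page with the whole article on one page.

```python
import re


def article_url(href, pagewanted='print'):
    """Drop any existing query string, then ask for a specific page variant."""
    return re.sub(r'\?.*', '', href) + '?pagewanted=' + pagewanted


# Hypothetical link, invented for this example.
href = 'http://www.nytimes.com/2008/11/02/example.html?ref=todayspaper'
print(article_url(href))         # variant used by the original recipe
print(article_url(href, 'all'))  # variant used by the modified recipe
```

Because the regular page carries far more markup than the print page, the modified recipe also needs the narrower remove_tags_before/remove_tags_after settings and the longer remove_tags list shown above.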

kovidgoyal · 11-03-2008, 12:05 AM

Change

Code:

dict(name=['script', 'noscript'])
to
dict(name=['script', 'noscript', 'style'])

and it will work with html2epub as well
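
To see why removing style tags helps, here is a rough standalone illustration of the effect. It is not how calibre applies remove_tags (calibre works on the parsed soup); it is a plain Python 3 regex sketch, and strip_style_blocks and the sample page are invented for this example. The point is that once the style block is gone, the "imported css" that html2epub was choking on is gone with it.

```python
import re

# Matches a whole <style ...>...</style> block, across lines, in any letter case.
STYLE_BLOCK = re.compile(r'<style\b[^>]*>.*?</style\s*>', re.IGNORECASE | re.DOTALL)


def strip_style_blocks(html):
    """Return html with every inline <style> block removed."""
    return STYLE_BLOCK.sub('', html)


# Hypothetical article page with an imported stylesheet, invented for this example.
page = ('<html><head><style type="text/css">'
        '@import url(/css/article.css);</style></head>'
        '<body><h1>Headline</h1></body></html>')
cleaned = strip_style_blocks(page)
print(cleaned)
```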

weatherman · 03-17-2009, 09:37 AM

This is an old thread but I searched and can't seem to find a more recent one about a problem I've had since I upgraded to .5 - every time I download the NYT it outputs a .1mb file that doesn't have any content. That's for the non-subscription version. The subscription version outputs a 12mb+ file, and crashes my PRS-500 every time, so I can't use that. Did the recipe change for the NYT or is it just no longer available?

moosejons_dad · 03-17-2009, 09:51 AM

Quote:

Originally Posted by weatherman

This is an old thread but I searched and can't seem to find a more recent one about a problem I've had since I upgraded to .5 - every time I download the NYT it outputs a .1mb file that doesn't have any content. That's for the non-subscription version. The subscription version outputs a 12mb+ file, and crashes my PRS-500 every time, so I can't use that. Did the recipe change for the NYT or is it just no longer available?

The subscription NYT works for me. I use a PRS-500 for my reading, and I have also upgraded to the .5 version. The file is a little over 2 MB today.
I would reinstall the .50 version, then download the subscription NYT again and see if that fixes the problem.

weatherman · 03-17-2009, 12:12 PM

Thanks. I'll do that when I get home and see if it works. I was getting so frustrated not being able to get the NY Times that I almost broke down and bought a Kindle.

I still might. But you may have removed my excuse.
