Old 01-07-2011, 12:35 PM   #1
bcollier
Member
Posts: 22
Karma: 10
Join Date: Jan 2011
Device: Kindle DX
New York Times Descriptions Not Working

Hello,

This is my first time posting, but I hope to get more involved. I love Calibre and hope I can contribute back somehow. I have a few questions; #1 is the most important.

(1) One thing that could substantially improve the Calibre New York Times news subscription is having article descriptions in the menu. The descriptions appear in the menu for the Front Page articles, but not for any other articles (see the two pictures of my Kindle). When the description is blank, it should be easy to fill it in with the first two sentences of the article rather than leave it empty.

I spent several hours trying to customize the recipe to use the first two sentences, but couldn't figure out how to get hold of the body text.

In the handle_article function, these lines set the description, as far as I can tell:

    description = ''
    pubdate = strftime('%a, %d %b')
    summary = div.find(True, attrs={'class':'summary'})
    if summary:
        description = self.tag_to_string(summary, use_alt=False)

How would we update these lines to use the first two sentences of the article instead of the blank string?

(2) On the topic of the NYT, what is the best time of day to schedule the New York Times download? I've been using 6am, but at that time there are only 1 or 2 articles in the Front Page section. At 8am this morning I accidentally downloaded again and noticed the Front Page section had filled up to 7 articles. Has anyone experimented with this? I am now downloading the web version every hour to see at what time the NYT adds articles to these versions.

(3) Also on the topic of the NYT: for a long time they have been talking about a paywall for web content (http://www.nytimes.com/2010/01/21/bu...a/21times.html). Has anyone heard if/when this is going into effect (this month?) and how it will affect the Calibre download?

I would love to help with Calibre development. I write Python for work, so the language is no problem; I just need to learn the ins and outs of how the system works and what improvements are needed.
Attached Thumbnails: IMG_0399.JPG, IMG_0401.JPG
Old 01-07-2011, 02:29 PM   #2
kovidgoyal
creator of calibre
Posts: 43,857
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
1) Use the populate_article_metadata method.

3) It won't make any difference. calibre supports paywalled sites just fine; see WSJ for an example.
Old 01-07-2011, 04:06 PM   #3
bcollier
OK, thanks for the quick response. Where is the documentation for the article object being passed in? I'm just looking for the main article text and can't seem to get it in populate_article_metadata. I have:

    if len(article.text_summary) == 0:
        article.text_summary = "the first two sentences of the article"

Should I somehow pull the main article text from soup, or is it already parsed in the article object?


Old 01-07-2011, 04:31 PM   #4
kovidgoyal
Look at feeds.__init__.

You have to pull the content from the soup.
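
A rough sketch of that idea, written outside calibre so it can be run on its own: fill in the summary only when it is blank, using text taken from the page. Everything here is a stand-in. `StubArticle` is a hypothetical class that mimics only the `summary` and `text_summary` attributes mentioned in this thread, and `paragraphs` is a plain list of strings standing in for whatever you pull out of `soup` in a real recipe.

```python
class StubArticle(object):
    """Hypothetical stand-in for the article object calibre passes in."""

    def __init__(self, title, text_summary=''):
        self.title = title
        self.text_summary = text_summary
        self.summary = text_summary


def fill_blank_summary(article, paragraphs):
    """Set the summary from the first non-empty paragraph, but only if blank.

    `paragraphs` is a list of plain-text strings (an assumption: in a real
    recipe these would be extracted from the soup of the downloaded page).
    Returns True if the summary was filled in, False otherwise.
    """
    if article.text_summary and article.text_summary.strip():
        return False  # an existing description is left untouched
    for text in paragraphs:
        text = ' '.join(text.split())  # collapse runs of whitespace
        if text:
            article.summary = article.text_summary = text
            return True
    return False


article = StubArticle('Some headline')
fill_blank_summary(article, ['   ', 'First real paragraph.\nSecond line.'])
print(article.text_summary)
```

Inside a recipe, the same check would sit in populate_article_metadata, with the paragraph text coming from the soup argument.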
Old 01-08-2011, 08:58 AM   #5
GRiker
Comparer of the Ephemeris
Posts: 1,496
Karma: 424697
Join Date: Mar 2009
Device: iPad
This populate_article_metadata() function was once in the NYTimes recipe, but was removed at some point. You can use it as a point of reference:

Code:
def populate_article_metadata(self, article, soup, first):
    '''
    Extract author and description from article, add to article metadata
    '''
    def extract_author(soup):
        byline = soup.find('meta', attrs={'name': ['byl', 'CLMST']})
        if byline:
            return byline['content']
        # Try for <div class="byline">
        byline = soup.find('div', attrs={'class': 'byline'})
        if byline:
            # Use the plain text of the byline rather than its raw markup
            return self.tag_to_string(byline, use_alt=False)
        print(soup.prettify())
        return None

    def extract_description(soup):
        # Match the meta name with or without a trailing space
        description = soup.find('meta', attrs={'name': ['description', 'description ']})
        if description:
            # massageNCXText() is another method of the recipe (not shown)
            return self.massageNCXText(description['content'])
        # Take first paragraph of article
        articlebody = soup.find('div', attrs={'id': 'articlebody'})
        if not articlebody:
            # Try again with class instead of id
            articlebody = soup.find('div', attrs={'class': 'articlebody'})
            if not articlebody:
                print('postprocess_book.extract_description(): Did not find <div id="articlebody">:')
                print(soup.prettify())
                return None
        for p in articlebody.findAll('p'):
            # Skip empty paragraphs; return the first one that has text
            text = self.tag_to_string(p, use_alt=False).strip()
            if text:
                return self.massageNCXText(text)
        return None

    article.author = extract_author(soup)
    article.summary = article.text_summary = extract_description(soup)
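
The function above falls back to the first paragraph of the article. For the original goal of using only the first two sentences, that paragraph text could be trimmed with a small helper along these lines. This is only a sketch: `first_sentences` is a hypothetical name, not part of calibre, and the sentence split is a naive regex that will break on abbreviations such as "Mr." or "U.S.".

```python
import re


def first_sentences(text, count=2):
    """Return roughly the first `count` sentences of `text`.

    A sentence boundary is assumed to be '.', '!' or '?' followed by
    whitespace. That is a simplification, not real sentence detection.
    """
    text = ' '.join(text.split())  # normalise whitespace and newlines
    parts = re.split(r'(?<=[.!?])\s+', text, maxsplit=count)
    return ' '.join(parts[:count])


print(first_sentences('One sentence here. And a second!  Then a third? Yes.'))
```

Text with fewer than `count` sentences comes back unchanged apart from the whitespace clean-up.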


G
Old 01-10-2011, 03:15 PM   #6
bcollier
Thanks. Is there a reason "print" statements don't show up on the command line from within a recipe? When I do a print "hello world" from elsewhere in the application, it prints to the command line in Windows (when calling with calibre-debug -g). Or is there a method for writing to a log file? I have some strange things happening, and it would be helpful to see what is happening with the text.

Also, is there a way to see the MOBI metadata (the summaries for each article) without having to copy the file to my Kindle each time? MOBI readers will show the metadata for the whole book, but I don't see anything that does it for every article within the file.

Old 01-10-2011, 04:04 PM   #7
GRiker
Quote:
Originally Posted by bcollier
Is there a reason "print" statements don't show up on the command line from within a recipe? When I do a print "hello world" from elsewhere in the application, it prints to the command line in Windows (when calling with calibre-debug -g). Or is there a method for writing to a log file?
Use self.log() to print diagnostics.

Quote:
Also, is there a way to see the MOBI metadata (the summaries for each article) without having to copy the file to my Kindle each time?
I don't understand what you're asking. If you want to see the summaries while the recipe is being built, write a diagnostic subroutine to dump the metadata.
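
One way such a diagnostic subroutine might look, as a sketch only. `dump_summaries` is a hypothetical helper, and the list of (title, summary) pairs is an assumed input that you would build yourself from the article objects. The `log` argument is any callable that takes a string; inside a recipe you would pass self.log, and here it defaults to print so the example runs standalone.

```python
def dump_summaries(articles, log=print):
    """Log one line per article: its position, title and summary.

    `articles` is an iterable of (title, summary) pairs (an assumption made
    for this sketch). Blank summaries are flagged so they are easy to spot.
    """
    lines = []
    for index, (title, summary) in enumerate(articles, 1):
        shown = summary if summary else '<no summary>'
        lines.append('%2d. %s: %s' % (index, title, shown))
    for line in lines:
        log(line)
    return lines


dump_summaries([
    ('Front page story', 'A short description.'),
    ('Another section story', ''),
])
```

Returning the lines as well as logging them makes it easy to check the output without reading the log.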

G
Old 01-11-2011, 02:17 PM   #8
bcollier
Thanks! This worked great and sped up the work a lot. I'll start a new thread with my proposed changes to the NYT recipes.

All times are GMT -4.