Create Article Sections From Feed?

Finbar127 · 02-24-2011, 09:57 PM

Hi Everyone,

I have a local newspaper that lists all sections in the same feed:

http://www.mahopacnews.com/rssheadlines.xml

I have the following recipe to grab the articles:

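Spoiler:

Code:

class AdvancedUserRecipe1297969350(BasicNewsRecipe):
    title = u'Mahopac News'
    oldest_article = 6
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_attributes=['style']
    remove_javascript = True
    conversion_options = {'linearize_tables' : True}
    remove_tags = [dict(name='span', attrs={'class':'lp'})]
    extra_css = '.title {font-size: x-large; font-weight: bold}'

    feeds = [(u'News', u'http://www.mahopacnews.com/rssheadlines.xml')]

    def print_version(self,url):
        baseURL='http://www.mahopacnews.com/LPprintwindow.LASSO?-token.editorialcall='
        segments = url.split('-')
        printURL = baseURL + segments[5]
        return printURL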

However, I get all the articles in one section. I would like to split the articles into their appropriate sections (News, Sports, Opinions, etc.). On the feed page, each article has the appropriate section name at the beginning of the article title. Example:

News: Wild guest wows Mahopac Middle School students
News: Can Rotary Dream Team beat Harlem Magic Masters this year on March 11?
Opinion: Dressing for the next ice age

I added the following to the recipe which allowed me to filter the articles based upon a key word in the title:

Code:

def get_article_url(self, article):
    link = article.get('link')
    title = article.get('title')
    if 'News:' in title:
        return link

Is there a way to modify this so Calibre will run this on each key word then separate the articles into their appropriate sections?
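
The title convention described above can be split mechanically. Here is a minimal sketch in plain Python (no calibre needed), assuming the section name is whatever precedes the first colon in the title. The names split_section and default are hypothetical, chosen for illustration, and are not part of calibre:

```python
def split_section(title, default='Other'):
    """Return (section, headline) for a title such as 'News: Some headline'."""
    section, sep, headline = title.partition(':')
    if not sep or not section.strip():
        # No colon, or nothing before it: keep the title whole.
        return default, title.strip()
    return section.strip(), headline.strip()


titles = [
    'News: Wild guest wows Mahopac Middle School students',
    'Opinion: Dressing for the next ice age',
    'A headline without a section prefix',
]
for t in titles:
    print(split_section(t))
```

One caveat with this approach: any headline that happens to contain a colon is treated as having a section prefix, so checking the result against a list of known section names is probably safer.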

Starson17 · 02-25-2011, 07:48 AM

Quote:

Originally Posted by Finbar127

Is there a way to modify this so Calibre will run this on each key word then separate the articles into their appropriate sections?

See here:
https://www.mobileread.com/forums/sho...45&postcount=2

Finbar127 · 02-25-2011, 05:21 PM

Thanks. I tried adjusting that code to fit my recipe and I ended up with this:

Spoiler:

I get the following error:

line 32, in parse_feeds
pfeed = Feed()
NameError: global name 'Feed' is not defined

Do you know where I would define Feed in the recipe? Also, could you point me to a recipe where this particular chunk of code is used?

Starson17 · 02-25-2011, 08:54 PM

Quote:

Originally Posted by Finbar127

Do you know where I would define Feed in the recipe? Also, could you point me to a recipe where this particular chunk of code is used?

Add this import at the top of the recipe:

from calibre.web.feeds import Feed

The Reader's Digest recipe uses this code:

Spoiler:

Code:

#!/usr/bin/env  python
__license__   = 'GPL v3'
'''
'''
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.web.feeds import Feed


class ReadersDigest(BasicNewsRecipe):

    title       = 'Readers Digest'
    __author__  = 'BrianG'
    language = 'en'
    description = 'Readers Digest Feeds'
    no_stylesheets        = True
    use_embedded_content  = False
    oldest_article = 60
    max_articles_per_feed = 200

    language = 'en'
    remove_javascript     = True

    extra_css      = ''' h1 {font-family:georgia,serif;color:#000000;}
                        .mainHd{font-family:georgia,serif;color:#000000;}
                         h2 {font-family:Arial,Sans-serif;}
                        .name{font-family:Arial,Sans-serif; font-size:x-small;font-weight:bold; }
                        .date{font-family:Arial,Sans-serif; font-size:x-small ;color:#999999;}
                        .byline{font-family:Arial,Sans-serif; font-size:x-small ;}
                        .photoBkt{ font-size:x-small ;}
                        .vertPhoto{font-size:x-small ;}
                        .credits{font-family:Arial,Sans-serif; font-size:x-small ;color:gray;}
                        .credit{font-family:Arial,Sans-serif; font-size:x-small ;color:gray;}
                        .artTxt{font-family:georgia,serif;}
                        .caption{font-family:georgia,serif; font-size:x-small;color:#333333;}
                        .credit{font-family:georgia,serif; font-size:x-small;color:#999999;}
                        a:link{color:#CC0000;}
                        .breadcrumb{font-family:Arial,Sans-serif;font-size:x-small;}
                        '''


    remove_tags = [
        dict(name='h4', attrs={'class':'close'}),
        dict(name='div', attrs={'class':'fromLine'}),
        dict(name='img', attrs={'class':'colorTag'}),
        dict(name='div', attrs={'id':'sponsorArticleHeader'}),
        dict(name='div', attrs={'class':'horizontalAd'}),
        dict(name='div', attrs={'id':'imageCounterLeft'}),
        dict(name='div', attrs={'id':'commentsPrint'})
        ]


    feeds = [
            ('New in RD', 'http://feeds.rd.com/ReadersDigest'),
            ('Jokes', 'http://feeds.rd.com/ReadersDigestJokes'),
            ('Cartoons', 'http://feeds.rd.com/ReadersDigestCartoons'),
            ('Blogs','http://feeds.rd.com/ReadersDigestBlogs')
        ]

    cover_url = 'http://www.rd.com/images/logo-main-rd.gif'



#-------------------------------------------------------------------------------------------------

    def print_version(self, url):

        # Get the identity number of the current article and append it to the root print URL

        if url.find('/article') > 0:
            ident = url[url.find('/article')+8:url.find('.html?')-4]
            url = 'http://www.rd.com/content/printContent.do?contentId=' + ident

        elif url.find('/post') > 0:

            # in this case, have to get the page itself to derive the Print page.
            soup = self.index_to_soup(url)
            newsoup = soup.find('ul',attrs={'class':'printBlock'})
            url = 'http://www.rd.com' + newsoup('a')[0]['href']
            url = url[0:url.find('&Keep')]

        return url

#-------------------------------------------------------------------------------------------------

    def parse_index(self):

        pages = [
                ('Your America','http://www.rd.com/your-america-inspiring-people-and-stories', 'channelLeftContainer',{'class':'moreLeft'}),
                # useless recipes ('Living Healthy','http://www.rd.com/living-healthy', 'channelLeftContainer',{'class':'moreLeft'}),
                ('Advice and Know-How','http://www.rd.com/advice-and-know-how', 'channelLeftContainer',{'class':'moreLeft'})

            ]

        feeds = []

        for page in pages:
            section, url, divider, attrList = page
            newArticles = self.page_parse(url, divider, attrList)
            feeds.append((section,newArticles))

        # after the pages of the site have been processed, parse several RSS feeds for additional sections
        newfeeds = Feed()
        newfeeds = self.parse_rss()


        # The utility code in parse_rss returns a Feed object.  Convert each feed/article combination into a form suitable
        # for this module (parse_index).

        for feed in newfeeds:
            newArticles = []
            for article in feed.articles:
                newArt = {
                            'title' : article.title,
                            'url'   : article.url,
                            'date'  : article.date,
                            'description' : article.text_summary
                        }
                newArticles.append(newArt)


            # New and Blogs should be the first two feeds.
            if feed.title == 'New in RD':
                feeds.insert(0,(feed.title,newArticles))
            elif feed.title == 'Blogs':
                feeds.insert(1,(feed.title,newArticles))
            else:
                feeds.append((feed.title,newArticles))


        return feeds

#-------------------------------------------------------------------------------------------------

    def page_parse(self, mainurl, divider, attrList):

        articles = []
        mainsoup = self.index_to_soup(mainurl)
        for item in mainsoup.findAll(attrs=attrList):
            newArticle = {
                        'title' : item('img')[0]['alt'],
                        'url'   : 'http://www.rd.com'+item('a')[0]['href'],
                        'date'  : '',
                        'description' : ''
                    }
            articles.append(newArticle)



        return articles



#-------------------------------------------------------------------------------------------------

    def parse_rss (self):

        # Do the "official" parse_feeds first
        feeds = BasicNewsRecipe.parse_feeds(self)


        # Loop thru the articles in all feeds to find articles with "recipe" in it
        recipeArticles = []
        for curfeed in feeds:
            delList = []
            for a,curarticle in enumerate(curfeed.articles):
                if curarticle.title.upper().find('RECIPE') >= 0:
                    recipeArticles.append(curarticle)
                    delList.append(curarticle)
            if len(delList)>0:
                for d in delList:
                    index = curfeed.articles.index(d)
                    curfeed.articles[index:index+1] = []

        # If there are any recipes found, create a new Feed object and append.
        if len(recipeArticles) > 0:
            pfeed = Feed()
            pfeed.title = 'Recipes'
            pfeed.description = 'Recipe Feed (Virtual)'
            pfeed.image_url  = None
            pfeed.oldest_article = 30
            pfeed.id_counter = len(recipeArticles)
            # Create a new Feed, add the recipe articles, and then append
            # to "official" list of feeds
            pfeed.articles = recipeArticles[:]
            feeds.append(pfeed)

        return feeds
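
The conversion step in parse_index above (feed objects turned into a list of (section, list of article dicts) tuples) can be tried outside calibre. This is a minimal sketch under stated assumptions: Article and FeedStub are stand-ins that model only the attributes the recipe reads, and feeds_to_index is a hypothetical helper name. It also orders the sections with a stable sort rather than the insert(0, ...) / insert(1, ...) calls used in the recipe, so it is a variation, not the recipe's own method:

```python
from collections import namedtuple

# Stand-ins for calibre's Article and Feed objects (assumption: only the
# attributes read by the recipe are modelled here).
Article = namedtuple('Article', 'title url date text_summary')
FeedStub = namedtuple('FeedStub', 'title articles')


def feeds_to_index(feeds, first=('New in RD', 'Blogs')):
    """Build a list of (section title, [article dict, ...]) tuples."""
    index = []
    for feed in feeds:
        entries = [
            {
                'title': a.title,
                'url': a.url,
                'date': a.date,
                'description': a.text_summary,
            }
            for a in feed.articles
        ]
        index.append((feed.title, entries))
    # Stable sort: sections named in `first` come first, in that order;
    # all other sections keep their original relative order.
    rank = dict((name, i) for i, name in enumerate(first))
    index.sort(key=lambda pair: rank.get(pair[0], len(first)))
    return index


feeds = [
    FeedStub('Jokes', [Article('A joke', 'http://example.invalid/j', '', '')]),
    FeedStub('Blogs', []),
    FeedStub('New in RD', []),
]
print([section for section, _ in feeds_to_index(feeds)])
```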

Finbar127 · 02-26-2011, 12:18 AM

Thanks again. Like you said, my recipe was missing the following line:

Code:

from calibre.web.feeds import Feed

Adding it did the trick.

Here is the working recipe. It can probably be optimized, but it basically works and the resulting file looks great on my Kindle 3:

Spoiler:

Code:

from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.web.feeds import Feed

class AdvancedUserRecipe1297969350(BasicNewsRecipe):
    title = u'Mahopac News'
    description = 'Mahopac News Features'
    oldest_article = 6
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_attributes=['style'] 
    remove_javascript = True
    conversion_options = {'linearize_tables' : True}
    remove_tags = [dict(name='span', attrs={'class':'lp'})]
    extra_css = '.title {font-size: x-large; font-weight: bold}'

    feeds = [(u' ', u'http://www.mahopacnews.com/rssheadlines.xml')]

    def parse_feeds (self):
          feeds = BasicNewsRecipe.parse_feeds(self)
          newsArticles = []
          opinionArticles = []
          artsandleisureArticles = []
          sportsArticles = []
          announcementsArticles = []

          for curfeed in feeds:
                delList = []
                for a,curarticle in enumerate(curfeed.articles):
                      if curarticle.title.upper().find('NEWS:') >= 0:
                            newsArticles.append(curarticle)
                            delList.append(curarticle)
                if len(delList)>0:
                      for d in delList:
                            index = curfeed.articles.index(d)
                            curfeed.articles[index:index+1] = []

          if len(newsArticles) > 0:
                pfeed = Feed()
                pfeed.title = 'News'
                pfeed.description = 'News Feed (Virtual)'
                pfeed.image_url  = None
                pfeed.oldest_article = 6
                pfeed.id_counter = len(newsArticles)
                pfeed.articles = newsArticles[:]
                feeds.append(pfeed)
  
          for curfeed in feeds:
                delList = []
                for a,curarticle in enumerate(curfeed.articles):
                      if curarticle.title.upper().find('OPINION:') >= 0:
                            opinionArticles.append(curarticle)
                            delList.append(curarticle)
                if len(delList)>0:
                      for d in delList:
                            index = curfeed.articles.index(d)
                            curfeed.articles[index:index+1] = []

          if len(opinionArticles) > 0:
                pfeed = Feed()
                pfeed.title = 'Opinion'
                pfeed.description = 'Opinion Feed (Virtual)'
                pfeed.image_url  = None
                pfeed.oldest_article = 6
                pfeed.id_counter = len(opinionArticles)
                pfeed.articles = opinionArticles[:]
                feeds.append(pfeed)


          for curfeed in feeds:
                delList = []
                for a,curarticle in enumerate(curfeed.articles):
                      if curarticle.title.upper().find('ARTS AND LEISURE:') >= 0:
                            artsandleisureArticles.append(curarticle)
                            delList.append(curarticle)
                if len(delList)>0:
                      for d in delList:
                            index = curfeed.articles.index(d)
                            curfeed.articles[index:index+1] = []

          if len(artsandleisureArticles) > 0:
                pfeed = Feed()
                pfeed.title = 'Arts and Leisure'
                pfeed.description = 'Arts and Leisure Feed (Virtual)'
                pfeed.image_url  = None
                pfeed.oldest_article = 6
                pfeed.id_counter = len(artsandleisureArticles)
                pfeed.articles = artsandleisureArticles[:]
                feeds.append(pfeed)

          for curfeed in feeds:
                delList = []
                for a,curarticle in enumerate(curfeed.articles):
                      if curarticle.title.upper().find('SPORTS:') >= 0:
                            sportsArticles.append(curarticle)
                            delList.append(curarticle)
                if len(delList)>0:
                      for d in delList:
                            index = curfeed.articles.index(d)
                            curfeed.articles[index:index+1] = []

          if len(sportsArticles) > 0:
                pfeed = Feed()
                pfeed.title = 'Sports'
                pfeed.description = 'Sports Feed (Virtual)'
                pfeed.image_url  = None
                pfeed.oldest_article = 6
                pfeed.id_counter = len(sportsArticles)
                pfeed.articles = sportsArticles[:]
                feeds.append(pfeed)

          for curfeed in feeds:
                delList = []
                for a,curarticle in enumerate(curfeed.articles):
                      if curarticle.title.upper().find('ANNOUNCEMENTS:') >= 0:
                            announcementsArticles.append(curarticle)
                            delList.append(curarticle)
                if len(delList)>0:
                      for d in delList:
                            index = curfeed.articles.index(d)
                            curfeed.articles[index:index+1] = []

          if len(announcementsArticles) > 0:
                pfeed = Feed()
                pfeed.title = 'Announcements'
                pfeed.description = 'Announcements Feed (Virtual)'
                pfeed.image_url  = None
                pfeed.oldest_article = 6
                pfeed.id_counter = len(announcementsArticles)
                pfeed.articles = announcementsArticles[:]
                feeds.append(pfeed)

          return feeds

    def print_version(self,url):

          baseURL='http://www.mahopacnews.com/LPprintwindow.LASSO?-token.editorialcall='
          segments = url.split('-')
          printURL = baseURL + segments[5]
        
          return printURL
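
The five near-identical loops in parse_feeds above could probably be collapsed into one pass driven by a list of section names. This is a minimal sketch that runs outside calibre, so FeedStub and ArticleStub are stand-ins (assumptions) for calibre's Feed and Article objects, and split_into_sections is a hypothetical helper name. Inside a real recipe, make_feed would presumably be calibre's Feed, and the other attributes set in the recipe (description, image_url, oldest_article, id_counter) would still need to be assigned:

```python
class FeedStub(object):
    """Stand-in for calibre.web.feeds.Feed (only title and articles modelled)."""
    def __init__(self, title='', articles=None):
        self.title = title
        self.articles = list(articles or [])


class ArticleStub(object):
    """Stand-in for a calibre Article (only title modelled)."""
    def __init__(self, title):
        self.title = title


SECTIONS = ['News', 'Opinion', 'Arts and Leisure', 'Sports', 'Announcements']


def split_into_sections(feeds, sections=SECTIONS, make_feed=FeedStub):
    """Move articles whose title contains 'SECTION:' into one virtual feed per section."""
    buckets = dict((name, []) for name in sections)
    for feed in feeds:
        remaining = []
        for article in feed.articles:
            upper = article.title.upper()
            for name in sections:
                if name.upper() + ':' in upper:
                    buckets[name].append(article)
                    break
            else:
                # No section keyword matched: leave the article where it was.
                remaining.append(article)
        feed.articles = remaining
    # Append one virtual feed per non-empty section, in list order.
    for name in sections:
        if buckets[name]:
            virtual = make_feed()
            virtual.title = name
            virtual.articles = buckets[name]
            feeds.append(virtual)
    return feeds


feeds = [FeedStub(' ', [ArticleStub('News: a'), ArticleStub('Sports: b'), ArticleStub('Misc')])]
for f in split_into_sections(feeds):
    print(f.title, [a.title for a in f.articles])
```

Adding a section then only means adding a name to SECTIONS, and each section's emptiness check uses its own bucket, which avoids copy-and-paste slips between the repeated blocks.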

Starson17 · 02-26-2011, 08:55 AM

Quote:

Originally Posted by Finbar127

Thanks again. Like you said, my recipe was missing the following line:

Code:

from calibre.web.feeds import Feed

Adding it did the trick.

Here is the working recipe.

Thanks for confirming it works.
I added the import to the sticky code post, so others will be able to find it easily.
