Old 02-24-2011, 09:57 PM   #1
Finbar127
Member
Finbar127 began at the beginning.
 
Posts: 11
Karma: 10
Join Date: Feb 2011
Device: Kindle 3
Create Article Sections From Feed?

Hi Everyone,

I have a local newspaper that lists all sections in the same feed:

http://www.mahopacnews.com/rssheadlines.xml

I have the following recipe to grab the articles

Spoiler:
Code:
class AdvancedUserRecipe1297969350(BasicNewsRecipe):
    title = u'Mahopac News'
    oldest_article = 6
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_attributes=['style'] 
    remove_javascript = True
    conversion_options = {'linearize_tables' : True}
    remove_tags = [dict(name='span', attrs={'class':'lp'})]
    extra_css = '.title {font-size: x-large; font-weight: bold}'

    feeds = [(u'News', u'http://www.mahopacnews.com/rssheadlines.xml')]
    
    def print_version(self,url):

          baseURL='http://www.mahopacnews.com/LPprintwindow.LASSO?-token.editorialcall='
          segments = url.split('-')
          printURL = baseURL + segments[5]
        
          return printURL


However, I get all the articles in one section. I would like to split them into their appropriate sections (News, Sports, Opinion, etc.). In the feed, each article title begins with its section name. Example:

News: Wild guest wows Mahopac Middle School students
News: Can Rotary Dream Team beat Harlem Magic Masters this year on March 11?
Opinion: Dressing for the next ice age
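
The prefix convention in those titles can be pulled apart with plain string handling. Here is a minimal sketch, assuming every title follows the "Section: Headline" pattern. The name split_section and the 'Other' fallback are made up for illustration and are not part of calibre.

```python
def split_section(title, default='Other'):
    """Return (section, headline) for a title shaped like 'Section: Headline'.

    Titles without a usable prefix are filed under `default` (a hypothetical
    catch-all section name).
    """
    section, sep, headline = title.partition(':')
    if not sep or not headline.strip():
        # No colon at all, or nothing after it: treat the whole title as the headline.
        return default, title.strip()
    return section.strip(), headline.strip()


if __name__ == '__main__':
    for t in ['News: Wild guest wows Mahopac Middle School students',
              'Opinion: Dressing for the next ice age']:
        print(split_section(t))
```

One caveat: a headline that merely contains a colon but has no section prefix would be misfiled, so checking the result against a list of known section names is probably safer.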

I added the following to the recipe which allowed me to filter the articles based upon a key word in the title:

Code:
def get_article_url(self, article):
    link = article.get('link')
    title = article.get('title')
    # Only 'News:' titles return a link; any other title falls through and returns None.
    if 'News:' in title:
        return link
Is there a way to modify this so Calibre will run this on each key word then separate the articles into their appropriate sections?

Last edited by Finbar127; 02-24-2011 at 10:05 PM.
Old 02-25-2011, 07:48 AM   #2
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Finbar127 View Post
Is there a way to modify this so Calibre will run this on each key word then separate the articles into their appropriate sections?
See here:
https://www.mobileread.com/forums/sho...45&postcount=2
Old 02-25-2011, 05:21 PM   #3
Finbar127
Thanks. I tried adjusting that code to fit my recipe and I ended up with this:

Spoiler:
Code:
class AdvancedUserRecipe1297969350(BasicNewsRecipe):
    title = u'Mahopac News'
    oldest_article = 6
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_attributes=['style'] 
    remove_javascript = True
    conversion_options = {'linearize_tables' : True}
    remove_tags = [dict(name='span', attrs={'class':'lp'})]
    extra_css = '.title {font-size: x-large; font-weight: bold}'

    feeds = [(u'Headlines', u'http://www.mahopacnews.com/rssheadlines.xml')]

    def parse_feeds (self):
          feeds = BasicNewsRecipe.parse_feeds(self)
          opinionArticles = []
          for curfeed in feeds:
                delList = []
                for a,curarticle in enumerate(curfeed.articles):
                      if curarticle.title.upper().find('OPINION:') >= 0:
                            opinionArticles.append(curarticle)
                            delList.append(curarticle)
                if len(delList)>0:
                      for d in delList:
                            index = curfeed.articles.index(d)
                            curfeed.articles[index:index+1] = []

          if len(opinionArticles) > 0:
                pfeed = Feed()
                pfeed.title = 'Opinion'
                pfeed.description = 'Opinion Feed (Virtual)'
                pfeed.image_url  = None
                pfeed.oldest_article = 30
                pfeed.id_counter = len(opinionArticles)
                pfeed.articles = opinionArticles[:]
                feeds.append(pfeed)

          return feeds

    def print_version(self,url):

          baseURL='http://www.mahopacnews.com/LPprintwindow.LASSO?-token.editorialcall='
          segments = url.split('-')
          printURL = baseURL + segments[5]
        
          return printURL


I get the following error:

line 32, in parse_feeds
pfeed = Feed()
NameError: global name 'Feed' is not defined

Do you know where I would define Feed in the recipe? Also, could you point me to a recipe where this particular chunk of code is used?

Last edited by Finbar127; 02-25-2011 at 05:28 PM.
Old 02-25-2011, 08:54 PM   #4
Starson17
Quote:
Originally Posted by Finbar127 View Post
Do you know where I would define Feed in the recipe? Also, could you point me to a recipe where this particular chunk of code is used?
Add this import at the top of the recipe:
from calibre.web.feeds import Feed
The Reader's Digest recipe, which uses this code, follows:
Spoiler:
Code:
#!/usr/bin/env  python
__license__   = 'GPL v3'
'''
'''
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.web.feeds import Feed


class ReadersDigest(BasicNewsRecipe):

    title       = 'Readers Digest'
    __author__  = 'BrianG'
    language = 'en'
    description = 'Readers Digest Feeds'
    no_stylesheets        = True
    use_embedded_content  = False
    oldest_article = 60
    max_articles_per_feed = 200

    remove_javascript     = True

    extra_css      = ''' h1 {font-family:georgia,serif;color:#000000;}
                        .mainHd{font-family:georgia,serif;color:#000000;}
                         h2 {font-family:Arial,Sans-serif;}
                        .name{font-family:Arial,Sans-serif; font-size:x-small;font-weight:bold; }
                        .date{font-family:Arial,Sans-serif; font-size:x-small ;color:#999999;}
                        .byline{font-family:Arial,Sans-serif; font-size:x-small ;}
                        .photoBkt{ font-size:x-small ;}
                        .vertPhoto{font-size:x-small ;}
                        .credits{font-family:Arial,Sans-serif; font-size:x-small ;color:gray;}
                        .credit{font-family:Arial,Sans-serif; font-size:x-small ;color:gray;}
                        .artTxt{font-family:georgia,serif;}
                        .caption{font-family:georgia,serif; font-size:x-small;color:#333333;}
                        .credit{font-family:georgia,serif; font-size:x-small;color:#999999;}
                        a:link{color:#CC0000;}
                        .breadcrumb{font-family:Arial,Sans-serif;font-size:x-small;}
                        '''


    remove_tags = [
        dict(name='h4', attrs={'class':'close'}),
        dict(name='div', attrs={'class':'fromLine'}),
        dict(name='img', attrs={'class':'colorTag'}),
        dict(name='div', attrs={'id':'sponsorArticleHeader'}),
        dict(name='div', attrs={'class':'horizontalAd'}),
        dict(name='div', attrs={'id':'imageCounterLeft'}),
        dict(name='div', attrs={'id':'commentsPrint'})
        ]


    feeds = [
            ('New in RD', 'http://feeds.rd.com/ReadersDigest'),
            ('Jokes', 'http://feeds.rd.com/ReadersDigestJokes'),
            ('Cartoons', 'http://feeds.rd.com/ReadersDigestCartoons'),
            ('Blogs','http://feeds.rd.com/ReadersDigestBlogs')
        ]

    cover_url = 'http://www.rd.com/images/logo-main-rd.gif'



#-------------------------------------------------------------------------------------------------

    def print_version(self, url):

        # Get the identity number of the current article and append it to the root print URL

        if url.find('/article') > 0:
            ident = url[url.find('/article')+8:url.find('.html?')-4]
            url = 'http://www.rd.com/content/printContent.do?contentId=' + ident

        elif url.find('/post') > 0:

            # in this case, have to get the page itself to derive the Print page.
            soup = self.index_to_soup(url)
            newsoup = soup.find('ul',attrs={'class':'printBlock'})
            url = 'http://www.rd.com' + newsoup('a')[0]['href']
            url = url[0:url.find('&Keep')]

        return url

#-------------------------------------------------------------------------------------------------

    def parse_index(self):

        pages = [
                ('Your America','http://www.rd.com/your-america-inspiring-people-and-stories', 'channelLeftContainer',{'class':'moreLeft'}),
                # useless recipes ('Living Healthy','http://www.rd.com/living-healthy', 'channelLeftContainer',{'class':'moreLeft'}),
                ('Advice and Know-How','http://www.rd.com/advice-and-know-how', 'channelLeftContainer',{'class':'moreLeft'})

            ]

        feeds = []

        for page in pages:
            section, url, divider, attrList = page
            newArticles = self.page_parse(url, divider, attrList)
            feeds.append((section,newArticles))

        # after the pages of the site have been processed, parse several RSS feeds for additional sections
        newfeeds = self.parse_rss()


        # The utility code in parse_rss returns a list of Feed objects.  Convert each feed/article combination into a form suitable
        # for this module (parse_index).

        for feed in newfeeds:
            newArticles = []
            for article in feed.articles:
                newArt = {
                            'title' : article.title,
                            'url'   : article.url,
                            'date'  : article.date,
                            'description' : article.text_summary
                        }
                newArticles.append(newArt)


            # New and Blogs should be the first two feeds.
            if feed.title == 'New in RD':
                feeds.insert(0,(feed.title,newArticles))
            elif feed.title == 'Blogs':
                feeds.insert(1,(feed.title,newArticles))
            else:
                feeds.append((feed.title,newArticles))


        return feeds

#-------------------------------------------------------------------------------------------------

    def page_parse(self, mainurl, divider, attrList):

        articles = []
        mainsoup = self.index_to_soup(mainurl)
        for item in mainsoup.findAll(attrs=attrList):
            newArticle = {
                        'title' : item('img')[0]['alt'],
                        'url'   : 'http://www.rd.com'+item('a')[0]['href'],
                        'date'  : '',
                        'description' : ''
                    }
            articles.append(newArticle)



        return articles



#-------------------------------------------------------------------------------------------------

    def parse_rss (self):

        # Do the "official" parse_feeds first
        feeds = BasicNewsRecipe.parse_feeds(self)


        # Loop thru the articles in all feeds to find articles with "recipe" in it
        recipeArticles = []
        for curfeed in feeds:
            delList = []
            for a,curarticle in enumerate(curfeed.articles):
                if curarticle.title.upper().find('RECIPE') >= 0:
                    recipeArticles.append(curarticle)
                    delList.append(curarticle)
            if len(delList)>0:
                for d in delList:
                    index = curfeed.articles.index(d)
                    curfeed.articles[index:index+1] = []

        # If there are any recipes found, create a new Feed object and append.
        if len(recipeArticles) > 0:
            pfeed = Feed()
            pfeed.title = 'Recipes'
            pfeed.description = 'Recipe Feed (Virtual)'
            pfeed.image_url  = None
            pfeed.oldest_article = 30
            pfeed.id_counter = len(recipeArticles)
            # Create a new Feed, add the recipe articles, and then append
            # to "official" list of feeds
            pfeed.articles = recipeArticles[:]
            feeds.append(pfeed)

        return feeds

Last edited by Starson17; 02-25-2011 at 09:21 PM.
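
The NameError reported earlier in the thread comes from calling Feed() without importing it. The virtual-feed block in parse_rss can also be factored into a small helper, sketched below. The helper name make_virtual_feed and the stand-in Feed class are assumptions for illustration. The stand-in exists only so the snippet runs outside calibre; inside a recipe the real import is used.

```python
try:
    # Resolves when the code runs inside calibre (e.g. as part of a recipe).
    from calibre.web.feeds import Feed
except ImportError:
    class Feed(object):
        """Hypothetical stand-in so this sketch can run without calibre."""
        pass


def make_virtual_feed(title, articles, oldest_article=6):
    """Wrap already-parsed articles in a new Feed, mirroring the recipe code above."""
    feed = Feed()
    feed.title = title
    feed.description = title + ' Feed (Virtual)'
    feed.image_url = None
    feed.oldest_article = oldest_article
    feed.id_counter = len(articles)
    feed.articles = list(articles)  # copy, like recipeArticles[:] in the recipe
    return feed
```

With this in place, the tail of parse_rss could shrink to something like feeds.append(make_virtual_feed('Recipes', recipeArticles, 30)), guarded by the same non-empty check.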
Old 02-26-2011, 12:18 AM   #5
Finbar127
Thanks again. As you said, my recipe was missing the following line:

Code:
from calibre.web.feeds import Feed
Adding it did the trick.

Here is the working recipe. It can probably be optimized, but it basically works and the resulting file looks great on my Kindle 3:

Spoiler:
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.web.feeds import Feed

class AdvancedUserRecipe1297969350(BasicNewsRecipe):
    title = u'Mahopac News'
    description = 'Mahopac News Features'
    oldest_article = 6
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_attributes=['style'] 
    remove_javascript = True
    conversion_options = {'linearize_tables' : True}
    remove_tags = [dict(name='span', attrs={'class':'lp'})]
    extra_css = '.title {font-size: x-large; font-weight: bold}'

    feeds = [(u' ', u'http://www.mahopacnews.com/rssheadlines.xml')]

    def parse_feeds (self):
          feeds = BasicNewsRecipe.parse_feeds(self)
          newsArticles = []
          opinionArticles = []
          artsandleisureArticles = []
          sportsArticles = []
          announcementsArticles = []

          for curfeed in feeds:
                delList = []
                for a,curarticle in enumerate(curfeed.articles):
                      if curarticle.title.upper().find('NEWS:') >= 0:
                            newsArticles.append(curarticle)
                            delList.append(curarticle)
                if len(delList)>0:
                      for d in delList:
                            index = curfeed.articles.index(d)
                            curfeed.articles[index:index+1] = []

          if len(newsArticles) > 0:
                pfeed = Feed()
                pfeed.title = 'News'
                pfeed.description = 'News Feed (Virtual)'
                pfeed.image_url  = None
                pfeed.oldest_article = 6
                pfeed.id_counter = len(newsArticles)
                pfeed.articles = newsArticles[:]
                feeds.append(pfeed)
  
          for curfeed in feeds:
                delList = []
                for a,curarticle in enumerate(curfeed.articles):
                      if curarticle.title.upper().find('OPINION:') >= 0:
                            opinionArticles.append(curarticle)
                            delList.append(curarticle)
                if len(delList)>0:
                      for d in delList:
                            index = curfeed.articles.index(d)
                            curfeed.articles[index:index+1] = []

          if len(opinionArticles) > 0:
                pfeed = Feed()
                pfeed.title = 'Opinion'
                pfeed.description = 'Opinion Feed (Virtual)'
                pfeed.image_url  = None
                pfeed.oldest_article = 6
                pfeed.id_counter = len(opinionArticles)
                pfeed.articles = opinionArticles[:]
                feeds.append(pfeed)


          for curfeed in feeds:
                delList = []
                for a,curarticle in enumerate(curfeed.articles):
                      if curarticle.title.upper().find('ARTS AND LEISURE:') >= 0:
                            artsandleisureArticles.append(curarticle)
                            delList.append(curarticle)
                if len(delList)>0:
                      for d in delList:
                            index = curfeed.articles.index(d)
                            curfeed.articles[index:index+1] = []

          if len(artsandleisureArticles) > 0:
                pfeed = Feed()
                pfeed.title = 'Arts and Leisure'
                pfeed.description = 'Arts and Leisure Feed (Virtual)'
                pfeed.image_url  = None
                pfeed.oldest_article = 6
                pfeed.id_counter = len(artsandleisureArticles)
                pfeed.articles = artsandleisureArticles[:]
                feeds.append(pfeed)

          for curfeed in feeds:
                delList = []
                for a,curarticle in enumerate(curfeed.articles):
                      if curarticle.title.upper().find('SPORTS:') >= 0:
                            sportsArticles.append(curarticle)
                            delList.append(curarticle)
                if len(delList)>0:
                      for d in delList:
                            index = curfeed.articles.index(d)
                            curfeed.articles[index:index+1] = []

          if len(sportsArticles) > 0:
                pfeed = Feed()
                pfeed.title = 'Sports'
                pfeed.description = 'Sports Feed (Virtual)'
                pfeed.image_url  = None
                pfeed.oldest_article = 6
                pfeed.id_counter = len(sportsArticles)
                pfeed.articles = sportsArticles[:]
                feeds.append(pfeed)

          for curfeed in feeds:
                delList = []
                for a,curarticle in enumerate(curfeed.articles):
                      if curarticle.title.upper().find('ANNOUNCEMENTS:') >= 0:
                            announcementsArticles.append(curarticle)
                            delList.append(curarticle)
                if len(delList)>0:
                      for d in delList:
                            index = curfeed.articles.index(d)
                            curfeed.articles[index:index+1] = []

          if len(announcementsArticles) > 0:
                pfeed = Feed()
                pfeed.title = 'Announcements'
                pfeed.description = 'Announcements Feed (Virtual)'
                pfeed.image_url  = None
                pfeed.oldest_article = 6
                pfeed.id_counter = len(announcementsArticles)
                pfeed.articles = announcementsArticles[:]
                feeds.append(pfeed)

          return feeds

    def print_version(self,url):

          baseURL='http://www.mahopacnews.com/LPprintwindow.LASSO?-token.editorialcall='
          segments = url.split('-')
          printURL = baseURL + segments[5]
        
          return printURL

Last edited by Finbar127; 02-26-2011 at 12:21 AM.
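
On the "can probably be optimized" point: the five near-identical loops in parse_feeds could be driven from a single list of section names. The sketch below covers only the grouping step and runs without calibre. The names SECTIONS, Article and group_by_section are hypothetical; Article is a stand-in for calibre's article objects, of which only the .title attribute is used here.

```python
from collections import namedtuple

# Hypothetical stand-in for calibre's article objects; only .title is needed.
Article = namedtuple('Article', 'title')

# Section names in the same order the recipe checks them.
SECTIONS = ['News', 'Opinion', 'Arts and Leisure', 'Sports', 'Announcements']


def group_by_section(articles, sections=SECTIONS):
    """Sort articles into per-section lists by the 'SECTION:' marker in the title.

    Returns (groups, leftovers): groups maps each section name to its articles,
    leftovers holds articles whose titles matched no section.
    """
    groups = {name: [] for name in sections}
    leftovers = []
    for article in articles:
        title = article.title.upper()
        for name in sections:
            # Same substring test as the recipe: 'NEWS:' anywhere in the title.
            if name.upper() + ':' in title:
                groups[name].append(article)
                break
        else:
            leftovers.append(article)
    return groups, leftovers


if __name__ == '__main__':
    sample = [Article('News: Wild guest wows Mahopac Middle School students'),
              Article('Opinion: Dressing for the next ice age')]
    groups, leftovers = group_by_section(sample)
    print({name: len(found) for name, found in groups.items()}, len(leftovers))
```

Inside parse_feeds, each non-empty group would then be wrapped in a Feed just as the recipe does by hand, and curfeed.articles would be replaced by the leftovers. Keying every check to its own list also avoids copy-paste slips like testing the wrong list before appending a section.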
Old 02-26-2011, 08:55 AM   #6
Starson17
Quote:
Originally Posted by Finbar127 View Post
Thanks again. Like you said my recipe was missing the following line:

Code:
 
from calibre.web.feeds import Feed
Adding it did the trick.

Here is the working recipe.
Thanks for confirming it works.
I added the import to the sticky code post, so others will be able to find it easily.
All times are GMT -4.