Appending articles to a feed fails

leader_montanus · 05-05-2023, 09:38 PM

Hi,

I've been using Calibre for a few years and have also used the recipe function to download news every day. Most of the recipes I use are slightly modified and based on RSS feeds.

I'm currently stumped trying to add a few articles from an HTML page to a previously populated feed. I have looked at the example here of how to add articles and I have also looked at several existing recipes.

What I am doing is as follows:

Use a standard RSS feed definition:
Code:
```
feeds = [ ('Long Reads', 'https://longreads.com/feed/'), ]
```
(There are in fact several other RSS feeds here; for clarity I am showing only the relevant one.)

I have created a parse_feeds function that first runs the base parse_feeds function, then loops through all the feeds/articles to check for one particular page which is updated weekly (5 best long reads).

It then extracts the links on this page and tries to append them to the feeds list. The code is as follows:

Code:
```
    def parse_feeds(self):
        feeds = super(LongReads, self).parse_feeds()

        for articles in feeds:
            section = articles.title
            for article in articles:
                if article.url and 'longreads.com' in article.url:
                    raw = browser().open_novisit(article.url).read()
                    soup = BeautifulSoup(raw)
                    newArticles = []
                    for item in soup.findAll('a', attrs={'target': '_blank'}):
                        if item.parent.name == 'h3':
                            newArt = {}
                            newArt['title'] = item.string
                            newArt['url'] = item['href']
                            newArticles.append(newArt)
                    feeds.append((section, newArticles))
        return feeds
```

An example of the page being downloaded can be seen here: https://longreads.com/2023/04/21/the...-the-week-462/

The links are extracted correctly; the issue is that I always get the error 'tuple' object has no attribute 'title'. The example I based it on is obviously old, but I also see several newer recipes where using the append function on the feed works.

Outputting the feeds array shows this (excerpt), so obviously the links are added incorrectly:

Code:
```
____________________
Title       : SolarWinds: The Untold Story of the Boldest Supply-Chain Hack
URL         : https://www.wired.com/story/the-untold-story-of-solarwinds-the-boldest-supply-chain-hack-ever/
Author      : Kim Zetter
Summary     : The attackers were i...
Date        : Tue, 02 May, 2023 12:00
TOC thumb   : None
Has content : False

, ('section', [{'title': '1. A Trucker’s Kidnapping, a Suspicious Ransom, and a Colorado Family’s Perilous Quest for Justice', 'url': 'https://www.5280.com/a-truckers-kidnapping-a-suspicious-ransom-and-a-colorado-familys-perilous-quest-for-justice/?src=longreads'},
```

I have also tried to create an array of Feed objects, but when I use the append function it then complains that 'Feed' object has no attribute 'articles'.

Grateful for any help with this, there is obviously something simple that I cannot see...

kovidgoyal · 05-05-2023, 11:57 PM

feeds is a list of Feed objects. The form (title, list of articles) is used in parse_index(), not parse_feeds().
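To make the distinction concrete, here is a minimal sketch that reproduces the error outside calibre. The `Feed` class below is a hypothetical stand-in with only a `title` and an `articles` list, not calibre's real class, and `feed_titles` is an invented helper. The only point is that whatever consumes the result of parse_feeds() reads attributes such as `.title` from every element, so a tuple in that list fails.

```python
class Feed:
    """Hypothetical stand-in for calibre's Feed: just a title and a list of articles."""

    def __init__(self, title, articles):
        self.title = title
        self.articles = list(articles)


def feed_titles(feeds):
    """Mimic a consumer of parse_feeds() output, which expects Feed-like objects."""
    return [feed.title for feed in feeds]


feeds = [Feed('Long Reads', [{'title': 'Example', 'url': 'https://example.com/a'}])]

# Appending a parse_index()-style tuple mixes two shapes in one list.
mixed = feeds + [('section', [{'title': 'Extra', 'url': 'https://example.com/b'}])]
try:
    feed_titles(mixed)
    error = None
except AttributeError as exc:
    error = str(exc)

# Wrapping the same data in a Feed-like object keeps the list uniform.
fixed = feeds + [Feed('Long Reads picks', [{'title': 'Extra', 'url': 'https://example.com/b'}])]
titles = feed_titles(fixed)
```

In a real recipe the wrapping step would be done by calibre itself rather than by a hand-written class.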

leader_montanus · 05-06-2023, 03:51 PM

Quote:

Originally Posted by kovidgoyal

feeds is a list of Feed objects. The form (title, list of articles) is used in parse_index(), not parse_feeds().

Thanks for the quick response and for pointing me in the right direction. It shows why it's worth supporting Calibre! This should also teach me not to program at night when I'm tired.

The code that works in the end looks like this, using the built-in feeds_from_index function to create Feed objects:

Code:
```
# imports needed at the top of the recipe file
from datetime import date

from calibre import browser
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre.web.feeds import feeds_from_index
from calibre.web.feeds.news import BasicNewsRecipe


class LongReads(BasicNewsRecipe):
    # title, feeds and other settings omitted

    # override parse_feeds, then add the links from the Long Reads HTML page to the feeds list
    def parse_feeds(self):
        feeds = super(LongReads, self).parse_feeds()

        # loop through existing articles until we hit one from the Long Reads website
        newArticles = []
        for curfeed in feeds:
            for curarticle in curfeed.articles:

                # found the Long Reads page, extract links and summary using standard BS functions
                if curarticle.url and 'longreads.com' in curarticle.url:
                    raw = browser().open_novisit(curarticle.url).read()
                    soup = BeautifulSoup(raw)
                    for item in soup.findAll('a', attrs={'target': '_blank'}):
                        if item.parent.name == 'h3':
                            # found a link, create a new dictionary entry in basic article format and add to list
                            newArticles.append({
                                'title': item.string,
                                'date': date.today(),
                                'url': item['href'],
                                'description': item.parent.findNext('p').findNext('p').contents[0]
                            })

                    # if there are any links, create new Feed objects and add them
                    if newArticles:
                        # use built-in function to create Feed objects from list of dictionaries with article info
                        newfeeds = feeds_from_index(
                            [('Long Reads', newArticles)],
                            oldest_article=self.oldest_article,
                            max_articles_per_feed=self.max_articles_per_feed)

                        # add the new Feed objects to the existing feed list one by one
                        feeds.extend(newfeeds)

                    # finally delete the original feed, as the page in it is just a link page
                    feeds.remove(curfeed)
                    return feeds

        # in case the Long Reads page was not downloaded, this is the catch-all for returning feeds
        return feeds
```
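The link-extraction step can be tried without calibre or a network connection. The sketch below uses Python's standard-library `html.parser` in place of BeautifulSoup (a swapped-in technique, not what the recipe uses), and the sample HTML is made up to resemble the structure the recipe relies on: an `<a target="_blank">` directly inside an `<h3>`. `HeadingLinkParser`, `VOID_TAGS` and `SAMPLE_HTML` are hypothetical names.

```python
from html.parser import HTMLParser

# Made-up sample; the real page layout may differ.
SAMPLE_HTML = """
<h3><a href="https://example.com/one" target="_blank">1. First story</a></h3>
<p>Author</p><p>Summary of the first story.</p>
<p><a href="https://example.com/ignored" target="_blank">Not in a heading</a></p>
<h3><a href="https://example.com/two" target="_blank">2. Second story</a></h3>
"""

# Tags that never have a closing tag, so they must not be tracked as parents.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class HeadingLinkParser(HTMLParser):
    """Collect {'title', 'url'} dicts for <a target="_blank"> directly inside <h3>."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, innermost last
        self.current = None  # article dict being filled while inside a matching <a>
        self.articles = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        parent = self.stack[-1] if self.stack else None
        if tag == 'a' and parent == 'h3' and attrs.get('target') == '_blank':
            self.current = {'title': '', 'url': attrs.get('href')}
        self.stack.append(tag)

    def handle_data(self, data):
        if self.current is not None:
            self.current['title'] += data

    def handle_endtag(self, tag):
        if tag == 'a' and self.current is not None:
            self.articles.append(self.current)
            self.current = None
        if tag in self.stack:
            # pop back to (and including) the innermost matching open tag
            while self.stack.pop() != tag:
                pass


parser = HeadingLinkParser()
parser.feed(SAMPLE_HTML)
articles = parser.articles
```

The resulting dictionaries use the same 'title' and 'url' keys that the recipe above passes on to feeds_from_index.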


