MobileRead Forums - View Single Post - Trouble turning a non-RSS webpage into a "feed"

atsiong1 · 04-23-2017, 05:37 PM

Hi, I have been trying to rework this custom recipe example for the New York Times to create a custom recipe that will pull all the articles from this webpage (not an RSS feed).

In theory it seems straightforward enough: I have identified which HTML elements contain the feed (ul), the articles (li), the article title (a), the article link (a href) and the author name (i). But I am new to Python and to recipes, and each of my attempts so far has resulted in a “TypeError: 'NoneType' object is not iterable.”
My attempt:

Code:

import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class Adoption(BasicNewsRecipe):

    title       = 'Transracial Adoption/Interracial Adoption'
    __author__  = 'Mrs. Magoo'
    description = 'Articles from Pact Adopt'
    timefmt = ' [%a, %d %b, %Y]'
    remove_tags_before = dict(name='li')
    remove_tags_after  = dict(name='li')


    def parse_index(self):
        soup = self.index_to_soup('http://www.pactadopt.org/resources/transracial-adoption-interracial-adoption.html')

        def feed_title(ul):
            return ''.join(ul.findAll(text=True, recursive=False)).strip()

        articles = {}
        key = None
        ans = []
        for ul in soup.findAll(True,
             attrs={'name':['li']}):

                 url = re.sub(r'\?.*', '', a['href'])
                 title = self.tag_to_string(a, use_alt=True).strip()
                 author = self.tag_to_string(i, use_alt=True).strip()
                 description = ''
                 pubdate = strftime('%Y')
                 summary = ''

I’m sure I must be misunderstanding how the elements of my webpage map to the structure of the NYT recipe.

Does anyone have any pointers? I’ve really been enjoying using Calibre to pull in RSS feeds and would love to expand my skills to non-RSS webpages as well.

Thanks!
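
For reference, here is a minimal sketch of the mapping described above (ul as the feed, li as an article, a for title and link, i for author). It uses only the Python standard library rather than calibre's own soup helpers. The sample markup, the `ListItemParser` class and the `build_index` function are hypothetical stand-ins; the real page's structure may differ. calibre documents `parse_index` as returning a list of `(feed title, list of article dicts)` tuples, with keys such as `title`, `url`, `date` and `description`. The `author` key is carried along here on the assumption that it is useful. One likely cause of the TypeError is that `parse_index` as posted has no `return` statement, so it returns `None`.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the page; the real markup may differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/articles/one.html?ref=list">First Article</a> <i>Jane Doe</i></li>
  <li><a href="/articles/two.html">Second Article</a> <i>John Roe</i></li>
  <li>No link here</li>
</ul>
"""


class ListItemParser(HTMLParser):
    """Collect one record per <li>: link target, link text, <i> text."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished records, one per <li>
        self.current = None  # record for the <li> currently being read
        self.field = None    # 'title' while inside <a>, 'author' inside <i>

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.current = {'url': None, 'title': '', 'author': ''}
        elif self.current is not None and tag == 'a' and self.current['url'] is None:
            # Only the first link in each list item is used.
            self.current['url'] = dict(attrs).get('href')
            self.field = 'title'
        elif self.current is not None and tag == 'i':
            self.field = 'author'

    def handle_endtag(self, tag):
        if tag in ('a', 'i'):
            self.field = None
        elif tag == 'li' and self.current is not None:
            self.items.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.current is not None and self.field:
            self.current[self.field] += data


def build_index(html, feed_title):
    """Return data in the shape parse_index is expected to return."""
    parser = ListItemParser()
    parser.feed(html)
    articles = []
    for item in parser.items:
        if not item['url']:
            continue  # skip list items that have no link
        articles.append({
            'title': item['title'].strip(),
            'url': item['url'].split('?', 1)[0],  # drop any query string
            'author': item['author'].strip(),
            'description': '',
            'date': '',
        })
    # The explicit return matters: a function without one returns None.
    return [(feed_title, articles)]


if __name__ == '__main__':
    for title, articles in build_index(SAMPLE_HTML, 'Pact Adopt'):
        print(title, len(articles))
```

Inside a recipe, the same loop would presumably run over the soup returned by `self.index_to_soup(...)`, with `parse_index` ending in a `return` of the list of tuples.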
