Trouble turning a non-RSS webpage into a "feed"

atsiong1 · 04-23-2017, 04:37 PM

Hi, I have been trying to rework this custom recipe example for the New York Times to create a custom recipe that will pull all the articles from this webpage (not an RSS feed).

In theory it seems straightforward enough: I have identified which HTML elements contain the feed (ul), articles (li), article title (a), article link (a href) and author name (i). But I am new to Python and to recipes, and each of my attempts so far has resulted in a “TypeError: 'NoneType' object is not iterable.”
My attempt:

Code:

import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class Adoption(BasicNewsRecipe):

    title       = 'Transracial Adoption/Interracial Adoption'
    __author__  = 'Mrs. Magoo'
    description = 'Articles from Pact Adopt'
    timefmt = ' [%a, %d %b, %Y]'
    remove_tags_before = dict(name='li')
    remove_tags_after  = dict(name='li')


    def parse_index(self):
        soup = self.index_to_soup('http://www.pactadopt.org/resources/transracial-adoption-interracial-adoption.html')

        def feed_title(ul):
            return ''.join(ul.findAll(text=True, recursive=False)).strip()

        articles = {}
        key = None
        ans = []
        for ul in soup.findAll(True,
             attrs={'name':['li']}):

                 url = re.sub(r'\?.*', '', a['href'])
                 title = self.tag_to_string(a, use_alt=True).strip()
                 author = self.tag_to_string(i, use_alt=True).strip()
                 description = ''
                 pubdate = strftime('%Y')
                 summary = ''

I’m sure I must be misunderstanding how the elements of my webpage map to the structure of the NYT recipe.

Does anyone have any pointers? I’ve really been enjoying using Calibre to pull in RSS feeds and would love to expand my skills to non-RSS webpages as well.

Thanks!

kovidgoyal · 04-23-2017, 10:37 PM

That error means one of your findAll/find() calls is not finding anything.

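Looking over your recipe quickly, I see, for example, findAll(attrs={'name':'li'}). That searches for tags that have an HTML attribute called "name" with the value "li", not for <li> tags.

If you want to find <li> tags, you do: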

findAll('li')
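
For illustration, here is a minimal sketch of how that fix might fit into the recipe. It is not a tested recipe for this site. The helper make_article and the INDEX attribute are hypothetical names introduced here, and the sketch assumes the page layout described in the first post (each li holds an a link and an optional i author) and that the links are absolute URLs. The try/except around the import exists only so the helper can be run outside calibre. Note also that parse_index in the posted code has no return statement, so it returns None; calibre iterates over the result, which would likely raise the same TypeError.

```python
import re

try:
    from calibre.web.feeds.recipes import BasicNewsRecipe
except ImportError:
    # Outside calibre (for example when trying the helper alone), use a stub.
    BasicNewsRecipe = object


def make_article(title, url, author=''):
    """Build one article dict in the shape parse_index() is expected to return."""
    return {
        'title': title.strip(),
        'url': re.sub(r'\?.*', '', url),  # drop any query string, as in the original attempt
        'date': '',
        'description': author.strip(),    # assumption: show the author as the description
        'content': '',
    }


class Adoption(BasicNewsRecipe):
    title = 'Transracial Adoption/Interracial Adoption'
    INDEX = 'http://www.pactadopt.org/resources/transracial-adoption-interracial-adoption.html'

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        articles = []
        for li in soup.findAll('li'):      # tag name is the first argument, not part of attrs
            a = li.find('a', href=True)
            if a is None:                  # find() returns None when nothing matches
                continue
            i = li.find('i')
            author = self.tag_to_string(i) if i is not None else ''
            articles.append(make_article(self.tag_to_string(a), a['href'], author))
        # calibre iterates over this value, so it has to be a list, never None
        return [(self.title, articles)]
```

As written, findAll('li') will also pick up any navigation or footer list items on the page, so in practice the search would probably need to be narrowed to the ul that actually holds the articles.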

atsiong1 · 04-25-2017, 07:30 PM

Thank you!
