MobileRead Forums - View Single Post - Custom recipes (archive, read-only)

Starson17 · 05-19-2010, 02:25 PM

Quote:

Originally Posted by gambarini

i don't understand; can you give me an example?

Here's a standard usage. It may look complicated, but it's not that bad. A description is here.

Code:

    def parse_index(self):
        # Each entry is (section title, URL of that section's headline page).
        sections = [
            ('National', 'http://www.nzherald.co.nz/nz/news/headlines.cfm?c_id=1'),
            ('World', 'http://www.nzherald.co.nz/world/news/headlines.cfm?c_id=2'),
            ('Politics', 'http://www.nzherald.co.nz/politics/news/headlines.cfm?c_id=280'),
            ('Crime', 'http://www.nzherald.co.nz/crime/news/headlines.cfm?c_id=30'),
            ('Environment', 'http://www.nzherald.co.nz/environment/news/headlines.cfm?c_id=39'),
        ]
        feeds = []
        for title, url in sections:
            articles = self.nz_parse_section(url)
            # Skip sections that yielded no articles.
            if articles:
                feeds.append((title, articles))
        return feeds

    def nz_parse_section(self, url):
        soup = self.index_to_soup(url)
        div = soup.find(attrs={'class': 'col-300 categoryList'})
        if div is None:
            # Page layout not as expected; treat the section as empty.
            return []
        date = div.find(attrs={'class': 'link-list-heading'})
        if date is None:
            return []

        current_articles = []
        # Collect links that follow the first heading, stopping at the next heading.
        for tag in date.findAllNext(attrs={'class': ['linkList', 'link-list-heading']}):
            if tag.get('class') == 'link-list-heading':
                break
            for li in tag.findAll('li'):
                a = li.find('a', href=True)
                if a is None:
                    continue
                title = self.tag_to_string(a)
                url = a.get('href', False)
                if not url or not title:
                    continue
                # Turn site-relative links into absolute URLs.
                if url.startswith('/'):
                    url = 'http://www.nzherald.co.nz' + url
                self.log('\t\tFound article:', title)
                self.log('\t\t\t', url)
                current_articles.append({'title': title, 'url': url, 'description': '', 'date': ''})

        return current_articles
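
The startswith('/') check above only handles site-relative links. A possible alternative, sketched here on the assumption that the standard library's urljoin is acceptable in your recipe, also copes with links relative to the current page and leaves absolute links alone. The helper name absolute_url and the article paths are made up for illustration. This is written for Python 3; in Python 2 the same function lives in the urlparse module.

```python
from urllib.parse import urljoin

# The section page the links were found on (taken from the recipe above).
BASE = 'http://www.nzherald.co.nz/nz/news/headlines.cfm?c_id=1'


def absolute_url(page_url, href):
    """Resolve href against the page it was found on (hypothetical helper)."""
    return urljoin(page_url, href)


# Hypothetical article links, for illustration only.
site_relative = absolute_url(BASE, '/nz/news/article.cfm?id=5')
page_relative = absolute_url(BASE, 'article.cfm?id=5')
already_absolute = absolute_url(BASE, 'http://example.com/a')
```

Inside nz_parse_section you would pass the section url as page_url, in place of the hard-coded prefix.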

Basically, you use the parse_index method when you want to control the title, description and/or date of each article yourself, and already know the URL. A common use is when there is no RSS feed to parse automatically, and you have to parse a web page to get the article URLs. However, I've never actually used it for that. Instead, I use it when I can figure out the URLs in advance, because it's simple and there is no index page or RSS feed to parse. (I believe I used it in several comics recipes to pull the previous comics.) Those recipes should be in this thread somewhere under my name.
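
Here's a rough sketch of the "figure out the URL in advance" case, not one of my actual recipes. It builds the same list of (feed title, list of article dicts) that parse_index returns, one entry per day counting backwards. The URL pattern, the function name build_comic_feeds and the comic title are all made up for illustration; a real site will use its own scheme.

```python
from datetime import date, timedelta

# Hypothetical URL scheme; replace with whatever the real site uses.
COMIC_BASE = 'http://comics.example.com/strip/'


def build_comic_feeds(feed_title, today, days=7):
    """Build parse_index-style data: [(feed_title, [article_dict, ...])]."""
    articles = []
    for offset in range(days):
        day = today - timedelta(days=offset)
        stamp = day.strftime('%Y-%m-%d')
        articles.append({
            'title': stamp,
            'url': COMIC_BASE + day.strftime('%Y/%m/%d'),
            'description': '',
            'date': stamp,
        })
    return [(feed_title, articles)]


feeds = build_comic_feeds('Example Comic', date(2010, 5, 19), days=3)
```

In a recipe, parse_index would just return the result of a helper like this (using today's date), so no index page or feed is fetched at all.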

Quote:

p.s.

EXCUSE FOR MY POOR ENGLISH!

I have less trouble understanding you than many native English speakers. I'm jealous that your English is so much better than my second language. I'm sure all the Italian speakers appreciate your efforts to build recipes for Italian web-sites. Keep up the good work!