Old 05-19-2010, 02:25 PM   #1950
Starson17
Wizard
Quote:
Originally Posted by gambarini View Post
i don't understand; can you give me an example?
Here's a standard usage. It may look complicated, but it's not that bad. A description is here.

Code:
    def parse_index(self):
        # Each entry is (section title, URL of that section's headline page).
        sections = [
            ('National', 'http://www.nzherald.co.nz/nz/news/headlines.cfm?c_id=1'),
            ('World', 'http://www.nzherald.co.nz/world/news/headlines.cfm?c_id=2'),
            ('Politics', 'http://www.nzherald.co.nz/politics/news/headlines.cfm?c_id=280'),
            ('Crime', 'http://www.nzherald.co.nz/crime/news/headlines.cfm?c_id=30'),
            ('Environment', 'http://www.nzherald.co.nz/environment/news/headlines.cfm?c_id=39'),
        ]
        feeds = []
        for title, url in sections:
            articles = self.nz_parse_section(url)
            # Skip sections that produced no articles.
            if articles:
                feeds.append((title, articles))
        return feeds

    def nz_parse_section(self, url):
        soup = self.index_to_soup(url)
        div = soup.find(attrs={'class': 'col-300 categoryList'})
        if div is None:
            # Page layout not as expected; nothing to collect.
            return []
        date = div.find(attrs={'class': 'link-list-heading'})
        if date is None:
            return []

        current_articles = []
        for tag in date.findAllNext(attrs={'class': ['linkList', 'link-list-heading']}):
            # Stop at the next heading, so only the links under the first heading are collected.
            if tag.get('class') == 'link-list-heading':
                break
            for li in tag.findAll('li'):
                a = li.find('a', href=True)
                if a is None:
                    continue
                title = self.tag_to_string(a)
                url = a.get('href', False)
                if not url or not title:
                    continue
                # Turn site-relative links into absolute URLs.
                if url.startswith('/'):
                    url = 'http://www.nzherald.co.nz' + url
                self.log('\t\tFound article:', title)
                self.log('\t\t\t', url)
                current_articles.append({'title': title, 'url': url, 'description': '', 'date': ''})

        return current_articles
Basically, you use the parse_index method when you want to control the title, description and/or date of each article yourself, and you already know its URL. A common use is when you can't parse an RSS feed automatically, and have to parse a web page to get the URLs. However, I've never actually used it for that. Instead, I use it when there is no page or RSS feed to parse but I can figure out the URLs in advance, because it's simple. (I believe I used it for several comics recipes to pull the previous comics.) Those recipes should be in this thread somewhere under my name.
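
To give a rough idea of the "figure out the URL in advance" case, here's a minimal sketch of building the same kind of list that parse_index returns (a list of (feed title, list of article dicts) tuples), with the URLs built from dates instead of scraped from a page. This is not taken from one of my recipes: the site, the URL pattern, and the names BASE_URL and build_comic_feeds are all made up for illustration.

```python
from datetime import date, timedelta

# Hypothetical URL pattern, used only for illustration (not a real comic site).
BASE_URL = 'http://comics.example.com/strip/%Y/%m/%d'


def build_comic_feeds(today, days=3):
    """Build data in the shape parse_index returns:
    a list of (feed title, list of article dicts) tuples."""
    articles = []
    for offset in range(days):
        # Step back one day at a time, newest first.
        day = today - timedelta(days=offset)
        articles.append({
            'title': day.strftime('Comic for %Y-%m-%d'),
            'url': day.strftime(BASE_URL),  # URL is computed, not scraped
            'description': '',
            'date': day.isoformat(),
        })
    return [('Comics', articles)]


feeds = build_comic_feeds(date(2010, 5, 19))
for feed_title, feed_articles in feeds:
    for article in feed_articles:
        print(feed_title, article['title'], article['url'])
```

In a recipe, parse_index would simply return something like build_comic_feeds(date.today()); no index page or RSS feed is fetched at all.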

Quote:
p.s.

EXCUSE FOR MY POOR ENGLISH!
I have less trouble understanding you than many native English speakers. I'm jealous that your English is so much better than my second language. I'm sure all the Italian speakers appreciate your efforts to build recipes for Italian web-sites. Keep up the good work!

Last edited by Starson17; 05-19-2010 at 02:27 PM.