Thread: maya recipe
View Single Post
Old 09-28-2010, 04:23 PM   #2
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
TonytheBookworm's Avatar
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
On first look at that thing, why not do something in the area of this:
You said you wanted the second link
for example: http://maya.tase.co.il/bursa/report....port_cd=570152

it always has report_cd in it so why not just follow it with a regex match ?
Spoiler:

Code:
    from calibre.ptempfile import PersistentTemporaryFile

    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open(url)
        response = br.follow_link(url_regex='?report_cd', nr = 0)
        html = response.read()
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name


or maybe use something like this:
Spoiler:

Code:
    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            
            if not re.search('javascript', item['href']):
              print 'FOUND GOOD URL'
              url = self.INDEX + item['href']
              print 'url is: ', url
              title       = self.tag_to_string(item)
              print 'title is: ', title
            current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles

Last edited by TonytheBookworm; 09-28-2010 at 05:25 PM. Reason: edited code
TonytheBookworm is offline   Reply With Quote