Thread: maya recipe
Old 09-28-2010, 03:57 PM   #1
marbs
Join Date: Jul 2010
Device: nook
maya recipe

This one is hard, and the site is Hebrew only.

I want to do this step by step, so that I understand what I am doing.
The goal is a recipe for this page and similar pages.

If you go to the page, you will see a list of articles on the right-hand side. The actual link to each article is the second link in each pair. I have noticed that all the relevant links (and only those links) have an id="SubjectHref*" (the * stands for some number).
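That selection rule can be sketched with a plain regular expression. The HTML fragment below is invented to mimic the structure described above (pairs of links, where only the second one carries a SubjectHref id); it is not the real maya markup:

```python
import re

# Hypothetical fragment: each article appears as a pair of links, and only
# the second link of each pair has id="SubjectHref<number>".
html = '''
<a id="CompanyHref12" href="company.asp?id=12">Company name</a>
<a id="SubjectHref12" href="reports/r12.htm">Report title</a>
<a id="CompanyHref13" href="company.asp?id=13">Other company</a>
<a id="SubjectHref13" href="reports/r13.htm">Other report</a>
'''

# Keep only the hrefs of anchors whose id starts with "SubjectHref"
article_links = re.findall(r'<a id="SubjectHref\d+" href="([^"]+)"', html)
print(article_links)  # ['reports/r12.htm', 'reports/r13.htm']
```

In a recipe you would apply the same idea through the soup rather than raw regex on HTML, by passing a compiled pattern as the id attribute to findAll.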

The URLs I want to collect in stage one are 'http://maya.tase.co.il/' + [the href from the <a> tag with id SubjectHref*].
I then need to do the same on the next results page (see the paging links at the bottom of the page).
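The prefixing step can be done with the standard library's urljoin, which handles both relative hrefs and hrefs that start with '/'. The hrefs below are made up for illustration, and note that in the Python 2 calibre of that era the same function lives in the urlparse module:

```python
from urllib.parse import urljoin

base = 'http://maya.tase.co.il/'
# Hypothetical relative hrefs, as they might appear in the SubjectHref links
full_urls = [urljoin(base, href)
             for href in ['bursa/report.asp?id=123', '/bursa/report.asp?id=456']]
print(full_urls)
# ['http://maya.tase.co.il/bursa/report.asp?id=123',
#  'http://maya.tase.co.il/bursa/report.asp?id=456']
```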


This is the code I have so far, and I am a little lost now. It is built on the NZ Herald recipe. Can someone tell me if this is the right approach?


Spoiler:
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
import re

class MayaRecipe(BasicNewsRecipe):

    title       = 'maya recipe'
    __author__  = 'marbs'
    description = 'Daily news'
    timefmt     = ' [%d %b, %Y]'
    language    = 'he'   # calibre expects an ISO language code, not '_Hebrew'
    max_articles_per_feed = 30

    INDEX = 'http://maya.tase.co.il/'

#    no_stylesheets = True

    #TO GET ARTICLES IN SECTION
    def maya_parse_section(self, url):
        soup = self.index_to_soup(url)
        current_articles = []
        # only the article links carry id="SubjectHref<number>", so match
        # the id attribute against a regular expression instead of a literal
        for a in soup.findAll('a', href=True,
                              attrs={'id': re.compile(r'^SubjectHref')}):
            if len(current_articles) >= self.max_articles_per_feed:
                break
            title = self.tag_to_string(a)
            url = a['href']
            if not url or not title:
                continue
            if not url.startswith('http'):
                url = self.INDEX + url.lstrip('/')
            self.log('\t\tFound article:', title)
            self.log('\t\t\t', url)
            current_articles.append({'title': title, 'url': url,
                'description': '', 'date': ''})
        return current_articles

    # To GET SECTIONS
    def parse_index(self):
        feeds = []
        sections = [('example feed',
            'http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press=')]
        for title, url in sections:
            articles = self.maya_parse_section(url)
            if articles:
                feeds.append((title, articles))
        return feeds