this one is hard, and it is Hebrew only.
i want to do this step by step, so i understand what i am doing.
i want to create a recipe for
this page and similar pages.
if you go in to the page, you will see a list of articles (right hand side of the page). the actual link to the article is the 2nd link in each pair. i have recognized that that all the relevant links (and only them) have an id="SubjectHref*" (the * represents some numbers).
the URLs i want to get in stage one is 'http://maya.tase.co.il/' + [the href from tag <a> with id SubjectHref*].
i have then need to do the same in the next page (see the bottom of the page)
this is the code i have so far and i am a little lost now. its built on the NZ herald recipe. can some one tell me if this is the right way?
Spoiler:
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
class NewZealandHerald(BasicNewsRecipe):
title = 'maya recipe'
__author__ = 'marbs'
description = 'Daily news'
timefmt = ' [%d %b, %Y]'
language = '_Hebrew'
# no_stylesheets = True
# remove_tags_before = dict(name='div', attrs={'class':'contentContainer left eight'})
# remove_tags_after = dict(name='div', attrs={'class':'callToAction'})
# remove_tags = [
# dict(name='iframe'),
# dict(name='div', attrs={'class':['sectionHeader', 'tools','callToAction', 'contentContainer right two nopad relatedColumn']}),
#dict(name='div', attrs={'id':['shareContainer']}),
#dict(name='form', attrs={'onsubmit':"return verifySearch(this.w,'Keyword, citation, or #author')"}),
#dict(name='table', attrs={'cellspacing':'0'}),
# ]
# def preprocess_html(self, soup):
# table = soup.find('table')
# if table is not None:
# table.extract()
# return soup
#TO GET ARTICLES IN SECTION
def maya_parse_section(self, url):
soup = self.index_to_soup(url)
div = soup.find(attrs={'id':('SubjectHref'+"*")})
date = div.find(attrs={'href'})
current_articles = []
for x in date.findAllNext(attrs = {'id': ('SubjectHref'+"*")}):
if x = 30: break
for li in x.findAll('li'):
a = li.find('a', href=True)
if a is None:
continue
title = self.tag_to_string(a)
url = a.get('href', False)
if not url or not title:
continue
if url.startswith('/'):
url = 'http://www.nzherald.co.nz'+url
self.log('\t\tFound article:', title)
self.log('\t\t\t', url)
current_articles.append({'title': title, 'url':url,
'description':'', 'date':''})
return current_articles
# To GET SECTIONS
def parse_index(self):
feeds = ['example feed', 'http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press=']
articles = self.maya_parse_section(url)
if articles:
feeds.append((title, articles))
return feeds