Thread: maya recipe
Old 09-30-2010, 04:09 PM   #4
TonytheBookworm
Quote:
Originally Posted by marbs
The reason I didn't use regex to follow the link is that I haven't wrapped my head around it yet. I don't fully understand the concept. I tried running your code.
When I used the 1st one I got raw HTML from the feed page.
When I used the 2nd code I got "NameError: global name 're' is not defined".
I'll have to read it a bit more (after a good night's sleep).

I am going to work on it some more....
You have to import re. That NameError means the recipe calls re.search without the re module in scope.
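
If regex is the part that hasn't clicked yet, here is a minimal sketch of the one check the recipe needs, runnable outside calibre. The sample hrefs and the helper name is_real_link are made up for illustration; the real page may look different.

```python
import re  # without this line, re.search raises the NameError quoted above

# Hypothetical href values, only meant to resemble what the index page might contain
hrefs = ["javascript:openWin('x')", "bursa/report.asp?report_cd=123"]

def is_real_link(href):
    # re.search returns a match object if 'javascript' occurs anywhere
    # in href, and None otherwise
    return re.search('javascript', href) is None

# Keep only the hrefs that are not javascript pseudo-links
good = [h for h in hrefs if is_real_link(h)]
print(good)
```

So the pattern does nothing fancy here: it just asks whether the text 'javascript' appears in the href, and the recipe skips the link if it does.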

The second set of code works; I tested it.
It is up to you to clean it up, keep what you want, and get rid of what you don't.
As far as getting the link you wanted, here is what I did.

Code:
import re  # imported directly so re.search is always defined

from calibre.web.feeds.news import BasicNewsRecipe
class AlisonB(BasicNewsRecipe):
    title                 = 'blah'
    __author__            = 'Tonythebookworm'
    description           = 'blah'
    language              = 'en'
    publisher             = 'Tonythebookworm'
    category              = 'column'
    use_embedded_content  = False
    no_stylesheets        = True
    oldest_article        = 24
    remove_javascript     = True
    remove_empty_feeds    = True
    max_articles_per_feed = 10
    # Base URL prepended to the relative hrefs found on the index page
    INDEX                 = 'http://maya.tase.co.il/'
    

    def parse_index(self):
        feeds = []
        for title, url in [
            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
        ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        current_articles = []
        soup = self.index_to_soup(url)
        print('The soup is: %s' % soup)
        for item in soup.findAll('a', attrs={'class': 'A3'}):
            print('item is: %s' % item)
            href = item.get('href', '')
            # Skip anchors without an href and javascript: pseudo-links
            if not href or re.search('javascript', href):
                continue
            print('FOUND GOOD URL')
            article_url = self.INDEX + href
            print('url is: %s' % article_url)
            title = self.tag_to_string(item)
            print('title is: %s' % title)
            # Append only the links that passed the filter above
            current_articles.append({'title': title, 'url': article_url, 'description': '', 'date': ''})
        return current_articles
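
One thing to watch when you clean it up: gluing INDEX and the href together as plain strings only works if the href has no leading slash. A possible alternative, not what the recipe above does, is the standard library's urljoin. This is a sketch with made-up hrefs and a hypothetical helper named absolute; I have not checked which form the site actually uses.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

INDEX = 'http://maya.tase.co.il/'

def absolute(href):
    # urljoin resolves href against INDEX the way a browser would,
    # so a leading slash does not produce a doubled '//'
    return urljoin(INDEX, href)

# Hypothetical hrefs, with and without a leading slash
for href in ['bursa/report.asp?report_cd=1', '/bursa/report.asp?report_cd=1']:
    print(absolute(href))
```

Both sample hrefs resolve to the same absolute URL, whereas plain concatenation would give a double slash for the second one.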

Last edited by TonytheBookworm; 09-30-2010 at 04:14 PM. Reason: added code