MobileRead Forums - View Single Post

TonytheBookworm · 09-30-2010, 05:09 PM

Quote:

Originally Posted by marbs

The reason I didn't use regex to follow the link is that I haven't wrapped my head around it yet; I don't fully understand the concept. I tried running your code.
When I used the 1st one I got raw HTML from the feed page.
When I used the 2nd code I got "NameError: global name 're' is not defined".
I'll have to read it a bit more (after a good night's sleep).

I am going to work on it some more...

You have to import re; that is what the NameError is telling you.

The second set of code works. I tested it.
It is up to you to clean it up, keep what you want, and get rid of what you don't.
But as far as getting the link you wanted, here is what I did.

Code:

import re  # needed for re.search below; without it you get the NameError

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
class AlisonB(BasicNewsRecipe):
    title                = 'blah'
    __author__           = 'Tonythebookworm'
    description          = 'blah'
    language             = 'en'
    publisher            = 'Tonythebookworm'
    category             = 'column'
    use_embedded_content = False
    no_stylesheets       = True
    oldest_article       = 24
    remove_javascript    = True
    remove_empty_feeds   = True
    
    max_articles_per_feed = 10
    # Base URL that the relative hrefs from the index page get appended to
    INDEX = 'http://maya.tase.co.il/'
    

    def parse_index(self):
        feeds = []
        for title, url in [
            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
        ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        current_articles = []
        soup = self.index_to_soup(url)
        # Debug output; self.log is the recipe's logger
        self.log('The soup is: ', soup)
        for item in soup.findAll('a', attrs={'class': 'A3'}):
            self.log('item is: ', item)
            # Skip anchors whose href is a javascript pseudo-link
            if re.search('javascript', item['href']):
                continue
            self.log('FOUND GOOD URL')
            article_url = self.INDEX + item['href']
            self.log('url is: ', article_url)
            title = self.tag_to_string(item)
            self.log('title is: ', title)
            # Append only the links that passed the filter above
            current_articles.append({'title': title, 'url': article_url, 'description': '', 'date': ''})
        return current_articles
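
If the regex part is what is hard to follow, here is the same filtering idea pulled out of the recipe so you can run it in a plain Python prompt without calibre. This is only a rough sketch: filter_links and the sample hrefs are made up for illustration, and I am assuming the real page gives relative hrefs that can simply be stuck onto the base URL.

```python
import re

# Base URL from the recipe above.
INDEX = 'http://maya.tase.co.il/'


def filter_links(hrefs, base=INDEX):
    """Return base + href for every href that is not a javascript link.

    Hypothetical helper for illustration; it is not part of calibre.
    """
    good = []
    for href in hrefs:
        # re.search scans the whole string and returns None when the
        # pattern is not found, so a match means "this is a javascript link".
        if re.search('javascript', href):
            continue
        good.append(base + href)
    return good


# Made-up hrefs, just to exercise the filter.
sample = [
    'bursa/report.asp?id=1',
    'javascript:void(0)',
    'bursa/report.asp?id=2',
]
links = filter_links(sample)
print(links)
```

The javascript entry gets dropped and the other two come back with the base URL in front. Once that makes sense, the loop in make_links is the same thing with soup tags instead of plain strings.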