09-30-2010, 06:55 PM   #8
TonytheBookworm
Quote:
Originally Posted by SteffenH
Yes, of course, needless links. What is a recipe wizard?
It depends on which links you want to remove. If they are links on the RSS page itself, for instance a Gallery link full of photos that you don't want in the fetch, you could do something along these lines.
Say you had this at http://blah.com/mypage.rss:

- Lovely article
- Another Beautiful article
- Gallery: Lovely photos of a dumpster
- More Stuff
- And More stuff
- Gallery: Dirt samples

You could use something like this in your recipe:

Code:
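def get_article_url(self, article):
    # Link of the current feed entry (may be missing).
    link = article.get('link')
    # Pass on only links that do not contain 'gallery'; for anything
    # else the method returns None.
    if link and 'gallery' not in link:
        return link
    return None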

The above checks each link in the RSS feed: a link that doesn't contain gallery is returned, and any other link is skipped. Note that the check is case-sensitive.
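To try the filter outside of calibre, here is a minimal sketch of the same idea as a plain function. The name keep_link, the sample entries, and their URLs are all made up for illustration, and lowercasing the link before the check is my own addition so that "Gallery" and "gallery" are both caught.

```python
# Stand-in for the feed entries a recipe would receive; the
# dictionaries and URLs below are invented for this example.
def keep_link(article):
    """Return the article link, or None if it points at a gallery."""
    link = article.get('link')
    # Lowercase first so the check ignores capitalisation.
    if link and 'gallery' not in link.lower():
        return link
    return None

entries = [
    {'link': 'http://blah.com/lovely-article'},
    {'link': 'http://blah.com/Gallery/dumpster-photos'},
    {'link': 'http://blah.com/more-stuff'},
    {},  # an entry with no link at all
]

# Keep only the entries for which keep_link returned a link.
kept = [url for url in map(keep_link, entries) if url is not None]
print(kept)
```

If you would rather match only an exact path segment, you would have to adapt the test to how your site builds its URLs.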

If the links are actually inside the articles themselves, for instance if you have

This little piggy went to the market. This little piggy stayed home. Gallery: this little piggy's home

then you could do something along these lines:

Code:
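import re  # goes at the top of the recipe

def preprocess_html(self, soup):
    # Collect the tags that may carry a "Gallery:" label; change the
    # tag list to suit your site.
    weblinks = soup.findAll(['head', 'h2'])
    for link in weblinks:
        # Remove the parent of any tag whose markup contains "Gallery:".
        if re.search('(Gallery)(:)', str(link)):
            link.parent.extract()
    return soup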

What the above does is find all head and h2 tags in the soup (you will have to change that to suit your needs). It then loops over each tag stored in weblinks and runs a regular expression search for Gallery: in that tag. When a tag matches, its parent element is removed from the soup. Finally it returns the soup.
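If you want to check what the regular expression will and won't catch before putting it in a recipe, a rough sketch like the following may help. The heading strings are invented and written out as plain text so it runs without calibre, and the looser pattern at the end is only a suggestion of mine, not something the recipe above uses.

```python
import re

# Hypothetical heading markup, kept as plain strings for testing.
headings = [
    '<h2>This little piggy went to the market</h2>',
    "<h2>Gallery: this little piggy's home</h2>",
    '<h2>gallery: lowercase label</h2>',
]

# Matches the same text as '(Gallery)(:)', just without the groups.
pattern = re.compile(r'Gallery:')

for tag in headings:
    # Shows which headings the recipe's search would flag.
    print(bool(pattern.search(tag)), tag)

# A looser variant (an assumption about what you might want): ignore
# case and allow whitespace before the colon.
loose = re.compile(r'gallery\s*:', re.IGNORECASE)
flagged = [tag for tag in headings if loose.search(tag)]
```

The original pattern is case-sensitive, so a lowercase "gallery:" label slips through unless you use something like the looser variant.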