08-25-2010, 11:47 AM   #2519
naisren
Quote:
Originally Posted by Starson17
Sorry, but I can't quite follow your question. Are you saying you can't reference tags by "id" or "href," etc.?

I've never run into the trailing slashes inside opening tags like you've posted, so I have no first hand experience. I would still expect normal referencing to work, but if it doesn't, you have various options. You can try search and replace to remove them with preprocess_regexps. You could remove just the slashes, or modify the whole tag with S&R, or use pre or postprocess_html and Beautiful Soup to identify the tag and extract or modify it. It's possible the slashes are confusing Beautiful Soup, so printing the results (see code in my post above on how to do this) might help you figure out what the recipe is seeing and where it's being confused.

More info would be needed to advise further.
Thanks for your help, and sorry that my question was unclear.

The following is part of the page source from which I am trying to build the feed.

Code:
<div id="rightContainer" />
<span id="list" />
<ul><li><a href="/Health_Report_1.html" target="_blank">[ <font color=#E43026>Health Report</font> ] </a> <a href="/lrc/201008/se-health-cancer-developing-world-25aug10.lrc" target=_blank><img src=/images/lrc.gif border=0></a> <a href="/VOA_Special_English/Experts-Urge-More-Efforts-to-Fight-Cancer-in-Poor-Countries-38652_1.html" target="_blank"><img src=/images/yi.gif border=0></a> <a href="/VOA_Special_English/Experts-Urge-More-Efforts-to-Fight-Cancer-in-Poor-Countries-38652.html" target="_blank">Experts Urge More Efforts to Fight Cancer in Poor Countries  (2010-8-25)</a></li></ul>
</span>
</div>
My recipe is:
Code:
import re
from calibre.web.feeds.news import BasicNewsRecipe

class VOA(BasicNewsRecipe):

    title      = 'VOA News'
    __author__ = 'voa'
    description = 'VOA through 51'
    language = 'en'
    remove_javascript = True

    remove_tags_before = dict(id=['rightContainer'])
    remove_tags_after  = dict(id=['listads'])
    remove_tags        = [
                          dict(id=['contentAds']), dict(id=['playbar']), dict(id=['menubar']), 
                         ]    
    no_stylesheets = True
    extra_css = '''
                '''


    def parse_index(self):
        soup = self.index_to_soup('http://www.51voa.com/')
        feeds = []
        section = []
        title = None

        #for x in soup.find(id='list').findAll('a'):
        # href=True skips anchors without an href attribute
        for x in soup.find(id='rightContainer').findAll('a', href=True):
            if '/VOA_Special_English/' in x['href'] or '/VOA_Standard_English/' in x['href']:
                article = {
                    # hrefs on this page already start with '/', so no trailing slash on the host
                    'url' : 'http://www.51voa.com' + x['href'],
                    'title' : self.tag_to_string(x),
                    'date': '',
                    'description': '',
                }
                section.append(article)

        feeds.append(('Newest', section))

        return feeds
I use this recipe to fetch the feed from the source above, but I get no links. Could you give an example of how to use preprocess_regexps to deal with the odd markup here, and also for the case where a
Code:
<br/>
tag appears? Thanks a lot for your help.
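For reference, here is a minimal sketch of the kind of regexp I am asking about, written as plain Python so it can be tried outside calibre. It assumes that only the div and span tags are wrongly closed with "/>" and that real empty tags such as <br/> should stay untouched. The names SELF_CLOSED_CONTAINER and open_self_closed_containers are made up for illustration and are not part of calibre.

```python
import re

# Assumption: only <div .../> and <span .../> are wrongly self-closed on this
# page; their real closing tags (</span>, </div>) appear further down.
SELF_CLOSED_CONTAINER = re.compile(r'<(div|span)\b([^<>]*?)\s*/>', re.IGNORECASE)


def open_self_closed_containers(html):
    """Rewrite <div ... /> as <div ...>; leave tags such as <br/> alone."""
    return SELF_CLOSED_CONTAINER.sub(r'<\1\2>', html)


# Shortened version of the page source quoted above, plus a <br/> for testing.
sample = (
    '<div id="rightContainer" />\n'
    '<span id="list" />\n'
    '<ul><li><a href="/Health_Report_1.html">Health Report</a><br/></li></ul>\n'
    '</span>\n'
    '</div>'
)

fixed = open_self_closed_containers(sample)
print(fixed)
```

In a recipe, the same pattern could presumably be listed in preprocess_regexps as a (compiled pattern, function) pair, for example (SELF_CLOSED_CONTAINER, lambda m: '<%s%s>' % (m.group(1), m.group(2))). As far as I understand, preprocess_regexps is applied to the downloaded article pages, so for parse_index the index HTML would probably have to be cleaned by hand before the soup is built; that part is an assumption I have not tested.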