Quote:
Originally Posted by Starson17
Sorry, but I can't quite follow your question. Are you saying you can't reference tags by "id" or "href," etc.?
I've never run into the trailing slashes inside opening tags like you've posted, so I have no first hand experience. I would still expect normal referencing to work, but if it doesn't, you have various options. You can try search and replace to remove them with preprocess_regexps. You could remove just the slashes, or modify the whole tag with S&R, or use pre or postprocess_html and Beautiful Soup to identify the tag and extract or modify it. It's possible the slashes are confusing Beautiful Soup, so printing the results (see code in my post above on how to do this) might help you figure out what the recipe is seeing and where it's being confused.
More info would be needed to advise further.
Thanks for your help, and sorry for my confusing explanation.
The following is part of the page source from which I am trying to get the feed.
Code:
<div id="rightContainer" />
<span id="list" />
<ul><li><a href="/Health_Report_1.html" target="_blank">[ <font color=#E43026>Health Report</font> ] </a> <a href="/lrc/201008/se-health-cancer-developing-world-25aug10.lrc" target=_blank><img src=/images/lrc.gif border=0></a> <a href="/VOA_Special_English/Experts-Urge-More-Efforts-to-Fight-Cancer-in-Poor-Countries-38652_1.html" target="_blank"><img src=/images/yi.gif border=0></a> <a href="/VOA_Special_English/Experts-Urge-More-Efforts-to-Fight-Cancer-in-Poor-Countries-38652.html" target="_blank">Experts Urge More Efforts to Fight Cancer in Poor Countries (2010-8-25)</a></li></ul>
</span>
</div>
My recipe is
Code:
import re
from calibre.web.feeds.news import BasicNewsRecipe


class VOA(BasicNewsRecipe):
    title = 'VOA News'
    __author__ = 'voa'
    description = 'VOA through 51'
    language = 'en'
    remove_javascript = True
    remove_tags_before = dict(id=['rightContainer'])
    remove_tags_after = dict(id=['listads'])
    remove_tags = [
        dict(id=['contentAds']), dict(id=['playbar']), dict(id=['menubar']),
    ]
    no_stylesheets = True
    extra_css = '''
    '''

    def parse_index(self):
        soup = self.index_to_soup('http://www.51voa.com/')
        feeds = []
        section = []
        title = None
        #for x in soup.find(id='list').findAll('a'):
        for x in soup.find(id='rightContainer').findAll('a'):
            # Keep only links that point at article pages.
            if '/VOA_Special_English/' in x['href'] or '/VOA_Standard_English/' in x['href']:
                article = {
                    # href already starts with '/', so the base has no trailing slash
                    'url': 'http://www.51voa.com' + x['href'],
                    'title': self.tag_to_string(x),
                    'date': '',
                    'description': '',
                }
                section.append(article)
        feeds.append(('Newest', section))
        return feeds
I use this recipe to fetch the feed from the page source above, but I get no links. Could you give an example of how to use "regexps" to deal with the weird code here, and in case
tag comes in. Thanks a lot for your teaching.
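For reference, here is a minimal sketch of the "remove just the slashes" idea from the quoted reply, written in plain Python so it can be tried outside calibre. The helper name fix_self_closed_containers and the pattern are hypothetical, not part of calibre, and the sketch assumes that only the div and span opening tags carry the stray trailing slash, as in the snippet posted above.

```python
import re

# Hypothetical pattern (an assumption, not from calibre): matches an opening
# <div ...> or <span ...> tag that was written with a trailing slash,
# e.g. <div id="rightContainer" />.
BOGUS_SELF_CLOSE = re.compile(r'<(div|span)\b([^<>]*?)\s*/>', re.IGNORECASE)


def fix_self_closed_containers(raw_html):
    """Rewrite '<div ... />' and '<span ... />' as ordinary opening tags."""
    # group(1) is the tag name, group(2) is the attribute text before the slash.
    return BOGUS_SELF_CLOSE.sub(
        lambda m: '<%s%s>' % (m.group(1), m.group(2)), raw_html)


# Shortened version of the markup posted above.
sample = (
    '<div id="rightContainer" />\n'
    '<span id="list" />\n'
    '<ul><li><a href="/Health_Report_1.html">Health Report</a></li></ul>\n'
    '</span>\n'
    '</div>'
)
fixed = fix_self_closed_containers(sample)
print(fixed)
```

Inside a recipe, the same compiled pattern and replacement function could probably go into preprocess_regexps as a (pattern, function) pair, but as far as I understand that list is applied to downloaded article pages, not to the soup built in parse_index. For the index page, one option might be to fetch the raw HTML first, run it through the helper, and pass the cleaned string to index_to_soup. Whether index_to_soup accepts raw HTML this way should be checked against the calibre version in use.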