Old 07-10-2010, 08:07 AM   #2289
rty
Quote:
Originally Posted by einstuerzende
rty,

I've been fumbling around with making a recipe for cn.wsj.com without an awful lot of success. If you have time and are taking any requests, I'd appreciate whatever help you could give. I'm trying to get the Traditional character edition, which I think means throwing "big5" in front of everything (ex: http://cn.wsj.com/big5/20100708/FRX003561.asp)
http://chinese.wsj.com/gb/rss01.xml
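
For the Traditional-character side of the request, here is a rough sketch of the "throw big5 in front of everything" idea as a plain URL rewrite. The helper name `to_big5_url` is made up for illustration, and swapping a leading `gb` path segment for `big5` is only an assumption based on the two URLs above; I haven't checked that every rewritten address actually exists on the site.

```python
from urllib.parse import urlsplit, urlunsplit


def to_big5_url(url):
    """Return url with 'big5' as its first path segment (hypothetical helper)."""
    parts = urlsplit(url)
    segments = parts.path.lstrip('/').split('/')
    if segments[0] == 'big5':
        return url  # already points at the Traditional edition
    if segments[0] == 'gb':
        # Assumption: 'gb' marks the Simplified edition and can be swapped out.
        segments[0] = 'big5'
    else:
        segments.insert(0, 'big5')
    return urlunsplit(parts._replace(path='/' + '/'.join(segments)))


feed_url = to_big5_url('http://chinese.wsj.com/gb/rss01.xml')
article_url = to_big5_url('http://cn.wsj.com/20100708/FRX003561.asp')
```

In a recipe, a rewrite like this could be called from the `print_version` hook so that each article URL is changed before download, but that part is untested here.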

Would you like to take a look at the recipe code below? It pulls all the correct articles, but for some reason 'remove_tags_after' doesn't work on this particular site. Basically, you want to remove everything after the div with id='toolbar_tb'.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1278740771(BasicNewsRecipe):
    title          = u'WSJ 华尔街日报'
    __author__ = 'x'
    oldest_article = 14
    max_articles_per_feed = 100
    timefmt = ' [%Y %b %d]'
    feeds          = [
        #(u'要闻', u'http://chinese.wsj.com/gb/rss01.xml'),
        #(u'特写', u'http://chinese.wsj.com/gb/rss02.xml'),
        (u'国际财经', u'http://chinese.wsj.com/gb/rssglobal.xml'),
        #(u'能源与汽车', u'http://chinese.wsj.com/gb/rssautoene.xml')
    ]
    language = 'zh-cn'
    publisher  = 'Dow Jones & Company, Inc.'
    description           = 'Wall Street Journal - Chinese edition'
    category              = 'News, Business'
    remove_javascript = True
    use_embedded_content   = False
    no_stylesheets = True
    encoding               = 'GB2312'
    #conversion_options = {'linearize_tables':True} 


    extra_css = '''
             @font-face { font-family: "DroidFont"; src: url(res:///system/fonts/DroidSansFallback.ttf); }
             body { 
                  margin-right: 8pt; 
                  font-family: 'DroidFont', serif;}
             .left_content {font-family: 'DroidFont', serif, sans-serif}
            '''
 
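    # Meant to drop everything after <div id="toolbar_tb">; this is the setting that has no effect on this site.
    remove_tags_after = [dict(name='div', attrs={'id':'toolbar_tb'})]
    # Keep only the headline and the article body.
    keep_only_tags = [dict(name='div', attrs={'id':['headline','bodytext']})]
    # Strip navigation, toolbars, and sponsor blocks.
    remove_tags = [
        dict(name='div', attrs={'id':['tabdiv','toolbar_tt','toolbar_tb','bottom1','sponsor','nav','column2']}),
    ]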

    def preprocess_html(self, soup):
        # Strip inline style and width attributes so the reader's own layout applies.
        for item in soup.findAll(style=True):
            del item['style']
        for item in soup.findAll(width=True):
            del item['width']
        return soup
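
To make the goal concrete, here is a minimal standalone sketch of what "remove everything after the div with id='toolbar_tb'" should do to a page. It is not calibre's code: it uses the standard library's ElementTree on a small made-up page, and the function name `remove_after` and the sample markup are assumptions for illustration only, not the real cn.wsj.com layout.

```python
import xml.etree.ElementTree as ET


def remove_after(root, tag, elem_id):
    """Drop everything that follows the first <tag id=elem_id> in document order.

    The marker element itself is kept. Returns False if no marker is found.
    """
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    target = next((e for e in root.iter(tag) if e.get('id') == elem_id), None)
    if target is None:
        return False
    node = target
    while node in parent_of:
        parent = parent_of[node]
        node.tail = None  # text directly after this node goes too
        siblings = list(parent)
        for later in siblings[siblings.index(node) + 1:]:
            parent.remove(later)
        node = parent  # climb one level and trim there as well
    return True


# Made-up page, only loosely modelled on the ids used in the recipe.
page = ET.fromstring(
    '<body>'
    '<div id="headline">Title</div>'
    '<div id="bodytext"><p>Story</p>'
    '<div id="toolbar_tb">tools</div><p>Related links</p></div>'
    '<div id="sponsor">Ad</div>'
    '</body>'
)
remove_after(page, 'div', 'toolbar_tb')
print(ET.tostring(page, encoding='unicode'))
```

The walk trims the later siblings of the marker and then the later siblings of each of its ancestors, so content after the toolbar is removed even when it sits in a different enclosing div.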