Old 06-03-2010, 05:34 AM   #2032
DoctorOhh
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by square4761

remove_tags = [
dict(name=['table', 'iframe', 'embed', 'object'])
]

remove_tags_after = dict(name='div', attrs={'class':'article_body'})


feeds = [(u'http://rss.townhall.com/blogs/main'),
(u'http://rss.townhall.com/columnists/all')
]

def print_version(self, url):
    return url + '?page=full'
First, it is bad etiquette, not to mention just plain wrong, to publish someone else's name and email on the web. Please take a minute to edit the above post and remove them.

Second, I looked in my working area and found a recipe I had just about completed for the columnists. The blogs eluded me because the site uses JavaScript to render the blog entries. If you replace the above with the code below, you will be in the ballpark for the columnists feed.

I lost interest in it, so when you manage to get it working, take credit and submit it for others to use. I attached the site's favicon, which you can add to the zip file when you upload it here.

Good Luck.

Code:
    keep_only_tags = [
      dict(name='div', attrs={'class':'authorblock'}),
      dict(name='div', attrs={'id':'columnBody'})
    ]

    remove_tags_after = dict(name='div', attrs={'id':'columnBody'})

    remove_tags = [
        dict(name=['iframe', 'img', 'embed', 'object', 'center', 'script', 'form']),
        dict(name='div', attrs={'id':[
            'ShareText', 'Externa', 'Toolbox',
            'ctl00_cphMain_cbComments_dlComments_ctl01_ctl00_Content',
            'ArticleContainer', 'shirttail', 'comments_container',
            'ctl00_cphMain_cbComments_dvReadAll', 'footer'
        ]})
    ]

    feeds = [(u'TownHall Columnists', u'http://rss.townhall.com/columnists/all')]

    def print_version(self, url):
        return url + '&page=full'
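
For anyone picking this up: the fragment above is only the body of a recipe class, so here is a rough sketch of how it might sit inside a complete recipe. The class name and title are placeholders I made up, and the try/except stand-in is only there so the file can be checked outside calibre. I also swapped in a print_version that picks '?' or '&' depending on whether the feed URL already carries a query string, since the quoted post uses one and the code above uses the other. That is my assumption about what is wanted, not something tested against the live site.

```python
from urllib.parse import urlsplit

try:
    # Available only when the recipe is run by calibre itself.
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in base class so the file can be imported outside calibre.
    class BasicNewsRecipe(object):
        pass


class TownHallColumnists(BasicNewsRecipe):  # hypothetical class name
    title = u'TownHall Columnists'  # placeholder title

    # Keep only the author block and the column text.
    keep_only_tags = [
        dict(name='div', attrs={'class': 'authorblock'}),
        dict(name='div', attrs={'id': 'columnBody'}),
    ]

    # Drop everything that follows the column text.
    remove_tags_after = dict(name='div', attrs={'id': 'columnBody'})

    # Strip embedded media, scripts, forms, and the listed divs.
    remove_tags = [
        dict(name=['iframe', 'img', 'embed', 'object', 'center', 'script', 'form']),
        dict(name='div', attrs={'id': [
            'ShareText', 'Externa', 'Toolbox',
            'ctl00_cphMain_cbComments_dlComments_ctl01_ctl00_Content',
            'ArticleContainer', 'shirttail', 'comments_container',
            'ctl00_cphMain_cbComments_dvReadAll', 'footer',
        ]}),
    ]

    feeds = [(u'TownHall Columnists', u'http://rss.townhall.com/columnists/all')]

    def print_version(self, url):
        # Assumption: append with '&' if the URL already has a query
        # string, otherwise start one with '?'.
        separator = '&' if urlsplit(url).query else '?'
        return url + separator + 'page=full'
```

If the feed links always include a query string, this behaves the same as the '&page=full' version above. Running the saved recipe through ebook-convert with the --test option should show fairly quickly whether the full-page URLs resolve.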