Old 10-16-2010, 06:45 PM   #3
TonytheBookworm
Quote:
Originally Posted by Starson17 View Post
You don't need to use obfuscated_article as far as I can tell.
Try this to start:
Spoiler:

Code:
class Star_Malaysia(BasicNewsRecipe):
    title          = u'The Star Malaysia'
    __author__          = 'Starson17'
    oldest_article = 20
    max_articles_per_feed = 10
    keep_only_tags     = [dict(name='div', attrs={'id':'story_main'})]

    remove_tags_after = dict(name='div', attrs={'id':'story_content'})

    feeds          = [(u'Nation News', u'http://thestar.com.my/rss/nation.xml'), (u'Business News', u'http://thestar.com.my/rss/business.xml'), (u'Technology News', u'http://thestar.com.my/rss/technology.xml'), (u'World Updates', u'http://thestar.com.my/rss/worldupdates.xml'), (u'Sports News', u'http://thestar.com.my/rss/sports.xml'), (u'Columnists', u'http://thestar.com.my/rss/columnists.xml'), (u'Opinions', u'http://thestar.com.my/rss/opinion.xml')]
I think he/she is trying to do what I have done in the past: use obfuscation to pull only the printer-friendly version (basically a shortcut).

@pip: the reason your code wasn't working is that the regular expression was not escaped correctly.
Here is working code using obfuscation.
Spoiler:
Code:
#!/usr/bin/env  python
__license__     = 'GPL v3'
__author__      = 'Tony Stegall'
__copyright__   = '2010, Tony Stegall or Tonythebookworm on mobileread.com'
__version__     = 'v1.01'
__date__        = '16, October 2010'
__description__ = 'The Star.com'

'''
thestar.com
'''

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile

class TheStar(BasicNewsRecipe):
    __author__    = 'Tony Stegall'
    description   = 'The Star'
    masthead_url     = 'http://thestar.com.my/images/common/logo_tsolv12.gif'
    

    title          = 'TheStar.com'
    publisher      = 'TheStar.com'
    category       = 'News'

    language       = 'en'
    timefmt        = '[%a, %d %b, %Y]'

    oldest_article        = 10
    max_articles_per_feed = 100
    use_embedded_content  = False
    no_stylesheets = True

    remove_javascript     = True
    
    '''
    I use get_obfuscated_article so I can simply search for the printer-friendly link with a regular expression.
    Otherwise I could use def print_version, but then I'm stuck with having to split urls and piece them back together.
    '''

    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        self.log('THE CURRENT URL IS: ', url)
        br.open(url)
        '''
             We need a try/except block here: it attempts an operation and, if that
             fails, catches the error and handles it instead of crashing.
             In our case we try to follow the /services/printerfriendly.asp link; if we
             can't, we simply fetch the original calling url instead.
        '''
        
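        try:
            # Follow the first link whose URL matches the printer-friendly page.
            response = br.follow_link(url_regex='.*?(\\/services\\/printerfriendly\\.asp)', nr = 0)
            html = response.read()
        except Exception:
            # No matching link (or the request failed): fall back to the original page.
            response = br.open(url)
            html = response.read()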
         
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

    ######################################################################################################################

    feeds          = [(u'Nation News', u'http://thestar.com.my/rss/nation.xml'),
                      (u'Business News', u'http://thestar.com.my/rss/business.xml'),
                      (u'Technology News', u'http://thestar.com.my/rss/technology.xml'),
                      (u'World Updates', u'http://thestar.com.my/rss/worldupdates.xml'), 
                      (u'Sports News', u'http://thestar.com.my/rss/sports.xml'), 
                      (u'Columnists', u'http://thestar.com.my/rss/columnists.xml'), 
                      (u'Opinions', u'http://thestar.com.my/rss/opinion.xml')
                      ]
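On the escaping point, here is a small stand-alone sketch (plain Python, no calibre needed) that shows why the dot in the pattern should be escaped. The hrefs are made-up examples, not real links from the site, and UNESCAPED is only an illustration of a loose pattern; I am not claiming it is the exact pattern that failed.

```python
import re

# The pattern string exactly as it appears in the recipe (doubled backslashes).
RECIPE_PATTERN = '.*?(\\/services\\/printerfriendly\\.asp)'

# The same idea written as a raw string, which is easier to read.
PRINT_LINK = re.compile(r'.*?(/services/printerfriendly\.asp)')

# Illustration only: with an unescaped dot, '.' matches ANY character.
UNESCAPED = '.*?(/services/printerfriendly.asp)'


def is_print_link(href):
    """Return True if href looks like a printer-friendly link."""
    return PRINT_LINK.search(href) is not None


# Hypothetical hrefs, used only to exercise the patterns.
print(is_print_link('/services/printerfriendly.asp?file=/2010/10/16/nation/example.asp'))
print(is_print_link('/news/story.asp?file=/2010/10/16/nation/example.asp'))
print(re.search(UNESCAPED, '/services/printerfriendlyXasp') is not None)
```

The raw-string form and the doubled-backslash form describe the same regular expression, so either can be passed as url_regex.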