Quote:
Originally Posted by Starson17
You don't need to use obfuscated_article as far as I can tell.
Try this to start:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class Star_Malaysia(BasicNewsRecipe):
    title = u'The Star Malaysia'
    __author__ = 'Starson17'
    oldest_article = 20
    max_articles_per_feed = 10
    # Keep only the main story container and drop everything after the story body.
    keep_only_tags = [dict(name='div', attrs={'id':'story_main'})]
    remove_tags_after = dict(name='div', attrs={'id':'story_content'})
    feeds = [(u'Nation News', u'http://thestar.com.my/rss/nation.xml'), (u'Business News', u'http://thestar.com.my/rss/business.xml'),
             (u'Technology News', u'http://thestar.com.my/rss/technology.xml'), (u'World Updates', u'http://thestar.com.my/rss/worldupdates.xml'),
             (u'Sports News', u'http://thestar.com.my/rss/sports.xml'), (u'Columnists', u'http://thestar.com.my/rss/columnists.xml'),
             (u'Opinions', u'http://thestar.com.my/rss/opinion.xml')]
I think he/she is trying to do what I have done in the past: using obfuscation to pull only the printer-friendly version (basically short-cutting it).
@pip: the reason your code wasn't working is that the regular expression was not escaped correctly.
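For anyone following along, here is a small standalone sketch of what escaping changes in a regular expression. The two URLs are made up for illustration (they are not real thestar.com.my links), and this is not necessarily the exact problem in pip's pattern, just the general idea: an unescaped '.' matches any character, while an escaped one matches only a literal dot.

```python
import re

# Hypothetical URLs, invented for this example only.
printer_url = 'http://example.com/services/printerfriendly.asp?file=story1'
lookalike_url = 'http://example.com/services/printerfriendlyXasp?file=story1'

# Unescaped dot: '.' matches any single character.
unescaped = re.compile(r'.*?(/services/printerfriendly.asp)')
# Escaped dot: '\.' matches only a literal '.'.
escaped = re.compile(r'.*?(/services/printerfriendly\.asp)')

unescaped_hits = [bool(unescaped.search(u)) for u in (printer_url, lookalike_url)]
escaped_hits = [bool(escaped.search(u)) for u in (printer_url, lookalike_url)]

print(unescaped_hits)  # the unescaped pattern accepts both URLs
print(escaped_hits)    # the escaped pattern accepts only the real .asp link

# In a normal (non-raw) string the backslash itself must be doubled,
# so these two spellings are the same two-character pattern.
same_pattern = ('\\.' == r'\.')
print(same_pattern)
```

Using a raw string (the r'...' form) usually makes this easier to get right, since you only write each backslash once.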
Here is working code using obfuscation.
Code:
#!/usr/bin/env python
__license__ = 'GPL v3'
__author__ = 'Tony Stegall'
__copyright__ = '2010, Tony Stegall or Tonythebookworm on mobileread.com'
__version__ = 'v1.01'
__date__ = '16, October 2010'
__description__ = 'The Star.com'
'''
thestar.com
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile


class LaWeekly(BasicNewsRecipe):
    __author__ = 'Tony Stegall'
    description = 'The Star'
    masthead_url = 'http://thestar.com.my/images/common/logo_tsolv12.gif'
    title = 'TheStar.com'
    publisher = 'TheStar.com'
    category = 'News'
    language = 'en'
    timefmt = '[%a, %d %b, %Y]'
    oldest_article = 10
    max_articles_per_feed = 100
    use_embedded_content = False
    no_stylesheets = True
    remove_javascript = True

    # I use get_obfuscated_article simply so I can regex-search the page for
    # the printer-friendly link. I could use print_version instead, but then
    # I'm stuck splitting URLs and piecing them back together.
    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        self.log('THE CURRENT URL IS: ', url)
        br.open(url)
        # We need a try/except block: it tries an operation and, if that
        # fails, catches the error instead of crashing and does something
        # with it. In our case we check whether we can follow
        # /services/printerfriendly.asp; if we can't, we simply fall back
        # to the original calling URL.
        try:
            response = br.follow_link(url_regex='.*?(\\/services\\/printerfriendly\\.asp)', nr=0)
            html = response.read()
        except Exception:
            response = br.open(url)
            html = response.read()
        # Save the page to a temporary file and hand its path back to calibre.
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

    ##########################################################################
    feeds = [(u'Nation News', u'http://thestar.com.my/rss/nation.xml'),
             (u'Business News', u'http://thestar.com.my/rss/business.xml'),
             (u'Technology News', u'http://thestar.com.my/rss/technology.xml'),
             (u'World Updates', u'http://thestar.com.my/rss/worldupdates.xml'),
             (u'Sports News', u'http://thestar.com.my/rss/sports.xml'),
             (u'Columnists', u'http://thestar.com.my/rss/columnists.xml'),
             (u'Opinions', u'http://thestar.com.my/rss/opinion.xml')
             ]
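If you want to play with the try/except fallback and the temp-file step outside of calibre, here is a rough standalone sketch of the same flow. Everything in it is a stand-in I made up for the example: read_with_fallback, save_html, no_printer_link and original_page are hypothetical helpers, and the standard library's tempfile is used in place of calibre's PersistentTemporaryFile, so treat it as an illustration of the idea rather than recipe code.

```python
import os
import tempfile


def read_with_fallback(primary, fallback):
    """Return primary(); if it raises, return fallback() instead."""
    try:
        return primary()
    except Exception:
        return fallback()


def save_html(html):
    """Write bytes to a temp file that survives close, and return its path."""
    # delete=False keeps the file on disk so the caller can open it by name.
    with tempfile.NamedTemporaryFile(suffix='_fa.html', delete=False) as f:
        f.write(html)
    return f.name


def no_printer_link():
    # Hypothetical stand-in for a follow_link call that finds no match.
    raise LookupError('no printer-friendly link on this page')


def original_page():
    # Hypothetical stand-in for re-opening the original article URL.
    return b'<html>original article</html>'


path = save_html(read_with_fallback(no_printer_link, original_page))
with open(path, 'rb') as f:
    saved = f.read()
os.remove(path)  # clean up the demo file

print(saved)
```

The point of the pattern is that a missing printer-friendly link never stops the download: the article still comes through, just in its original layout.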