So close yet so far... frustrated recipe

PipSqueak · 10-16-2010, 01:23 PM

Can anyone help me fix this recipe? I'm trying to fetch news from a local newspaper. I think I'm *almost* there, but I suck with the regex because I don't know programming. Thanks

P.S. I searched the forums and spent hours and hours on the recipe before posting here as a last resort.

Code:

class AdvancedUserRecipe1287215970(BasicNewsRecipe):
    title          = u'The Star Malaysia'
    oldest_article = 2
    max_articles_per_feed = 1

    feeds          = [(u'Nation News', u'http://thestar.com.my/rss/nation.xml'), (u'Business News', u'http://thestar.com.my/rss/business.xml'), (u'Technology News', u'http://thestar.com.my/rss/technology.xml'), (u'World Updates', u'http://thestar.com.my/rss/worldupdates.xml'), (u'Sports News', u'http://thestar.com.my/rss/sports.xml'), (u'Columnists', u'http://thestar.com.my/rss/columnists.xml'), (u'Opinions', u'http://thestar.com.my/rss/opinion.xml')]

    from calibre.ptempfile import PersistentTemporaryFile
    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open(url)

        response = br.follow_link(url_regex = r'/printerfriendly.asp?file=')
        html = response.read()

        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()

        return self.temp_files[-1].name

Starson17 · 10-16-2010, 03:30 PM

Quote:

Originally Posted by PipSqueak

Can anyone help me fix this recipe?

You don't need to use obfuscated_article as far as I can tell.
Try this to start:

Spoiler:

TonytheBookworm · 10-16-2010, 06:45 PM

Quote:

Originally Posted by Starson17

You don't need to use obfuscated_article as far as I can tell.
Try this to start:

Spoiler:

I think he/she is trying to do what I have done in the past with using obfuscation to pull only the printer friendly version (basically short-cutting it).

@pip: the reason your code wasn't working is that the regular expression was not escaped correctly.
Here is working code using obfuscation.

Spoiler:

Code:

#!/usr/bin/env  python
__license__     = 'GPL v3'
__author__      = 'Tony Stegall'
__copyright__   = '2010, Tony Stegall or Tonythebookworm on mobileread.com'
__version__     = 'v1.01'
__date__        = '16, October 2010'
__description__ = 'The Star.com'

'''
thestar.com
'''

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile

class TheStar(BasicNewsRecipe):
    __author__    = 'Tony Stegall'
    description   = 'The Star'
    masthead_url     = 'http://thestar.com.my/images/common/logo_tsolv12.gif'
    

    title          = 'TheStar.com'
    publisher      = 'TheStar.com'
    category       = 'News'

    language       = 'en'
    timefmt        = '[%a, %d %b, %Y]'

    oldest_article        = 10
    max_articles_per_feed = 100
    use_embedded_content  = False
    no_stylesheets = True

    remove_javascript     = True
    
    '''
    I use get_obfuscated_article simply so that I can search for the print-friendly link with a regular expression.
    Otherwise I could use print_version, but then I'm stuck with having to split URLs and piece them together.
    '''

    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        print('THE CURRENT URL IS: %s' % url)
        br.open(url)
        '''
             We need to use a try/except block:
             it tries an operation and, if that fails, catches the error instead of crashing
             so that we can do something about it.
             In our case we check whether we can follow /services/printerfriendly.asp; if we can't,
             we simply fall back to the original calling url.
        '''

        try:
            # Follow the first link whose URL matches the printer-friendly path
            response = br.follow_link(url_regex=r'.*?(/services/printerfriendly\.asp)', nr=0)
            html = response.read()
        except Exception:
            # No print link found (or the request failed): use the article page itself
            response = br.open(url)
            html = response.read()
         
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

    ######################################################################################################################

    feeds          = [(u'Nation News', u'http://thestar.com.my/rss/nation.xml'),
                      (u'Business News', u'http://thestar.com.my/rss/business.xml'),
                      (u'Technology News', u'http://thestar.com.my/rss/technology.xml'),
                      (u'World Updates', u'http://thestar.com.my/rss/worldupdates.xml'), 
                      (u'Sports News', u'http://thestar.com.my/rss/sports.xml'), 
                      (u'Columnists', u'http://thestar.com.my/rss/columnists.xml'), 
                      (u'Opinions', u'http://thestar.com.my/rss/opinion.xml')
                      ]
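
To see why the pattern in the first post never matched, here is a minimal sketch using Python's `re` module outside calibre. The `href` value is a made-up example modelled on the `/services/printerfriendly.asp` path above; only the two patterns come from the thread. In a regex, an unescaped `?` makes the preceding character optional and an unescaped `.` matches any character, so neither is treated as literal text.

```python
import re

# Hypothetical print link; only the '/services/printerfriendly.asp' path
# comes from the recipe above, the query string is made up.
href = '/services/printerfriendly.asp?file=/2010/10/16/nation/example'

# Pattern from the first post: '?' makes the 'p' optional instead of
# matching a literal question mark, so 'file=' is expected right after 'asp'.
broken = r'/printerfriendly.asp?file='

# Same pattern with the special characters escaped.
fixed = r'/printerfriendly\.asp\?file='

broken_matches = re.search(broken, href) is not None
fixed_matches = re.search(fixed, href) is not None

# re.escape() builds the escaped pattern from the literal text.
escaped_matches = re.search(re.escape('/printerfriendly.asp?file='), href) is not None

print(broken_matches, fixed_matches, escaped_matches)
```

The recipe above sidesteps the `?` entirely by matching only up to `printerfriendly\.asp`.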

PipSqueak · 10-16-2010, 08:02 PM

Thanks Starson17 and Tony for the recipe!

I have no background in programming, so it's easier for me to copy the example given in the recipe manual than to make one from scratch.

By the way, the div version captures pictures and is 1.2 MB in size, whereas the print version is 0.5 MB but pictureless. I had a look at some of these print versions and they do show pictures. Could it be that the pics weren't captured because the stylesheet is turned off? For example, this URL: http://biz.thestar.com.my/news/story...9&sec=business

Starson17 · 10-17-2010, 08:40 AM

Quote:

Originally Posted by TonytheBookworm

I think he/she is trying to do what I have done in the past with using obfuscation to pull only the printer friendly version (basically short-cutting it).

If that's the case, then you should show him how to use the tool that's designed to do that job - print_version. It still doesn't look to me like there's any obfuscation going on. I briefly looked at the print link and it appeared to be a simple text substitution in the link.

TonytheBookworm · 10-17-2010, 06:41 PM

Quote:

Originally Posted by Starson17

If that's the case, then you should show him how to use the tool that's designed to do that job - print_version. It still doesn't look to me like there's any obfuscation going on. I briefly looked at the print link and it appeared to be a simple text substitution in the link.

I agree but I like to change things up from time to time. That's how i learn by doing things differently. Kinda like driving to work every day. Same ol road same ol trees same ol houses. But if i go a different route I might discover something different. Anyway, I see your point.

Starson17 · 10-18-2010, 10:50 AM

Quote:

Originally Posted by TonytheBookworm

I agree but I like to change things up from time to time. That's how i learn by doing things differently. Kinda like driving to work every day. Same ol road same ol trees same ol houses. But if i go a different route I might discover something different. Anyway, I see your point.

I do that, too - try to accomplish something a different way to make sure I understand it. There's nothing wrong with that, but you can see that it confused me, so it may confuse others who look at your recipe. I came to the obfuscated options very late, and tend to think of them as the "last resort," so I start looking for what tricky problem the site has that requires doing it that way. I get confused when I can't find the reason that obfuscation is specified.

For others: what Tony and I are talking about is that Tony has used a sophisticated option to download a page from the article, then "click" on a button on that page to get the print version. It works just like your browser works by setting up an internal browser session. To use his code, you use a regex to "find" and click the button on the downloaded page that gets the print version.
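
As a rough illustration of that "find the link, then click it" idea, here is a standard-library sketch. It is not how calibre's browser is implemented; `LinkCollector`, `find_link`, the `example.com` page URL and the sample HTML are all hypothetical names and data made up for this example. Only the `/services/printerfriendly.asp` path comes from Tony's recipe.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def find_link(page_url, html, url_regex):
    """Return the absolute URL of the first link matching url_regex, or None."""
    collector = LinkCollector()
    collector.feed(html)
    for href in collector.hrefs:
        if re.search(url_regex, href):
            return urljoin(page_url, href)
    return None


# Made-up article page containing a print link.
page_url = 'http://example.com/news/article.html'
html = ('<a href="/index.html">Home</a>'
        '<a href="/services/printerfriendly.asp?file=/example">Print</a>')

print_url = find_link(page_url, html, r'/services/printerfriendly\.asp')

# A page without a print link yields None; the caller would then fall
# back to the article page, as the try/except in the recipe does.
missing = find_link(page_url, '<a href="/index.html">Home</a>',
                    r'/services/printerfriendly\.asp')

print(print_url, missing)
```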

Kovid has what I consider to be a more straightforward way of getting the print version. You look at the page, find the same link that Tony's code searches for, and tell your recipe to modify the article link to go directly to the print version page. It skips the steps of setting up an internal browser, downloading the page locally, keeping track of cookies, searching in that page via the regex for the link, then clicking the print version button. Tony's obfuscated method works when there's no way to figure out how to change the article link to the print version link, or where the site requires certain cookies to be set before you can get the print version.
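
A minimal sketch of that text-substitution approach follows. The helper name `to_print_url` and the sample URL are hypothetical, and the `/news/story.asp` article path is only a guess at how the site's article links look; check a real article link and its print link before relying on it. In a recipe, the same substitution would go inside `print_version(self, url)`.

```python
def to_print_url(url):
    """Rewrite an article URL into its print version by text substitution."""
    # ASSUMPTION: '/news/story.asp' is a guessed article path; only
    # '/services/printerfriendly.asp' appears in the recipe in this thread.
    return url.replace('/news/story.asp', '/services/printerfriendly.asp')


# Made-up article link for illustration.
article = 'http://example.com/news/story.asp?file=/example&sec=nation'
print_url = to_print_url(article)
print(print_url)

# Inside a recipe class this would become:
#     def print_version(self, url):
#         return to_print_url(url)
```

A URL that does not contain the guessed path is returned unchanged, so the recipe would simply fetch the normal article page.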

Both work for normal print version links, and Tony's code works in more situations than the simpler code (i.e. when the link really is "obfuscated"), but at the cost of slightly greater complexity and slower speed. Each recipe author uses their own techniques.

