Old 10-16-2010, 01:23 PM   #1
PipSqueak
Junior Member
PipSqueak began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Oct 2010
Device: Kindle
So close yet so far... frustrated recipe

Can anyone help me fix this recipe? I'm trying to fetch news from a local newspaper. I think I'm *almost* there, but I suck with the regex because I don't know programming. Thanks.

P.S. I searched the forums and spent hours and hours on the recipe before posting here as a last resort.

Code:
class AdvancedUserRecipe1287215970(BasicNewsRecipe):
    title          = u'The Star Malaysia'
    oldest_article = 2
    max_articles_per_feed = 1

    feeds          = [(u'Nation News', u'http://thestar.com.my/rss/nation.xml'), (u'Business News', u'http://thestar.com.my/rss/business.xml'), (u'Technology News', u'http://thestar.com.my/rss/technology.xml'), (u'World Updates', u'http://thestar.com.my/rss/worldupdates.xml'), (u'Sports News', u'http://thestar.com.my/rss/sports.xml'), (u'Columnists', u'http://thestar.com.my/rss/columnists.xml'), (u'Opinions', u'http://thestar.com.my/rss/opinion.xml')]

    from calibre.ptempfile import PersistentTemporaryFile
    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open(url)

        response = br.follow_link(url_regex = r'/printerfriendly.asp?file=')
        html = response.read()

        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()

        return self.temp_files[-1].name
Old 10-16-2010, 03:30 PM   #2
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by PipSqueak View Post
Can anyone help me fix this recipe?
You don't need to use obfuscated_article as far as I can tell.
Try this to start:
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class Star_Malaysia(BasicNewsRecipe):
    title          = u'The Star Malaysia'
    __author__     = 'Starson17'
    oldest_article = 20
    max_articles_per_feed = 10
    # Keep only the main story container and drop everything after the story body.
    keep_only_tags    = [dict(name='div', attrs={'id':'story_main'})]
    remove_tags_after = dict(name='div', attrs={'id':'story_content'})

    feeds = [(u'Nation News', u'http://thestar.com.my/rss/nation.xml'),
             (u'Business News', u'http://thestar.com.my/rss/business.xml'),
             (u'Technology News', u'http://thestar.com.my/rss/technology.xml'),
             (u'World Updates', u'http://thestar.com.my/rss/worldupdates.xml'),
             (u'Sports News', u'http://thestar.com.my/rss/sports.xml'),
             (u'Columnists', u'http://thestar.com.my/rss/columnists.xml'),
             (u'Opinions', u'http://thestar.com.my/rss/opinion.xml')]
Old 10-16-2010, 06:45 PM   #3
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by Starson17 View Post
You don't need to use obfuscated_article as far as I can tell.
Try this to start:
I think he/she is trying to do what I have done in the past: using obfuscation to pull only the printer-friendly version (basically a shortcut).

@pip: your code wasn't working because the regular expression was not escaped correctly. In a regex, the . and ? characters have special meanings, so they must be escaped with a backslash to match literally.
Here is working code using obfuscation.
Spoiler:
Code:
#!/usr/bin/env  python
__license__     = 'GPL v3'
__author__      = 'Tony Stegall'
__copyright__   = '2010, Tony Stegall or Tonythebookworm on mobileread.com'
__version__     = 'v1.01'
__date__        = '16, October 2010'
__description__ = 'The Star.com'

'''
thestar.com
'''

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile

class LaWeekly(BasicNewsRecipe):
    __author__    = 'Tony Stegall'
    description   = 'The Star'
    masthead_url     = 'http://thestar.com.my/images/common/logo_tsolv12.gif'
    

    title          = 'TheStar.com'
    publisher      = 'TheStar.com'
    category       = 'News'

    language       = 'en'
    timefmt        = '[%a, %d %b, %Y]'

    oldest_article        = 10
    max_articles_per_feed = 100
    use_embedded_content  = False
    no_stylesheets = True

    remove_javascript     = True
    
    # get_obfuscated_article lets me regex-search the article page for the
    # print-friendly link. I could use print_version instead, but then I would
    # have to split URLs and piece them back together.

    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        self.log('THE CURRENT URL IS: ', url)
        br.open(url)
        # Try to follow the /services/printerfriendly.asp link. If that fails
        # (for example, the page has no such link), fall back to the original
        # article URL instead of crashing.
        try:
            response = br.follow_link(url_regex='.*?(\\/services\\/printerfriendly\\.asp)', nr=0)
            html = response.read()
        except Exception:
            response = br.open(url)
            html = response.read()

        # Save the HTML to a temporary file and hand its path back to calibre.
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

    ######################################################################################################################

    feeds          = [(u'Nation News', u'http://thestar.com.my/rss/nation.xml'),
                      (u'Business News', u'http://thestar.com.my/rss/business.xml'),
                      (u'Technology News', u'http://thestar.com.my/rss/technology.xml'),
                      (u'World Updates', u'http://thestar.com.my/rss/worldupdates.xml'), 
                      (u'Sports News', u'http://thestar.com.my/rss/sports.xml'), 
                      (u'Columnists', u'http://thestar.com.my/rss/columnists.xml'), 
                      (u'Opinions', u'http://thestar.com.my/rss/opinion.xml')
                      ]
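
To see why the pattern in the first post never matched, here is a minimal sketch using Python's re module. The link value is a made-up example (only the /services/printerfriendly.asp path comes from the recipe above), and the sketch assumes the browser applies url_regex as a search against each link's URL.

```python
import re

# Hypothetical print link; only the path is taken from the recipe above.
link = '/services/printerfriendly.asp?file=/2010/10/16/demo/1&sec=demo'

# Pattern from the first post: '.' matches any character and 'p?' means
# "an optional p", so the literal text 'asp?file=' can never match.
unescaped = r'/printerfriendly.asp?file='

# Same pattern with the special characters escaped.
escaped = r'/printerfriendly\.asp\?file='

# Pattern from the working recipe above.
working = '.*?(\\/services\\/printerfriendly\\.asp)'

# Record whether each pattern finds the link.
results = {name: bool(re.search(pattern, link))
           for name, pattern in [('unescaped', unescaped),
                                 ('escaped', escaped),
                                 ('working', working)]}
print(results)
```

Only the unescaped pattern fails to find the link. Escaping the . and the ? is enough to make the original idea work.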
Old 10-16-2010, 08:02 PM   #4
PipSqueak
Thanks Starson17 and Tony for the recipe!

I have no background in programming, so it's easier for me to copy the example given in the recipe manual than to make one from scratch.

By the way, the div version captures pictures and is 1.2 MB in size, whereas the print version is 0.5 MB but pictureless. I had a look at some of these print versions and they do show pictures. Could it be that the pictures weren't captured because the stylesheet is turned off? For example, this URL: http://biz.thestar.com.my/news/story...9&sec=business

Last edited by PipSqueak; 10-16-2010 at 09:11 PM.
Old 10-17-2010, 08:40 AM   #5
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
I think he/she is trying to do what I have done in the past: using obfuscation to pull only the printer-friendly version (basically a shortcut).
If that's the case, then you should show him how to use the tool that's designed to do that job - print_version. It still doesn't look to me like there's any obfuscation going on. I briefly looked at the print link and it appeared to be a simple text substitution in the link.
Old 10-17-2010, 06:41 PM   #6
TonytheBookworm
Quote:
Originally Posted by Starson17 View Post
If that's the case, then you should show him how to use the tool that's designed to do that job - print_version. It still doesn't look to me like there's any obfuscation going on. I briefly looked at the print link and it appeared to be a simple text substitution in the link.
I agree, but I like to change things up from time to time. That's how I learn: by doing things differently. It's kind of like driving to work every day: same ol' road, same ol' trees, same ol' houses. But if I take a different route, I might discover something new. Anyway, I see your point.
Old 10-18-2010, 10:50 AM   #7
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
I agree, but I like to change things up from time to time. That's how I learn: by doing things differently. It's kind of like driving to work every day: same ol' road, same ol' trees, same ol' houses. But if I take a different route, I might discover something new. Anyway, I see your point.
I do that, too - try to accomplish something a different way to make sure I understand it. There's nothing wrong with that, but you can see that it confused me, so it may confuse others who look at your recipe. I came to the obfuscated options very late, and tend to think of them as the "last resort," so I start looking for what tricky problem the site has that requires doing it that way. I get confused when I can't find the reason that obfuscation is specified.

For others: what Tony and I are talking about is that Tony has used a sophisticated option to download a page from the article, then "click" on a button on that page to get the print version. It works just like your browser works by setting up an internal browser session. To use his code, you use a regex to "find" and click the button on the downloaded page that gets the print version.

Kovid has what I consider to be a more straightforward way of getting the print version. You look at the page, find the same link that Tony's code searches for, and tell your recipe to modify the article link to go directly to the print version page. It skips the steps of setting up an internal browser, downloading the page locally, keeping track of cookies, searching in that page via the regex for the link, then clicking the print version button. Tony's obfuscated method works when there's no way to figure out how to change the article link to the print version link, or where the site requires certain cookies to be set before you can get the print version.

Both work for normal print version links, and Tony's code works in more situations than the simpler code (i.e. when the link really is "obfuscated"), but at the cost of slightly greater complexity and slower speed. Each recipe author uses their own techniques.
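
For comparison, the link-rewriting approach might look roughly like the sketch below. It is only an illustration under stated assumptions: the /news/story.asp article path, the idea that the print page accepts the same query string, and the example.com URL are all guesses. Only the /services/printerfriendly.asp path appears in this thread, so check a real article and its print link before relying on it. The helper name to_print_url is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

# ASSUMPTION: article pages live at /news/story.asp and the print page takes
# the same query string. Only PRINT_PATH is taken from this thread.
ARTICLE_PATH = '/news/story.asp'
PRINT_PATH = '/services/printerfriendly.asp'


def to_print_url(url):
    """Return the print-version URL, or the URL unchanged if it does not
    look like an article link."""
    parts = urlsplit(url)
    if parts.path != ARTICLE_PATH:
        return url
    # Swap only the path; host and query string stay the same.
    return urlunsplit(parts._replace(path=PRINT_PATH))


# In a recipe this would be wired in as:
#     def print_version(self, url):
#         return to_print_url(url)

# Hypothetical article URL, for illustration only.
example = 'http://example.com/news/story.asp?file=/2010/10/16/demo/1&sec=demo'
print(to_print_url(example))
```

No browser session, temporary file, or regex search is needed here, which is the simplicity described above. The trade-off is that it stops working as soon as the print URL can no longer be derived from the article URL.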

Last edited by Starson17; 10-18-2010 at 10:54 AM.
All times are GMT -4.