MobileRead Forums - View Single Post - LA Weekly - Trouble

TonytheBookworm · 10-08-2010, 04:05 PM

Here is the working version of the code.
I didn't see starson17's post until I had already gone a different route and used try/except statements, which worked fine.

You may want to remove a few more junk tags, but this should do it.

Code:

#!/usr/bin/env  python
__license__     = 'GPL v3'
__author__      = 'Tony Stegall'
__copyright__   = '2010, Tony Stegall or Tonythebookworm on mobileread.com'
__version__     = 'v1.01'
__date__        = '07, October 2010'
__description__ = 'La weekly mag'

'''
http://www.laweekly.com
'''

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile

class LaWeekly(BasicNewsRecipe):
    __author__    = 'Tony Stegall'
    description   = 'La Weekly Mag'
    cover_url     = 'http://assets.laweekly.com/img/citylogo-lg.png'
    

    title          = 'La Weekly Mag'
    publisher      = 'Laweekly.com'
    category       = 'News,US'

    language       = 'en'
    timefmt        = '[%a, %d %b, %Y]'

    oldest_article        = 15
    max_articles_per_feed = 25
    use_embedded_content  = False
    no_stylesheets = True

    remove_javascript     = True
    #####################################################################################
    # cleanup section                                                                   #
    #####################################################################################
    remove_tags        = [
                            dict(name='div', attrs={'class':['chisel_u r_box','sitenav','ListingsSearchWidgetHoriz','events_location_tabs location vcard']}),
                            dict(name='div', attrs={'id':['navBottom','comments','mac_tags']}),
                            dict(name='div', attrs={'class':['likemewidget chisel_u','events_more_events','chisel_u r_box city']}),
                            dict(name='div', attrs={'class':['bottom_bar','footer','binTitle']}),
                            dict(name='a', attrs={'class':'likeme_badge'})
                            
                        ]
    
    
    
    
    ######################################################################################################################
    '''
    We need to find the /content/printVersion/ link on each article page.
    To do this we set up a temp file list, then set the articles_are_obfuscated flag so that
    calibre calls get_obfuscated_article() for each article url.
    In that method we get calibre's browser and let it do the work for us: it opens the article
    and follows the first link that matches the regular expression
    .*?(\\/)(content)(\\/)(printVersion)(\\/)
    so basically any link that contains /content/printVersion/
    The page it gets back is written to a temp html file, and the recipe/calibre parses that file.
    That's all that is needed for this recipe.
    '''

    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        print('THE CURRENT URL IS: %s' % url)
        br.open(url)
        # We need a try/except block here: it tries an operation and, if that
        # fails, catches the error instead of crashing.
        # In our case we try to follow the /content/printVersion/ link; if we
        # can't, we simply fall back to the original calling url.
        
        try:
            response = br.follow_link(url_regex='.*?(\\/)(content)(\\/)(printVersion)(\\/)', nr=0)
            html = response.read()
        except Exception:
            # No print-version link could be followed: use the original page.
            response = br.open(url)
            html = response.read()
         
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

    ######################################################################################################################

    feeds          = [
                       (u'Complete Issue', u'http://www.laweekly.com/syndication/issue/'),
                       (u'News', u'http://www.laweekly.com/syndication/section/news/'),
                       (u'Music', u'http://www.laweekly.com/syndication/section/music/'),
                       (u'Movies', u'http://www.laweekly.com/syndication/section/film/'),
                       (u'Restaurants', u'http://www.laweekly.com/syndication/section/dining/'),
                       (u'Music Events', u'http://laweekly.com/syndication/events?type=music'),
                       (u'Calendar Events', u'http://laweekly.com/syndication/events'),
                       (u'Restaurant Guide', u'http://laweekly.com/syndication/restaurants/search/'),
                       
                     ]
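
If it helps to see the try/except idea on its own, here is a rough standalone sketch that runs outside calibre. It is not part of the recipe: NoPrintLink, follow_print_link, resolve and the sample urls are all made up for illustration, and I'm assuming the browser checks url_regex against each link with a regex search. It only shows which links the pattern picks up and how the fallback to the original url behaves.

```python
import re

# Same pattern string as in the recipe.
PRINT_VERSION = re.compile('.*?(\\/)(content)(\\/)(printVersion)(\\/)')


class NoPrintLink(Exception):
    """Hypothetical stand-in for the error raised when no link matches."""


def follow_print_link(hrefs):
    # Return the first link that looks like a print version (like nr=0).
    for href in hrefs:
        if PRINT_VERSION.search(href):
            return href
    raise NoPrintLink('no /content/printVersion/ link on this page')


def resolve(url, hrefs):
    # Try the print version first; fall back to the original url.
    try:
        return follow_print_link(hrefs)
    except NoPrintLink:
        return url


# Made-up sample data, only for illustration.
article = 'http://www.laweekly.com/2010-10-07/news/some-story/'
with_print = ['/about/', '/content/printVersion/12345/']
without_print = ['/about/', '/syndication/events']

print(resolve(article, with_print))     # the print-version link
print(resolve(article, without_print))  # the original article url
```

Catching one specific exception like this is a bit safer than a bare except, since it won't hide unrelated mistakes in the code.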

P.S. Thanks, starson17, for the response. I didn't see it before I finished this up.