Old 10-07-2010, 11:12 PM   #6
TonytheBookworm
Hey Starson17, or anyone else for that matter: how do you check a mechanize follow_link call to make sure the link it follows actually exists? More specifically, I have a combination of feeds whose article pages mostly contain a link matching the url_regex .*?\\/content\\/printVersion
but some of the feeds do not have that link in their pages. How do I test for that?
I keep getting linknotfound errors on the event feeds because their pages do not contain a /content/printVersion link. In that case I would like it to simply return the calling url.

Here is the code I have thus far. Everything works except the music and events feeds, because of the issue mentioned above.
Thanks.
Here is the section I'm having issues with:
Spoiler:

Code:
def get_obfuscated_article(self, url):
        br = self.get_browser()
        print 'THE CURRENT URL IS: ', url
        br.open(url)
        response = br.follow_link(url_regex='.*?(\\/)(content)(\\/)(printVersion)(\\/)', nr = 0)
        
        if response is None:
           response = br.follow_link(url, nr=0)
        html = response.read()
        
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name
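
For what it's worth, here is a rough sketch of the behaviour I'm after. It assumes follow_link raises mechanize.LinkNotFoundError when nothing matches instead of returning None, which would explain the linknotfound errors and why my "if response is None" check never runs. The helper name fetch_print_or_original is made up for illustration and is not part of calibre or mechanize, and the stand-in exception class is only there so the snippet runs without mechanize installed.

```python
try:
    from mechanize import LinkNotFoundError
except ImportError:
    # Stand-in (assumption) so the sketch still runs without mechanize.
    class LinkNotFoundError(Exception):
        pass

# Same idea as the url_regex in the recipe, written as a raw string.
PRINT_VERSION = r'.*?/content/printVersion/'


def fetch_print_or_original(br, url, pattern=PRINT_VERSION):
    """Open url and follow the first link matching pattern.

    If the page has no such link, fall back to the response for the
    calling url itself. `br` is a mechanize-style browser, e.g. the
    one returned by self.get_browser() in a recipe.
    """
    response = br.open(url)
    try:
        # Assumed to raise LinkNotFoundError when no link matches.
        return br.follow_link(url_regex=pattern, nr=0)
    except LinkNotFoundError:
        # No print version on this page: use the page we already opened.
        # If that response cannot be read, calling br.open(url) again
        # here would be the alternative.
        return response
```

Inside get_obfuscated_article the idea would be to replace the follow_link call and the None check with html = fetch_print_or_original(br, url).read(), and leave the temp file handling as it is.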

and here is the whole code
Spoiler:

Code:
#!/usr/bin/env  python
__license__     = 'GPL v3'
__author__      = 'Tony Stegall'
__copyright__   = '2010, Tony Stegall or Tonythebookworm on mobileread.com'
__version__     = 'v1.01'
__date__        = '07, October 2010'
__description__ = 'La weekly mag'

'''
http://www.laweekly.com
'''

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile

class LaWeekly(BasicNewsRecipe):
    __author__    = 'Tony Stegall'
    description   = 'La Weekly Mag'
    cover_url     = 'http://assets.laweekly.com/img/citylogo-lg.png'
    

    title          = 'La WeeklyMag '
    publisher      = 'Laweekly.com'
    category       = 'News,US'

    language       = 'en'
    timefmt        = '[%a, %d %b, %Y]'

    oldest_article        = 15
    max_articles_per_feed = 25
    use_embedded_content  = False
    

    remove_javascript     = True
    ######################################################################################################################
    '''
    We need to find all instances of /content/printVersion/.
    To do this we set up a temp list,
    then turn on the flag that tells calibre the articles are obfuscated,
    then fetch the obfuscated article (in our case the print version).
    We create a browser and let calibre do all the work for us. It opens an internal browser and follows
    the first link that matches the regular expression .*?(\\/)(content)(\\/)(printVersion)(\\/),
    so basically any link that looks like /content/printVersion/.
    It writes the page to a temp html file that the recipe/calibre will parse.
    And that's all that is needed for this recipe.
    '''

    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        print 'THE CURRENT URL IS: ', url
        br.open(url)
        response = br.follow_link(url_regex='.*?(\\/)(content)(\\/)(printVersion)(\\/)', nr = 0)
        
        if response is None:
           response = br.follow_link(url, nr=0)
        html = response.read()
        
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

    ######################################################################################################################

    feeds          = [
                       (u'Complete Issue', u'http://www.laweekly.com/syndication/issue/'),
                       (u'News', u'http://www.laweekly.com/syndication/section/news/'),
                       (u'Music', u'http://www.laweekly.com/syndication/section/music/'),
                       (u'Movies', u'http://www.laweekly.com/syndication/section/film/'),
                       (u'Restaurants', u'http://www.laweekly.com/syndication/section/dining/'),
                       (u'Music Events', u'http://laweekly.com/syndication/events?type=music'),
                       (u'Calendar Events', u'http://laweekly.com/syndication/events'),
                       (u'Restaurant Guide', u'http://laweekly.com/syndication/restaurants/search/'),
                       
                     ]