Hey Starson17, or anyone else for that matter. How do you check a mechanize follow to make sure it points at a valid link? More specifically, I have a combination of feeds that mostly match the url_regex of .*?\\/content\\/printVersion
but some of the feeds do not have that link in them. How do I test for that?
I keep getting LinkNotFound errors on the events feeds because they do not contain a /content/printVersion link. In that case I would like it to simply return the calling URL.
Here is the code I have thus far. Everything works except the music and events feeds, because of the above-mentioned issue.
Thanks.
Here is the section I'm having issues with
and here is the whole code
Code:
#!/usr/bin/env python
__license__ = 'GPL v3'
__author__ = 'Tony Stegall'
__copyright__ = '2010, Tony Stegall or Tonythebookworm on mobileread.com'
__version__ = 'v1.01'
__date__ = '07, October 2010'
__description__ = 'La weekly mag'
'''
http://www.laweekly.com
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile
class LaWeekly(BasicNewsRecipe):
    __author__ = 'Tony Stegall'
    description = 'La Weekly Mag'
    cover_url = 'http://assets.laweekly.com/img/citylogo-lg.png'
    title = 'La WeeklyMag '
    publisher = 'Laweekly.com'
    category = 'News,US'
    language = 'en'
    timefmt = '[%a, %d %b, %Y]'
    oldest_article = 15
    max_articles_per_feed = 25
    use_embedded_content = False
    remove_javascript = True
    ######################################################################################################################
    '''
    We need to find all instances of /content/printVersion/
    So in order to do this we set up a temp list
    Then we turn on the flag to tell calibre/beautifulsoup that the articles are obfuscated
    Then we get the obfuscated article (in our case the print version)
    We create a browser and let calibre do all the work for us. It will open an internal browser and follow
    the links that match the regular expression of .*?(\\/)(content)(\\/)(printVersion)(\\/)
    so basically any link that looks like this /content/printVersion/
    It writes all the information to a temp html file that the recipe/calibre will parse from.
    And that's all that is needed for this recipe.
    '''
    temp_files = []
    articles_are_obfuscated = True
    def get_obfuscated_article(self, url):
        br = self.get_browser()
        print 'THE CURRENT URL IS: ', url
        br.open(url)
        response = br.follow_link(url_regex='.*?(\\/)(content)(\\/)(printVersion)(\\/)', nr=0)
        if response is None:
            response = br.follow_link(url, nr=0)
        html = response.read()
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name
    ######################################################################################################################
    feeds = [
        (u'Complete Issue', u'http://www.laweekly.com/syndication/issue/'),
        (u'News', u'http://www.laweekly.com/syndication/section/news/'),
        (u'Music', u'http://www.laweekly.com/syndication/section/music/'),
        (u'Movies', u'http://www.laweekly.com/syndication/section/film/'),
        (u'Restaurants', u'http://www.laweekly.com/syndication/section/dining/'),
        (u'Music Events', u'http://laweekly.com/syndication/events?type=music'),
        (u'Calendar Events', u'http://laweekly.com/syndication/events'),
        (u'Restaurant Guide', u'http://laweekly.com/syndication/restaurants/search/'),
    ]
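
One possible way to get the fallback described above is sketched here; it is not tested against the live feeds. mechanize's follow_link raises LinkNotFoundError when nothing matches instead of returning None, so the "is None" check is never reached. The sketch catches that exception and keeps the page that was already opened. The names open_print_version and PRINT_VERSION_RE are made up for illustration, and br is assumed to behave like the browser returned by self.get_browser().

```python
# Sketch only: fall back to the calling URL when no print-version link exists.
try:
    from mechanize import LinkNotFoundError
except ImportError:
    # Stand-in so the sketch still runs where mechanize is not installed.
    class LinkNotFoundError(Exception):
        pass

# Same target as the recipe's url_regex, written without the capture groups.
# (Hypothetical constant name.)
PRINT_VERSION_RE = r'.*?/content/printVersion/'


def open_print_version(br, url):
    """Return the print-version response, or the original page if there is none.

    Hypothetical helper; `br` is assumed to be a mechanize-style browser.
    """
    # Open the calling URL first and keep its response as the fallback.
    response = br.open(url)
    try:
        # follow_link raises LinkNotFoundError when no link matches.
        response = br.follow_link(url_regex=PRINT_VERSION_RE, nr=0)
    except LinkNotFoundError:
        # No /content/printVersion/ link on this page: keep the original page.
        pass
    return response
```

Inside get_obfuscated_article this would stand in for the br.open / follow_link / "if response is None" lines, i.e. response = open_print_version(br, url), with the rest of the method left unchanged.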