Old 09-15-2011, 03:32 AM   #3
JayKindle
Okay, a little more info. The site I am trying to fetch has a Print feature, which has a cleaner layout but still shows ad banners.

I was trying to follow the recipe for making links fetch the data from a Print page instead. But I am having trouble working out where to add what in the code, or rather what I need to put in to make this work.

Here is a normal link to that site:
http://www.mixingonbeat.com/phpbb/viewtopic.php?t=6452

Here is a print link to that site:
http://www.mixingonbeat.com/phpbb/vi...ote=viewresult

Here is their RSS Feed to that page:
http://www.mixingonbeat.com/phpbb/rss.php?t=6452

Here is the code I am trying to work with:

Spoiler:
# Add this import at the top of the recipe file.
from calibre.ptempfile import PersistentTemporaryFile

# Everything below goes inside the recipe class.
#
# We need to find every link that contains /content/printVersion/.
# To do that we set up a temp list, then turn on the flag that tells calibre
# the articles are obfuscated, so that it calls get_obfuscated_article()
# (in our case, to get the print version).
# We get a browser from calibre and let it do the work: it opens the article
# page and follows the first link that matches the regular expression
#     .*?(\\/)(content)(\\/)(printVersion)(\\/)
# so basically any link that looks like /content/printVersion/.
# The page is written to a temporary HTML file that calibre then parses.
# That is all this recipe needs.

temp_files = []
articles_are_obfuscated = True

def get_obfuscated_article(self, url):
    br = self.get_browser()
    print('THE CURRENT URL IS: %s' % url)
    br.open(url)

    # Use a try/except block: it tries an operation and, if that fails,
    # catches the error instead of crashing.
    # Here we try to follow the /content/printVersion/ link; if we can't,
    # we fall back to the original calling URL.
    try:
        response = br.follow_link(url_regex='.*?(\\/)(content)(\\/)(printVersion)(\\/)', nr=0)
        html = response.read()
    except Exception:
        response = br.open(url)
        html = response.read()

    # Save the HTML to a temp file and hand its path back to calibre.
    self.temp_files.append(PersistentTemporaryFile('_fa.html'))
    self.temp_files[-1].write(html)
    self.temp_files[-1].close()
    return self.temp_files[-1].name
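
To make the regex step easier to reason about, here is a small standalone sketch of what it appears to do, using only Python's `re` module and no calibre or mechanize. The helper `pick_print_link` and the example.com links are made up for illustration; the sketch assumes the pattern is tested against each link's URL and that the original URL is used when nothing matches.

```python
import re

# Same pattern as in the recipe. In a normal (non-raw) string, '\\/' becomes
# the regex '\/', which is just a literal slash.
PRINT_LINK = re.compile('.*?(\\/)(content)(\\/)(printVersion)(\\/)')


def pick_print_link(hrefs, fallback_url):
    """Hypothetical helper: first href matching PRINT_LINK, else fallback_url."""
    for href in hrefs:
        if PRINT_LINK.search(href):
            return href
    return fallback_url


# Made-up links, only for illustration.
sample_hrefs = [
    'http://example.com/content/view/123',
    'http://example.com/content/printVersion/123',
]

# A link containing /content/printVersion/ is picked.
print(pick_print_link(sample_hrefs, 'http://example.com/article/123'))

# No matching link: the fallback (original) URL is used.
print(pick_print_link(['http://example.com/about'], 'http://example.com/article/123'))

# The normal link from this post does not contain /content/printVersion/,
# so this pattern does not match it.
print(PRINT_LINK.search('http://www.mixingonbeat.com/phpbb/viewtopic.php?t=6452'))
```

So the pattern only ever fires on URLs that contain /content/printVersion/; for a site whose print links look different, it would presumably have to be changed to match that site's print URL.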


I really hope someone can help. Thanks.