Quote:
Originally Posted by marbs
I need to go over your code slowly. I am not sure I understand it at all. Can I use it as is? I would love an explanation when you have the time.
BTW, the IT address is "http://it.themarker.com/tmit/article/XXXXX"
and the print version is "http://it.themarker.com/tmit/PrintArticle/XXXXX"
How would you do the cleanup for the different pages (or should I just leave it)?
Thanks again for all your help. I really do appreciate it.
Look at the updated code I posted. Test it on your end and see if it works for you. I changed the regular expression, and it now finds the link correctly on my end: it matches it.themarker.com URLs and rewrites them to the PrintArticle address, and anything else is left with the regular themarker.com/********* handling.
Here is the code:
Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1283848012(BasicNewsRecipe):
    description = 'TheMarker'
    cover_url = 'http://static.ispot.co.il/wp-content/upload/2009/09/themarker.jpg'
    title = u'The Marker1'
    language = 'he'
    simultaneous_downloads = 5
    #delay = 6
    remove_javascript = True
    timefmt = '[%a, %d %b, %Y]'
    oldest_article = 2
    #remove_tags = [dict(name='tr', attrs={'bgcolor':['#738A94']}) ]
    max_articles_per_feed = 10
    #extra_css='body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }'
    feeds = [(u'Head Lines', u'http://www.themarker.com/tmc/content/xml/rss/hpfeed.xml'),
             (u'TA Market', u'http://www.themarker.com/tmc/content/xml/rss/sections/marketfeed.xml'),
             (u'Real Estate', u'http://www.themarker.com/tmc/content/xml/rss/sections/realEstaterfeed.xml'),
             (u'Wall Street & Global', u'http://www.themarker.com/tmc/content/xml/rss/sections/wallsfeed.xml'),
             (u'Law', u'http://www.themarker.com/tmc/content/xml/rss/sections/lawfeed.xml'),
             (u'Media', u'http://www.themarker.com/tmc/content/xml/rss/sections/mediafeed.xml'),
             (u'Consumer', u'http://www.themarker.com/tmc/content/xml/rss/sections/consumerfeed.xml'),
             (u'Career', u'http://www.themarker.com/tmc/content/xml/rss/sections/careerfeed.xml'),
             (u'Car', u'http://www.themarker.com/tmc/content/xml/rss/sections/carfeed.xml'),
             (u'High Tech', u'http://www.themarker.com/tmc/content/xml/rss/sections/hightechfeed.xml'),
             (u'Investor Guide', u'http://www.themarker.com/tmc/content/xml/rss/sections/investorGuidefeed.xml')]

    # Earlier version of print_version, kept for reference:
    #def print_version(self, url):
    #    baseURL=url.replace('tmc/article.jhtml?ElementId=', 'ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F')
    #    print 'BASE IS :', baseURL
    #    s= baseURL + '.xml'
    #    return s
    # Example article URL and the print URL it should map to:
    #http://www.themarker.com/tmc/article.jhtml?ElementId=zz20100918_6121
    #http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2Fzz20100918_6121.xml

    def print_version(self, url):
        # Debug output; %-formatting prints the same under Python 2 and 3
        print('ORG URL IS: %s' % url)
        split1 = url.split("=")
        print('THE SPLIT IS: %s' % split1)
        # Look for it.themarker.com anywhere in the URL. The old
        # "for link in weblinks" loop walked the URL one character at a
        # time and repeated the same work, so it has been removed.
        rg = re.compile(r'it\.themarker\.com', re.IGNORECASE)
        if rg.search(url):
            # IT section: .../tmit/article/XXXXX -> .../tmit/PrintArticle/XXXXX
            split2 = url.split("article/")
            print('FOUND it: %s' % url)
            print_url = 'http://it.themarker.com/tmit/PrintArticle/' + split2[1]
        else:
            # Everything else: build the printFriendly URL from the ElementId
            print_url = 'http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F' + split1[1] + '.xml'
        print('THIS URL WILL PRINT: %s' % print_url)  # test string showing the URL that will be returned
        return print_url
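
Since you asked for an explanation: the only part doing real work is the URL rewrite, and you can try it outside calibre. Below is a minimal sketch of the same logic as a plain function. The name to_print_url and the two constants are made up for illustration (they are not part of calibre or the recipe), and the article id 12345 is just a placeholder.

```python
import re

# Hypothetical names for illustration; none of these come from calibre.
IT_HOST = re.compile(r'it\.themarker\.com', re.IGNORECASE)
PRINT_PREFIX = ('http://www.themarker.com/ibo/misc/printFriendly.jhtml'
                '?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F')


def to_print_url(url):
    """Map an article URL to its print URL, mirroring print_version."""
    if IT_HOST.search(url):
        # IT section: .../tmit/article/XXXXX -> .../tmit/PrintArticle/XXXXX
        article_id = url.split('article/')[1]
        return 'http://it.themarker.com/tmit/PrintArticle/' + article_id
    # Main site: take whatever follows the first "=" as the ElementId
    element_id = url.split('=')[1]
    return PRINT_PREFIX + element_id + '.xml'


if __name__ == '__main__':
    # 12345 is a placeholder id, not a real article
    print(to_print_url('http://it.themarker.com/tmit/article/12345'))
    print(to_print_url('http://www.themarker.com/tmc/article.jhtml?ElementId=zz20100918_6121'))
```

Like the recipe, this assumes every main-site URL contains an "=" and every IT URL contains "article/". A URL that fits neither pattern raises an IndexError, so you may want to return the original url in that case.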