Quote:
Originally Posted by marbs
I need to go over your code slowly. I am not sure I understand it at all. Can I use it as is? I would love an explanation when you have the time.
BTW, the IT address is "http://it.themarker.com/tmit/article/XXXXX"
and the print version is "http://it.themarker.com/tmit/PrintArticle/XXXXX"
How would you do the cleanup for the different pages (or should I just leave it)?
Thanks again for all your help. I really do appreciate it.
That's what I get for posting code without testing it... Anyway.
This might do the trick. I can't seem to get it to find an it.themarker.com link, so you're going to have to be my eyes in the field on this one. Here is what happens: when the recipe follows a link such as cars.themarker.com, the link gets converted to www.themarker.com in the cases I have seen. If you know a specific URL that I can test, please let me know, because from what I'm seeing, links like law.themarker, cars.themarker, and careers.themarker all revert to
www.themarker.com/xxxxxxxxx and so on.
Here is what I have come up with thus far. Sorry about the previous code.
Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1283848012(BasicNewsRecipe):
    description = 'TheMarker'
    cover_url = 'http://static.ispot.co.il/wp-content/upload/2009/09/themarker.jpg'
    title = u'The Marker1'
    language = 'he'
    simultaneous_downloads = 5
    #delay = 6
    remove_javascript = True
    timefmt = '[%a, %d %b, %Y]'
    oldest_article = 2
    #remove_tags = [dict(name='tr', attrs={'bgcolor':['#738A94']}) ]
    max_articles_per_feed = 10
    #extra_css='body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }'
    feeds = [(u'Head Lines', u'http://www.themarker.com/tmc/content/xml/rss/hpfeed.xml'),
             (u'TA Market', u'http://www.themarker.com/tmc/content/xml/rss/sections/marketfeed.xml'),
             (u'Real Estate', u'http://www.themarker.com/tmc/content/xml/rss/sections/realEstaterfeed.xml'),
             (u'Wall Street & Global', u'http://www.themarker.com/tmc/content/xml/rss/sections/wallsfeed.xml'),
             (u'Law', u'http://www.themarker.com/tmc/content/xml/rss/sections/lawfeed.xml'),
             (u'Media', u'http://www.themarker.com/tmc/content/xml/rss/sections/mediafeed.xml'),
             (u'Consumer', u'http://www.themarker.com/tmc/content/xml/rss/sections/consumerfeed.xml'),
             (u'Career', u'http://www.themarker.com/tmc/content/xml/rss/sections/careerfeed.xml'),
             (u'Car', u'http://www.themarker.com/tmc/content/xml/rss/sections/carfeed.xml'),
             (u'High Tech', u'http://www.themarker.com/tmc/content/xml/rss/sections/hightechfeed.xml'),
             (u'Investor Guide', u'http://www.themarker.com/tmc/content/xml/rss/sections/investorGuidefeed.xml')]

    # Example of a www.themarker.com article URL and its print-friendly version:
    #http://www.themarker.com/tmc/article.jhtml?ElementId=zz20100918_6121
    #http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2Fzz20100918_6121.xml
    #
    # Previous attempt, kept for reference:
    #def print_version(self, url):
    #    baseURL=url.replace('tmc/article.jhtml?ElementId=', 'ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F')
    #    print 'BASE IS :', baseURL
    #    s= baseURL + '.xml'
    #    return s

    def print_version(self, url):
        self.log('ORG URL IS: ', url)
        # it.themarker.com articles use a different print URL scheme,
        # so look for that host name in the URL first.
        if re.search(r'it\.themarker\.com', url, re.IGNORECASE) and 'article/' in url:
            self.log('FOUND IT: ', url)
            # Everything after "article/" is the article id.
            print_url = 'http://it.themarker.com/tmit/PrintArticle/' + url.split('article/')[1]
        elif '=' in url:
            # The ElementId is the text after the first '=' in the URL.
            print_url = 'http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F' + url.split('=')[1] + '.xml'
        else:
            # Unrecognized URL shape: fall back to the original page
            # instead of failing on a missing ElementId.
            print_url = url
        # Test output to see which URL will be returned.
        self.log('THIS URL WILL PRINT: ', print_url)
        return print_url
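
If you want to check the URL rewriting without running a whole calibre fetch, a standalone sketch along these lines might help. It only mirrors the two URL patterns mentioned in this thread; the function name to_print_url and the constant names are made up for illustration, and I have not been able to test it against a live it.themarker.com link.

```python
import re

# Print-URL prefixes copied from the recipe in this post.
IT_PRINT_PREFIX = 'http://it.themarker.com/tmit/PrintArticle/'
WWW_PRINT_PREFIX = ('http://www.themarker.com/ibo/misc/printFriendly.jhtml'
                    '?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F')


def to_print_url(url):
    """Hypothetical helper: map an article URL to its print-friendly URL."""
    # it.themarker.com: .../tmit/article/<id> -> .../tmit/PrintArticle/<id>
    if re.search(r'it\.themarker\.com', url, re.IGNORECASE) and 'article/' in url:
        return IT_PRINT_PREFIX + url.split('article/', 1)[1]
    # www.themarker.com: the ElementId follows the first '=' in the URL.
    if '=' in url:
        return WWW_PRINT_PREFIX + url.split('=')[1] + '.xml'
    # Anything else is returned unchanged (an assumption, not site behaviour).
    return url
```

Calling to_print_url with the example article URL from the recipe comments (ElementId=zz20100918_6121) should give back the printFriendly.jhtml URL listed right under it. For the IT site, pass a real http://it.themarker.com/tmit/article/XXXXX link and compare the result with the print page in your browser.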