Old 09-19-2010, 03:32 PM   #2769
TonytheBookworm
Quote:
Originally Posted by marbs View Post
I need to go over your code slowly. I am not sure I understand it at all. Can I use it as is? I would love an explanation when you have the time.

BTW, the IT address is "http://it.themarker.com/tmit/article/XXXXX"
and the print version is "http://it.themarker.com/tmit/PrintArticle/XXXXX"

How would you do the cleanup for the different pages (or should I just leave it)?

Thanks again for all your help. I really do appreciate it.
That's what I get for posting code without testing it... Anyway, this might do the trick.
I can't seem to get it to find an it.themarker.com link, so you're going to have to be my eyes in the field on this one. Here is what happens: when the recipe follows a link such as cars.themarker.com, the site converts it to www.themarker.com in the cases I have seen. Links like law.themarker.com, cars.themarker.com, and careers.themarker.com all revert to www.themarker.com/xxxxxxxxx, and so on. If you know a specific URL that I can test, please let me know.

Here is what I have come up with thus far. Sorry about the previous code.

Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re

class AdvancedUserRecipe1283848012(BasicNewsRecipe):
    description   = 'TheMarker'
    cover_url      = 'http://static.ispot.co.il/wp-content/upload/2009/09/themarker.jpg'
    title          = u'The Marker1'
    language       = 'he'
    simultaneous_downloads = 5
    #delay                  = 6   
    remove_javascript     = True
    timefmt        = '[%a, %d %b, %Y]'
    oldest_article = 2
    #remove_tags = [dict(name='tr', attrs={'bgcolor':['#738A94']})          ]
    max_articles_per_feed = 10
    #extra_css='body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }'
    feeds          = [(u'Head Lines', u'http://www.themarker.com/tmc/content/xml/rss/hpfeed.xml'), 
                      (u'TA Market', u'http://www.themarker.com/tmc/content/xml/rss/sections/marketfeed.xml'),
                      (u'Real Estate', u'http://www.themarker.com/tmc/content/xml/rss/sections/realEstaterfeed.xml'),
                      (u'Wall Street & Global', u'http://www.themarker.com/tmc/content/xml/rss/sections/wallsfeed.xml'), 
                      (u'Law', u'http://www.themarker.com/tmc/content/xml/rss/sections/lawfeed.xml'), 
                      (u'Media', u'http://www.themarker.com/tmc/content/xml/rss/sections/mediafeed.xml'), 
                      (u'Consumer', u'http://www.themarker.com/tmc/content/xml/rss/sections/consumerfeed.xml'), 
                      (u'Career', u'http://www.themarker.com/tmc/content/xml/rss/sections/careerfeed.xml'), 
                      (u'Car', u'http://www.themarker.com/tmc/content/xml/rss/sections/carfeed.xml'), 
                      (u'High Tech', u'http://www.themarker.com/tmc/content/xml/rss/sections/hightechfeed.xml'), 
                      (u'Investor Guide', u'http://www.themarker.com/tmc/content/xml/rss/sections/investorGuidefeed.xml')]
    # Previous version (disabled):
    # def print_version(self, url):
    #     baseURL = url.replace('tmc/article.jhtml?ElementId=', 'ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F')
    #     print 'BASE IS :', baseURL
    #     s = baseURL + '.xml'
    #     return s
    #
    # Example article URL and its print-friendly equivalent:
    #   http://www.themarker.com/tmc/article.jhtml?ElementId=zz20100918_6121
    #   http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2Fzz20100918_6121.xml
       
       
    def print_version(self, url):
        # Debug output; remove once the recipe works.
        print('ORG URL IS: %s' % url)

        # it.themarker.com articles look like
        #   http://it.themarker.com/tmit/article/XXXXX
        # and their print version is
        #   http://it.themarker.com/tmit/PrintArticle/XXXXX
        if re.search(r'it\.themarker\.com', url, re.IGNORECASE) and 'article/' in url:
            print('FOUND IT: %s' % url)
            article_id = url.split('article/')[1]
            print_url = 'http://it.themarker.com/tmit/PrintArticle/' + article_id
        elif '=' in url:
            # Every other section uses ...article.jhtml?ElementId=<id>
            element_id = url.split('=')[1]
            print_url = 'http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F' + element_id + '.xml'
        else:
            # Unrecognized URL shape: fall back to the original page.
            print_url = url

        # Test string to see which URL will be returned.
        print('THIS URL WILL PRINT: %s' % print_url)
        return print_url
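
If you want to check the URL mapping without running a full calibre fetch, something like the standalone sketch below might help. It only mirrors the intended mapping in print_version; `to_print_url` and `PRINT_BASE` are names I made up for this test (they are not part of calibre), and it assumes the two URL shapes mentioned in this thread are the only ones that matter.

```python
import re

# Hypothetical constant, not part of calibre: prefix of the print-friendly URL
# used by the www.themarker.com sections.
PRINT_BASE = ('http://www.themarker.com/ibo/misc/printFriendly.jhtml'
              '?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F')


def to_print_url(url):
    """Hypothetical helper: map an article URL to its print version."""
    # it.themarker.com: .../tmit/article/<id> -> .../tmit/PrintArticle/<id>
    if re.search(r'it\.themarker\.com', url, re.IGNORECASE) and 'article/' in url:
        return 'http://it.themarker.com/tmit/PrintArticle/' + url.split('article/')[1]
    # Other sections: ...article.jhtml?ElementId=<id>
    if '=' in url:
        return PRINT_BASE + url.split('=')[1] + '.xml'
    # Anything else is returned unchanged.
    return url
```

Save it to a file, call `to_print_url()` on a few real links from the feeds, and paste the results into a browser to see whether the print pages actually load.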

Last edited by TonytheBookworm; 09-19-2010 at 06:07 PM. Reason: modified code to find it.themarker.com; the error was in the regex