07-07-2011, 11:22 AM   #1
newnick
Junior Member
Join Date: Jul 2011
Device: kindle 3 wifi
How to get the full article when the full-size-image page has no print version and uses the same URL?

Hello!
Sorry for my bad English.
I need to get articles from the ScienceDirect site. We have a subscription to it at work, but there is not always time for reading. I usually create an RSS feed from a keyword search and use this recipe:
Code:
class ScienceDirectSearch(BasicNewsRecipe):
    title                 = 'ScienceDirect Search: nonviral gene delivery cancer career'
    oldest_article        = 2
    max_articles_per_feed = 100
    language              = 'en'
    no_stylesheets        = True
    remove_javascript     = True
    keep_only_tags        = [dict(name='div', attrs={'id': 'articleContent'})]

    feeds = [
        (u'ScienceDirect Search: nonviral gene delivery cancer career', u'http://rss.sciencedirect.com/getMessage?registrationId=JEBCKHJCKGBKREFGLECDJLBJJNEDNEBGPWDKMNFDLE'),
    ]

This code works well, but the pictures are very small (thumbnails). The page has a "Full-Size images" link, and I tried to find a regexp that selects that version, but it is too complicated for me. I will try to explain:

In the RSS feed, the article link is:
Code:
http://www.sciencedirect.com/science?_ob=GatewayURL&_origin=IRSSSEARCH&_method=citationSearch&_piikey=S0142961211001487&_version=1&md5=a9937225219b20142aafab27e5043b87
In the browser, the URL then becomes:
Code:
http://www.sciencedirect.com/science/article/pii/S0142961211001487
The "Full-Size images" link looks like this:
Code:
http://www.sciencedirect.com/science/article/pii/S0142961211001487?_rdoc=1&_fmt=full&_origin=gateway&md5=12247b4a7282dff569e83636d280c9ca&artImgPref=F
But when I click this link, the browser shows the same URL as before:
Code:
http://www.sciencedirect.com/science/article/pii/S0142961211001487
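Comparing the two URLs above, the value of the `_piikey` parameter in the RSS link reappears as the last path segment of the browser URL. A minimal sketch of that mapping follows. It assumes the pattern holds for every article, which the two examples suggest but do not prove, and the function name `article_url_from_gateway` is made up for illustration. It says nothing about the `md5` value in the "Full-Size images" link, which cannot be derived from the feed URL.

```python
from urllib.parse import urlsplit, parse_qs


def article_url_from_gateway(gateway_url):
    """Build the /science/article/pii/<PII> URL from an RSS gateway link.

    Assumption: the _piikey query parameter is the same identifier that
    appears in the article URL, as in the two example URLs in this post.
    """
    parts = urlsplit(gateway_url)
    query = parse_qs(parts.query)
    pii_values = query.get('_piikey')
    if not pii_values:
        raise ValueError('no _piikey parameter in %r' % gateway_url)
    return '%s://%s/science/article/pii/%s' % (
        parts.scheme, parts.netloc, pii_values[0])
```

This is only useful if the recipe needs the stable article URL without following the redirect; it does not by itself select the full-size images.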
So I tried this code:
Code:
class ScienceDirectSearch(BasicNewsRecipe):
    title                 = 'ScienceDirect Search: nonviral gene delivery cancer career'
    oldest_article        = 2
    max_articles_per_feed = 100
    language              = 'en'
    no_stylesheets        = True
    remove_javascript     = True
    keep_only_tags        = [dict(name='div', attrs={'id': 'articleContent'})]

    feeds = [
        (u'ScienceDirect Search: nonviral gene delivery cancer career', u'http://rss.sciencedirect.com/getMessage?registrationId=JEBCKHJCKGBKREFGLECDJLBJJNEDNEBGPWDKMNFDLE'),
    ]

    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open(url)
        response = br.follow_link(url_regex='&artImgPref=F$', nr=0)
        html = response.read()
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name


But calibre reports:
Code:
Failed to download article: Incorporation of active DNA/cationic polymer polyplexes into hydrogel scaffolds from http://www.sciencedirect.com/science...0ade4cd9d0bc48
Traceback (most recent call last):
  File "site-packages\calibre\utils\threadpool.py", line 95, in run
  File "site-packages\calibre\web\feeds\news.py", line 856, in fetch_obfuscated_article
  File "c:\users\rg\appdata\local\temp\calibre_0.8.7_tmp_c5awao\calibre_0.8.7_gdmxs__recipes\recipe0.py", line 25, in get_obfuscated_article
NameError: global name 'PersistentTemporaryFile' is not defined
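The NameError in this traceback means the recipe uses `PersistentTemporaryFile` without importing it. In calibre that class lives in `calibre.ptempfile`, so adding `from calibre.ptempfile import PersistentTemporaryFile` at the top of the recipe should remove this particular error. The sketch below isolates the file-saving step. The helper name `save_article_html` and the fallback for running outside calibre are assumptions for illustration, not calibre code.

```python
import tempfile

try:
    # Inside calibre: this is the import the recipe is missing.
    from calibre.ptempfile import PersistentTemporaryFile
except ImportError:
    # Stand-in so the sketch also runs outside calibre (an assumption,
    # not calibre's implementation): a temp file that is not deleted
    # when it is closed.
    def PersistentTemporaryFile(suffix=''):
        return tempfile.NamedTemporaryFile(suffix=suffix, delete=False)


def save_article_html(html, temp_files):
    """Write the downloaded HTML (bytes) to a temp file and return its path.

    The file object is kept in temp_files, as the recipe does with
    self.temp_files, so that it stays referenced after this call.
    """
    pt = PersistentTemporaryFile('_fa.html')
    temp_files.append(pt)
    pt.write(html)
    pt.close()
    return pt.name
```

In the recipe, the last four lines of `get_obfuscated_article` would then reduce to `return save_article_html(html, self.temp_files)`. Whether `follow_link` actually finds the "Full-Size images" link is a separate matter that this fix does not address.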


Can someone explain to me how to get the full version of this article with the full-size pictures?

I have another question:
Is it possible to change only the pictures in an article? For example, if the article contains pictures at http://.../small/.../image.jpg, can I change them to the pictures at http://.../medium/.../image.jpg?
Thank you!
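For the second question, one possible approach is to rewrite each image address before calibre downloads it. The sketch below assumes the only difference between the two versions is the `/small/` versus `/medium/` path segment, as in the example addresses in the question. The function name `to_medium` and the test URL are hypothetical.

```python
def to_medium(src):
    """Return src with its first '/small/' path segment changed to '/medium/'.

    Addresses without a '/small/' segment are returned unchanged.
    """
    return src.replace('/small/', '/medium/', 1)


# Inside a recipe this could be applied from the preprocess_html hook,
# which receives the parsed page and must return it:
#
#     def preprocess_html(self, soup):
#         for img in soup.findAll('img', src=True):
#             img['src'] = to_medium(img['src'])
#         return soup
```

If the medium-size image does not exist for some picture, the download of that picture would simply fail, so it may be worth checking a few articles by hand first.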