#16
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Show me an example page that lists all the PDFs you want.
Is the site you mentioned above http://newleftreview.org? I don't see any sort of main index page.
#17
Connoisseur
Posts: 70
Karma: 1800048
Join Date: Oct 2014
Device: BooX M96
Yes, that's the one.
Hmm, I'm quite new to this. I mirrored part of the site, and it seems it is organized by issue, with HTML pages sorted accordingly, but I haven't had any luck with the PDFs ...
#18
Banned
Posts: 8
Karma: 3234370
Join Date: Aug 2015
Device: Calibre
Quote:
Where from? I suppose the Zotero plugin may help: https://www.zotero.org/download/
#19
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Quote:
And I notice that they are behind a paywall, which means you will have to log in -- always fun with scripts.

To log in, you will need the following command:
Code:
wget --save-cookies=cookies.txt --keep-session-cookies --post-data="username=MYUSERNAME&password=MYPASSWORD" --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org/customer/do_login --delete-after

How to figure out the right command:
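Roughly (this is only a sketch -- the login-page URL below is a guess, and other sites will use different field names): save the site's login page and look at its form and input tags to see where the form posts to and what the fields are called, then plug those into --post-data.
Code:
# fetch the login page, then list its <form> and <input> tags
wget -O login.html https://newleftreview.org/login
grep -i -o '<form[^>]*>' login.html
grep -i -o '<input[^>]*>' login.html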
To download all files:
Code:
wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 3 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org
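Before you let the full crawl loose, it doesn't hurt to check that the saved cookies actually log you in -- for example, grab one paywalled article page (the URL here is just a placeholder) and make sure what comes back isn't the login form again:
Code:
wget --load-cookies cookies.txt --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" -O test.html https://newleftreview.org/SOME-ARTICLE-URL
# if this prints anything other than 0, you probably got bounced back to the login form
grep -c do_login test.html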
Notice the wait. It is really impolite to hit up a website with hundreds of requests while you try to download every page on it. Also, they have your login information, so they may ban you. The wait is random, in order to confuse them, and wget pretends to be Firefox.

I just want to point out that this will by necessity take some time, but at least you don't have to pay attention and do a lot of clicking. Just run wget overnight, or while you are out for the day, or run it and pay attention to something else.

Also, hitting up websites in an automated way doesn't usually make them thrilled. I have included the configuration options which make wget go slowly and pretend not to be an automated tool, which I consider a wise move -- it severely lessens the chance that a smart webmaster will notice you, cancel your account, IP-block you, and send their SWAT team after you, your family, and everyone you know. This may or may not be worth it to you, but at the very least, it should be instructive.

EDIT: In order to save a bit on the recursion counts, I downloaded the front page and did some magic with an old CLI favorite, "sed". Basically, I regexed the hell out of it to assemble a list of URLs to each magazine issue. I'm afraid that, since every page has a full list of issues, you will still have to follow 328 links on each page, then follow them again. RAM usage might get ridiculous as wget tries to keep track of all those recursive URLs. Or it might not, since I assume the wget developers had that concern in mind when they created recursive mode. I have never tried large feats, so I wouldn't know.

Download the attached index.txt, open a command prompt in the same directory as index.txt, and then use this command:
Code:
wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 2 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" --input-file index.txt

Last edited by eschwartz; 08-27-2015 at 02:45 AM.
#20
Connoisseur
Posts: 70
Karma: 1800048
Join Date: Oct 2014
Device: BooX M96
mithrodar,
yes, I am also using Zotero, and it is great for certain websites -- for example, it renames academic PDFs -- but I can't get it to work on ProQuest, which is a shame, as it would give me an option to batch download PDFs easily ...

eschwartz, WOW, thanks for the detailed (and noob friendly) reply, I will try and get it to work for me, thanks for all the great tips !!! I will tell you if I get it to work ...

OMG, I'm using your commands on another website, and it's working like magic !! I owe you a big one, 1000 thanks !!

One question: do you think it would be risky to use a university VPN connection to access this (or another) site with the above commands?

And maybe one more question, if you have the time and it's not too much bother: could you give some details on how you "regexed the hell out of it to assemble a list of URLs to each magazine issue"? I am trying to do this on another site (for free content) -- and would this work in Windows?

Last edited by bobodude; 08-29-2015 at 05:03 AM.
#21
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
University VPN connections -- I can't imagine why they would have a problem, unless they have a rule against using too much bandwidth in a small span of time?
Assuming you tell wget to wait in order to protect yourself from the wrath of the website's webmaster, you should also be protected from the wrath of a university VPN admin.

The basic idea is to always pretend to be the Firefox browser... nice and innocuous... and to balance the need for immediacy against the need to stagger your downloads down to a human-compatible "click" rate. The two together mean that in the event someone dislikes people using wget (which is not actually a given), then "better safe than sorry" -- they won't know anyway.

As for how I regexed the hell out of it: it will be different for every site, and the basic idea is to learn how to write regexes -- this website is the one that taught me, I like their noob-friendly explanations. They can explain better than I can. Regular expressions (regexes) are a powerful tool for cutting apart and putting back together text, and you'd be surprised at how useful they can be in general; e.g. LibreOffice allows regular expressions in Find and Replace. There are various programs that can perform regexes -- the tools I happen to use are sed, or doing it from within vim -- which probably won't help you much. A quick Google search turns up several applications and sites that can regex text files or copy-pasted info. The makers of the above tutorial also have a regex program.

In this specific case, I took a look at the HTML of the front page, found a loooooooooong block of text that had links to the separate issues, deleted everything above and below the block, and then ran a couple of regexes (which I didn't bother to remember) that progressively cleaned it until there was one plain URL per line.
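If you want to script that step instead of hand-editing, something along these lines does the same job (this is just an illustration, not the exact commands I ran -- the /issues/ pattern is a guess at what the links look like, so check the real HTML first):
Code:
# grab the front page, pull out every href, keep the ones that look like issue links,
# turn relative paths into full URLs, and de-duplicate into index.txt
wget -O front.html https://newleftreview.org
grep -o 'href="[^"]*"' front.html \
  | sed -e 's/^href="//' -e 's/"$//' \
  | grep '/issues/' \
  | sed 's|^/|https://newleftreview.org/|' \
  | sort -u > index.txt
On Windows you would need these tools from something like Cygwin, or do the equivalent find-and-replace passes in a regex-capable text editor.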
Last edited by eschwartz; 08-30-2015 at 01:37 AM.
#22
Connoisseur
Posts: 70
Karma: 1800048
Join Date: Oct 2014
Device: BooX M96
Thanks for the additional info, I'll look into it !!!
And thanks again for all the tips you posted, I really learnt a lot !!! And things I've been wanting to know for a while ...
#23
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Happy to be of service.
And of course feel free to ask any questions you might have as you figure things out.
#24
Connoisseur
Posts: 70
Karma: 1800048
Join Date: Oct 2014
Device: BooX M96
As a matter of fact, there was one more thing (I wasn't sure if I'd be asking for too much info ...):
To download articles through wget from the New Left Review, I need to be identified as being connected through a university internet connection. I was thinking of doing this through a wget proxy connection, and googled this and found a couple of wget commands, but haven't had any luck so far. So if there is a command you know of that works for you, that would be great, or if you know of another way ...

I have seen that one can alter the wgetrc file to configure it to use a proxy server; however, as mentioned earlier, I can't seem to find this file ...

thanks again !!!
#25
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Quote:
Although I already detailed how to masquerade as Firefox, so they would never know. But simply using the university proxy would be easier.

The standard way to use a proxy is:
Code:
set https_proxy=https://proxy.server.com

Many programs know how to obey this environment variable, including wget. You can also specify a wgetrc file using:
Code:
wget --config C:\path\to\config\file [more options and websites and stuff]
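For reference, a minimal config file for that --config switch (or a ~/.wgetrc) could look something like this -- the proxy address is a placeholder, use whatever your university actually gives you:
Code:
# sample wgetrc: route wget's traffic through a proxy
use_proxy = on
http_proxy = http://proxy.example.edu:8080/
https_proxy = http://proxy.example.edu:8080/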
Last edited by eschwartz; 09-01-2015 at 05:12 PM.
#26
Connoisseur
Posts: 70
Karma: 1800048
Join Date: Oct 2014
Device: BooX M96
Thanks again !!!
(no more questions for a while, promise ...)