Quote:
Originally Posted by bobodude
Yes, that's the one,
hmm, I'm quite new to this, I mirrored part of the site, and it seems it is organized by issue, with html pages sorted accordingly, but I haven't had luck with the pdf's ...
Well, they don't have an index of the PDFs available, which makes this difficult. However, each page does link to the full list of subscriptions.
I also notice that they are behind a paywall, which means you will have to log in; that is always fun with scripts.
To log in, you will need the following command:
Code:
wget --save-cookies=cookies.txt --keep-session-cookies --post-data="username=MYUSERNAME&password=MYPASSWORD" --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org/customer/do_login --delete-after
Replace MYUSERNAME and MYPASSWORD with the obvious.
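If you want to sanity-check that the login actually worked, the session cookies should now be sitting in cookies.txt. If you have grep handy, a quick look will show them (the exact cookie names depend on their site, so this is just a rough check; otherwise just open cookies.txt in a text editor):
Code:
grep newleftreview cookies.txt
If that prints a line or two mentioning the site's domain, you are logged in. If cookies.txt is empty, the username/password or the form field names are probably wrong.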
How to figure out the right command:
Spoiler:
By examining the login page with Firefox's Inspect Element, I found the login form: it POSTs data to "https://newleftreview.org/customer/do_login" and uses fields named "username" and "password".
So those are the values for --post-data.
--save-cookies and --keep-session-cookies ensure that the cookies carrying the login session are saved to the file cookies.txt.
--user-agent makes wget pretend to be Firefox 40 running on Windows 7.
--delete-after deletes the downloaded login page, which we don't need to keep.
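If you'd rather not dig through the browser tools, you can also pull the page down and look at the form markup directly. Something along these lines should work (this assumes the login form sits on the front page; if it lives on a separate login page, point wget at that URL instead):
Code:
wget -qO- https://newleftreview.org | grep -i -A 5 '<form'
The action= attribute of the form gives you the URL to POST to, and the name= attributes on the input fields give you the field names for --post-data.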
To download all files:
Code:
wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 3 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org
What this does: starting from the front page, it recursively follows links up to 3 levels deep, keeps only the PDFs, loads your saved login cookies, and waits roughly 5 seconds between requests.
Notice the wait. It is there because it is really impolite to hit a website with hundreds of requests while you try to download every page on it. Also, they have your login information, so they may ban you.

The wait is random, in order to make the traffic look less like a script. Additionally, wget pretends to be Firefox.
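By the way, if you end up running several of these commands, you can drop the politeness settings into a wgetrc file so you don't have to retype them every time. These are standard wgetrc options as far as I know, but check the manual for your wget version to be sure:
Code:
# ~/.wgetrc: defaults applied to every wget run (the file's location differs on Windows)
wait = 5
random_wait = on
user_agent = Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0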
I just want to point out that this will, by necessity, take some time, but at least you don't have to pay attention or do a lot of clicking. Just run wget overnight, while you are out for the day, or while you pay attention to something else.
Also, hitting up websites in an automated way doesn't usually thrill them. I have included the options that make wget go slowly and pretend not to be an automated tool, which I consider a wise move: it severely lessens the chance that an alert webmaster will notice you, cancel your account, IP-block you, and send their SWAT team after you, your family, and everyone you know

just kidding.
This may or may not be worth it to you, but at the very least, it should be instructive.
EDIT: In order to save a bit on the recursion depth, I downloaded the front page and did some magic with an old CLI favorite, "sed". Basically, I regexed the hell out of it to assemble a list of URLs, one for each magazine issue.
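For the curious, the idea was roughly this. I used sed, but a grep one-liner along these lines produces the same sort of list (the exact pattern depends on how their HTML is written, so treat the regex as a placeholder):
Code:
wget -qO- https://newleftreview.org | grep -oE 'https?://newleftreview\.org/[^"]+' | sort -u > index.txt
That dumps every link on the front page into index.txt, one URL per line; you then trim it down to just the issue pages.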

I'm afraid that, since every page carries the full list of issues, wget will still end up following the 328 links on each page, and then following them again.

But at least it only costs time.
RAM usage might get ridiculous as wget tries to keep track of all those recursive URLs. Or it might not, since I assume the wget developers had that concern in mind when they wrote the recursive mode.
I have never tried anything this large, so I wouldn't know.
Download the attached index.txt, open a command prompt in the same directory as index.txt, and then use this command:
Code:
wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 2 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" --input-file index.txt
I dropped --level to 2 and used --input-file so that wget reads its list of starting pages from index.txt instead of crawling from the front page.