Old 08-27-2015, 02:14 AM   #19
eschwartz
Ex-Helpdesk Junkie
 
 
Quote:
Originally Posted by bobodude
Yes, that's the one.

Hmm, I'm quite new to this. I mirrored part of the site, and it seems it is organized by issue, with HTML pages sorted accordingly, but I haven't had any luck with the PDFs...
Well, they don't have an index of the PDFs available, which makes it difficult. However, each page does link to the full list of subscriptions.

And I notice that they are behind a paywall, which means you will have to log in -- always fun with scripts.

To log in, you will need the following command:
Code:
wget --save-cookies=cookies.txt --keep-session-cookies --post-data="username=MYUSERNAME&password=MYPASSWORD" --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org/customer/do_login --delete-after
Replace MYUSERNAME and MYPASSWORD with the obvious.

How to figure out the right command:
Spoiler:

By examining the login page with Firefox's Inspect Element, I saw the login form -- it "POST"s its data to "https://newleftreview.org/customer/do_login", and uses fields named "username" and "password".
So those are the values for --post-data.
--save-cookies and --keep-session-cookies ensure that the session cookies holding the login are saved to the file cookies.txt
--user-agent makes wget pretend to be Firefox 40, running on Windows 7
--delete-after deletes the downloaded login page, which we don't want to keep.
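
If you want to check that the login actually took before starting the big download, one quick sanity test is to re-fetch the front page with the saved cookies and look for something that only shows up when you are signed in. I'm guessing at a "logout" link here -- peek at the page yourself for a better marker (and if you don't have grep handy, just open check.html and look):
Code:
wget --load-cookies cookies.txt --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" -O check.html https://newleftreview.org
grep -i "logout" check.html
If grep finds nothing, the credentials or the form field names probably need another look in Inspect Element.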


To download all files:
Code:
wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 3 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org
What this does:
Spoiler:

--trust-server-names makes wget save files under the name from the final URL after any redirects.
--content-disposition makes wget use the Content-Disposition header to determine the filename to save as.
Only one of the two is usually needed, but which one depends on the website and its download mechanism, so I included both.

--load-cookies makes wget read the saved login info from the cookies.txt file we created in the last step.
--recursive means we follow all links and download those pages too.
--level 3 limits the recursion to three hops: the first hop reaches each magazine issue, the second reaches the articles linked from each issue, and the third reaches the PDF linked from each article.
--accept pdf,PDF keeps only files ending in .pdf or .PDF -- the HTML pages are still fetched so their links can be followed, but they are deleted afterwards.
--wait tells wget to take a five-second break between downloads.
--random-wait varies that wait randomly between 0.5 and 1.5 times its value, i.e. between 2.5 and 7.5 seconds.
--user-agent makes wget pretend to be Firefox 40, running on Windows 7
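
If you would like a preview of what wget is going to crawl before committing to the full download, my understanding is that adding --spider to the same command makes wget walk the links and report what it finds without actually saving the PDFs. It still hits the server for every page, so keep the waits. Something like this -- the grep at the end just filters the log down to the lines mentioning PDFs:
Code:
wget --spider --load-cookies cookies.txt --recursive --level 3 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org 2>&1 | grep -i "\.pdf"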


Notice the wait. Hammering a website with hundreds of rapid-fire requests while you try to download every page on it is really impolite. They also have your login information, so they can ban your account.
The wait is randomized so the traffic looks less mechanical, and on top of that wget pretends to be Firefox.
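
If you want to be even gentler on their server (or spare your own connection), wget can also cap the download speed with --limit-rate. This is just the same download command with a throttle added -- the 200k figure is an arbitrary choice of mine, not something the site asks for:
Code:
wget --limit-rate=200k --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 3 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org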


I just want to point out that this will, by necessity, take some time, but at least you don't have to pay attention or do a lot of clicking. Just run wget overnight, while you are out for the day, or while you pay attention to something else.
Also, hitting up websites in an automated way doesn't usually make them thrilled. I have included the configuration options that make wget go slowly and pretend not to be an automated tool, which I consider a wise move -- it severely lessens the chance that a sharp-eyed webmaster will notice you and cancel your account and IP-block you and send their SWAT team after you, your family, and everyone you know (just kidding).


This may or may not be worth it to you, but at the very least, it should be instructive.

EDIT: In order to save a bit on the recursion depth, I downloaded the front page and did some magic with an old CLI favorite, "sed". Basically, I regexed the hell out of it to assemble a list of URLs, one per magazine issue.
I'm afraid that, since every page carries the full list of issues, wget will still follow 328 links on each page, then follow them again -- but at least that only costs time.
RAM usage might get ridiculous as it tries to keep track of all those recursive URLs. Or it might not, since I assume the wget developers had that concern in mind when they wrote the recursive mode.
I have never tried a crawl this large, so I wouldn't know.
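
For the curious, the "sed magic" was roughly along these lines. Treat it as a sketch rather than the exact command I ran -- in particular, the final grep is my guess at how to keep only the issue links, and the real page may need a tighter pattern:
Code:
# grab the front page (logged in, so the full issue list shows up)
wget --load-cookies cookies.txt -O frontpage.html https://newleftreview.org
# pull out every href, strip the quoting, keep links on the site, de-duplicate
grep -o 'href="[^"]*"' frontpage.html | sed -e 's/^href="//' -e 's/"$//' | grep 'newleftreview\.org' | sort -u > index.txt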


Download the attached index.txt and open a command prompt in the same directory as the index.txt -- then use this command:

Code:
wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 2 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" --input-file index.txt
I dropped --level to 2, and used --input-file so wget reads its list of starting pages from index.txt instead of crawling outward from the front page.
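
Since I suggested running it overnight: on Linux or OS X you can start it like this so it keeps running after you close the terminal and writes its progress to a log file you can check later (on Windows, just leave the command window open instead). The log file name is my own choice, nothing special:
Code:
nohup wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 2 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" --input-file index.txt -o wget.log &
Then "tail -f wget.log" shows you how far along it is.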
Attached Files
index.txt (10.2 KB)

Last edited by eschwartz; 08-27-2015 at 02:45 AM.