Old 08-27-2015, 02:14 AM   #19
eschwartz
Ex-Helpdesk Junkie
 
 
Quote:
Originally Posted by bobodude
Yes, that's the one.

Hmm, I'm quite new to this. I mirrored part of the site, and it seems it is organized by issue, with HTML pages sorted accordingly, but I haven't had any luck with the PDFs...
Well, they don't have an index of the PDFs available, which makes it difficult. However, each page does link to the full list of subscriptions.

And I notice that they are behind a paywall, which means you will have to log in -- always fun with scripts.

To log in, you will need the following command:
Code:
wget --save-cookies=cookies.txt --keep-session-cookies --post-data="username=MYUSERNAME&password=MYPASSWORD" --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org/customer/do_login --delete-after
Replace MYUSERNAME and MYPASSWORD with the obvious.

How to figure out the right command:
Spoiler:

By examining the login page with Firefox's Inspect Element, I saw the login form -- it "POST"s its data to "https://newleftreview.org/customer/do_login", and uses fields named "username" and "password".
So those are the values for --post-data.
--save-cookies and --keep-session-cookies ensure that the session cookies holding the login are saved to the file cookies.txt
--user-agent makes wget pretend to be Firefox 40, running on Windows 7
--delete-after deletes the downloaded login page, which we don't want to keep.
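
If you want to check that the login actually took before starting the big download, one quick sanity test is to re-fetch the front page with the saved cookies and look for something that only shows up when you are signed in. I'm guessing at a "logout" link here -- peek at the page yourself for a better marker (and if you don't have grep handy, just open check.html and look):
Code:
wget --load-cookies cookies.txt --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" -O check.html https://newleftreview.org
grep -i "logout" check.html
If grep finds nothing, the credentials or the form field names probably need another look in Inspect Element.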


To download all files:
Code:
wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 3 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org
What this does:
Spoiler:

--trust-server-names makes wget save files under the name from the final URL after any redirects.
--content-disposition makes wget use the Content-Disposition header to determine the filename to save as.
Only one of the two is usually needed, but which one depends on the website and its download mechanism, so I included both.

--load-cookies makes wget read the saved login info from the cookies.txt file we created in the last step.
--recursive means we follow all links and download those pages too.
--level 3 limits the recursion to three hops: the first hop reaches each magazine issue, the second reaches the articles linked from each issue, and the third reaches the PDF linked from each article.
--accept pdf,PDF keeps only files ending in .pdf or .PDF -- the HTML pages are still fetched so their links can be followed, but they are deleted afterwards.
--wait tells wget to take a five-second break between downloads.
--random-wait varies that wait randomly between 0.5 and 1.5 times its value, i.e. between 2.5 and 7.5 seconds.
--user-agent makes wget pretend to be Firefox 40, running on Windows 7
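
If you would like a preview of what wget is going to crawl before committing to the full download, my understanding is that adding --spider to the same command makes wget walk the links and report what it finds without actually saving the PDFs. It still hits the server for every page, so keep the waits. Something like this -- the grep at the end just filters the log down to the lines mentioning PDFs:
Code:
wget --spider --load-cookies cookies.txt --recursive --level 3 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org 2>&1 | grep -i "\.pdf"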


Notice the wait. Hammering a website with hundreds of rapid-fire requests while you try to download every page on it is really impolite. They also have your login information, so they can ban your account.
The wait is randomized so the traffic looks less mechanical, and on top of that wget pretends to be Firefox.
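
If you want to be even gentler on their server (or spare your own connection), wget can also cap the download speed with --limit-rate. This is just the same download command with a throttle added -- the 200k figure is an arbitrary choice of mine, not something the site asks for:
Code:
wget --limit-rate=200k --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 3 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org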


I just want to point out that this will, by necessity, take some time, but at least you don't have to pay attention or do a lot of clicking. Just run wget overnight, while you are out for the day, or while you pay attention to something else.
Also, hitting up websites in an automated way doesn't usually make them thrilled. I have included the configuration options that make wget go slowly and pretend not to be an automated tool, which I consider a wise move -- it severely lessens the chance that a sharp-eyed webmaster will notice you and cancel your account and IP-block you and send their SWAT team after you, your family, and everyone you know (just kidding).


This may or may not be worth it to you, but at the very least, it should be instructive.

EDIT: In order to save a bit on the recursion depth, I downloaded the front page and did some magic with an old CLI favorite, "sed". Basically, I regexed the hell out of it to assemble a list of URLs, one per magazine issue.
I'm afraid that, since every page carries the full list of issues, wget will still follow 328 links on each page, then follow them again -- but at least that only costs time.
RAM usage might get ridiculous as it tries to keep track of all those recursive URLs. Or it might not, since I assume the wget developers had that concern in mind when they wrote the recursive mode.
I have never tried a crawl this large, so I wouldn't know.
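
For the curious, the "sed magic" was roughly along these lines. Treat it as a sketch rather than the exact command I ran -- in particular, the final grep is my guess at how to keep only the issue links, and the real page may need a tighter pattern:
Code:
# grab the front page (logged in, so the full issue list shows up)
wget --load-cookies cookies.txt -O frontpage.html https://newleftreview.org
# pull out every href, strip the quoting, keep links on the site, de-duplicate
grep -o 'href="[^"]*"' frontpage.html | sed -e 's/^href="//' -e 's/"$//' | grep 'newleftreview\.org' | sort -u > index.txt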


Download the attached index.txt and open a command prompt in the same directory as the index.txt -- then use this command:

Code:
wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 2 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" --input-file index.txt
I dropped --level to 2, and used --input-file so wget reads its list of starting pages from index.txt instead of crawling outward from the front page.
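
Since I suggested running it overnight: on Linux or OS X you can start it like this so it keeps running after you close the terminal and writes its progress to a log file you can check later (on Windows, just leave the command window open instead). The log file name is my own choice, nothing special:
Code:
nohup wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 2 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" --input-file index.txt -o wget.log &
Then "tail -f wget.log" shows you how far along it is.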
Attached Files
index.txt (10.2 KB)

Last edited by eschwartz; 08-27-2015 at 02:45 AM.