#16
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Show me an example page that lists all the PDFs you want.
Is the site you mentioned above http://newleftreview.org? I don't see any sort of main index page.
#17
Connoisseur
Posts: 70
Karma: 1800048
Join Date: Oct 2014
Device: BooX M96
Yes, that's the one.
Hmm, I'm quite new to this. I mirrored part of the site, and it seems it is organized by issue, with HTML pages sorted accordingly, but I haven't had any luck with the PDFs ...
#18
Banned
Posts: 8
Karma: 3234370
Join Date: Aug 2015
Device: Calibre
Quote:
Where from? I suppose the Zotero plugin may help: https://www.zotero.org/download/
#19
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Quote:
And I notice that they are behind a paywall, which means you will have to log in -- always fun with scripts.

To log in, you will need the following command:
Code:
wget --save-cookies=cookies.txt --keep-session-cookies --post-data="username=MYUSERNAME&password=MYPASSWORD" --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org/customer/do_login --delete-after

How to figure out the right command:
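Roughly (this is only a sketch -- the login-page URL below is a guess, and other sites will use different field names): save the site's login page and look at its form and input tags to see where the form posts to and what the fields are called, then plug those into --post-data.
Code:
# fetch the login page, then list its <form> and <input> tags
wget -O login.html https://newleftreview.org/login
grep -i -o '<form[^>]*>' login.html
grep -i -o '<input[^>]*>' login.html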
To download all files:
Code:
wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 3 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org
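Before you let the full crawl loose, it doesn't hurt to check that the saved cookies actually log you in -- for example, grab one paywalled article page (the URL here is just a placeholder) and make sure what comes back isn't the login form again:
Code:
wget --load-cookies cookies.txt --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" -O test.html https://newleftreview.org/SOME-ARTICLE-URL
# if this prints anything other than 0, you probably got bounced back to the login form
grep -c do_login test.html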
Notice the wait. It is really impolite to hit up a website with hundreds of requests while you try to download every page on it. Also, they have your login information, so they may ban you. The wait is random, in order to confuse them, and wget pretends to be Firefox.

I just want to point out that this will by necessity take some time, but at least you don't have to pay attention and do a lot of clicking. Just run wget overnight, or while you are out for the day, or run it and pay attention to something else.

Also, hitting up websites in an automated way doesn't usually make them thrilled. I have included the configuration options which make wget go slowly and pretend not to be an automated tool, which I consider a wise move -- it severely lessens the chance that a smart webmaster will notice you, cancel your account, IP-block you, and send their SWAT team after you, your family, and everyone you know. This may or may not be worth it to you, but at the very least, it should be instructive.

EDIT: In order to save a bit on the recursion counts, I downloaded the front page and did some magic with an old CLI favorite, "sed". Basically, I regexed the hell out of it to assemble a list of URLs to each magazine issue. I'm afraid that, since every page has a full list of issues, you will still have to follow 328 links on each page, then follow them again. RAM usage might get ridiculous as wget tries to keep track of all those recursive URLs. Or it might not, since I assume the wget developers had that concern in mind when they created recursive mode. I have never tried large feats, so I wouldn't know.

Download the attached index.txt, open a command prompt in the same directory as index.txt, and then use this command:
Code:
wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 2 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" --input-file index.txt

Last edited by eschwartz; 08-27-2015 at 02:45 AM.
#20
Connoisseur
Posts: 70
Karma: 1800048
Join Date: Oct 2014
Device: BooX M96
mithrodar,
yes, I am also using Zotero, and it is great for certain websites -- for example, it renames academic PDFs -- but I can't get it to work on ProQuest, which is a shame, as it would give me an option to batch download PDFs easily ...

eschwartz, WOW, thanks for the detailed (and noob friendly) reply, I will try and get it to work for me, thanks for all the great tips !!! I will tell you if I get it to work ...

OMG, I'm using your commands on another website, and it's working like magic !! I owe you a big one, 1000 thanks !!

One question: do you think it would be risky to use a university VPN connection to access this (or another) site with the above commands?

And maybe one more question, if you have the time and it's not too much bother: could you give some details on how you "regexed the hell out of it to assemble a list of URLs to each magazine issue"? I am trying to do this on another site (for free content) -- and would this work in Windows?

Last edited by bobodude; 08-29-2015 at 05:03 AM.
#21
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
University VPN connections -- I can't imagine why they would have a problem, unless they have a rule against using too much bandwidth in a small span of time?
Assuming you tell wget to wait in order to protect yourself from the wrath of the website's webmaster, you should also be protected from the wrath of a university VPN admin.

The basic idea is to always pretend to be the Firefox browser... nice and innocuous... and to balance the need for immediacy against the need to stagger your downloads down to a human-compatible "click" rate. The two together mean that in the event someone dislikes people using wget (which is not actually a given), then "better safe than sorry" -- they won't know anyway.

As for how I regexed the hell out of it: it will be different for every site, and the basic idea is to learn how to write regexes -- this website is the one that taught me, I like their noob-friendly explanations. They can explain better than I can. Regular expressions (regexes) are a powerful tool for cutting apart and putting back together text, and you'd be surprised at how useful they can be in general; e.g. LibreOffice allows regular expressions in Find and Replace. There are various programs that can perform regexes -- the tools I happen to use are sed, or doing it from within vim -- which probably won't help you much. A quick Google search turns up several applications and sites that can regex text files or copy-pasted info. The makers of the above tutorial also have a regex program.

In this specific case, I took a look at the HTML of the front page, found a loooooooooong block of text that had links to the separate issues, deleted everything above and below the block, and then ran a couple of regexes (which I didn't bother to remember) that progressively cleaned it until there was one plain URL per line.
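If you want to script that step instead of hand-editing, something along these lines does the same job (this is just an illustration, not the exact commands I ran -- the /issues/ pattern is a guess at what the links look like, so check the real HTML first):
Code:
# grab the front page, pull out every href, keep the ones that look like issue links,
# turn relative paths into full URLs, and de-duplicate into index.txt
wget -O front.html https://newleftreview.org
grep -o 'href="[^"]*"' front.html \
  | sed -e 's/^href="//' -e 's/"$//' \
  | grep '/issues/' \
  | sed 's|^/|https://newleftreview.org/|' \
  | sort -u > index.txt
On Windows you would need these tools from something like Cygwin, or do the equivalent find-and-replace passes in a regex-capable text editor.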
Last edited by eschwartz; 08-30-2015 at 01:37 AM.
#22
Connoisseur
Posts: 70
Karma: 1800048
Join Date: Oct 2014
Device: BooX M96
Thanks for the additional info, I'll look into it !!!
And thanks again for all the tips you posted, I really learnt a lot !!! And things I've been wanting to know for a while ...
#23
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Happy to be of service.
And of course feel free to ask any questions you might have as you figure things out.
#24
Connoisseur
Posts: 70
Karma: 1800048
Join Date: Oct 2014
Device: BooX M96
As a matter of fact, there was one more thing (I wasn't sure if I'd be asking for too much info ...):
To download articles through wget from the New Left Review, I need to be identified as being connected through a university internet connection. I was thinking of doing this through a wget proxy connection, and googled this and found a couple of wget commands, but haven't had any luck so far. So if there is a command you know of that works for you, that would be great, or if you know of another way ...

I have seen that one can alter the wgetrc file to configure it to use a proxy server; however, as mentioned earlier, I can't seem to find this file ...

thanks again !!!
#25
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Quote:
Although I already detailed how to masquerade as Firefox, so they would never know. But simply using the university proxy would be easier.

The standard way to use a proxy is:
Code:
set https_proxy=https://proxy.server.com

Many programs know how to obey this environment variable, including wget. You can also specify a wgetrc file using:
Code:
wget --config C:\path\to\config\file [more options and websites and stuff]
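For reference, a minimal config file for that --config switch (or a ~/.wgetrc) could look something like this -- the proxy address is a placeholder, use whatever your university actually gives you:
Code:
# sample wgetrc: route wget's traffic through a proxy
use_proxy = on
http_proxy = http://proxy.example.edu:8080/
https_proxy = http://proxy.example.edu:8080/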
Last edited by eschwartz; 09-01-2015 at 05:12 PM.
#26
Connoisseur
Posts: 70
Karma: 1800048
Join Date: Oct 2014
Device: BooX M96
Thanks again !!!
(no more questions for a while, promise ...)