MobileRead Forums > E-Book General > General Discussions
Old 08-24-2015, 11:02 PM   #16
eschwartz
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Show me an example page that lists all the PDFs you want.

Is the site you mentioned above http://newleftreview.org? I don't see any sort of main index page.
Old 08-25-2015, 03:54 PM   #17
bobodude
Connoisseur
Posts: 70
Karma: 1800048
Join Date: Oct 2014
Device: BooX M96
Yes, that's the one.

Hmm, I'm quite new to this. I mirrored part of the site, and it seems to be organized by issue, with HTML pages sorted accordingly, but I haven't had any luck with the PDFs ...
Old 08-26-2015, 06:31 PM   #18
mitrodhar
Banned
Posts: 8
Karma: 3234370
Join Date: Aug 2015
Device: Calibre
Quote:
Originally Posted by bobodude View Post
I've been looking around the web for a simple and fast way to download many PDFs from a website, but haven't found a solution I am able to figure out.

Does anyone know of a simple way to do this?

Thanks!

Where from? I suppose the Zotero plugin may help:
https://www.zotero.org/download/
Old 08-27-2015, 02:14 AM   #19
eschwartz
Ex-Helpdesk Junkie
Quote:
Originally Posted by bobodude View Post
Yes, that's the one.

Hmm, I'm quite new to this. I mirrored part of the site, and it seems to be organized by issue, with HTML pages sorted accordingly, but I haven't had any luck with the PDFs ...
Well, they don't have an index of the PDFs available, which makes it difficult. However, each page does link to the full list of issues.

And I notice that the PDFs are behind a paywall, which means you will have to log in, which is always fun with scripts.

To log in, you will need the following command:
Code:
wget --save-cookies=cookies.txt --keep-session-cookies --post-data="username=MYUSERNAME&password=MYPASSWORD" --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org/customer/do_login --delete-after
Replace MYUSERNAME and MYPASSWORD with your own username and password.

How to figure out the right command:
Spoiler:

By examining the login page using Firefox's Inspect Element, I saw the login form: it POSTs data to "https://newleftreview.org/customer/do_login", and uses fields with the names "username" and "password".
So, those are the values for --post-data.
--save-cookies and --keep-session-cookies ensure that the necessary web cookies with the login data are saved to the file cookies.txt
--user-agent makes wget pretend to be Firefox 40, running on Windows 7
--delete-after deletes the downloaded login page, which we don't want to keep.
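
Before starting a long download, it may be worth checking that the login actually saved something. A minimal sketch, assuming cookies.txt is in the usual Netscape format (seven tab-separated fields per cookie). The sample file and the cookie name "sessionid" below are made up for illustration; the real site's cookies will be named differently.

```shell
# Hypothetical sample file standing in for the cookies.txt that wget saves.
printf '# HTTP cookie file.\n\nnewleftreview.org\tFALSE\t/\tFALSE\t0\tsessionid\tabc123\n' > sample-cookies.txt

# Skip comment and blank lines, then print the domain and name of each cookie.
awk -F '\t' '!/^#/ && NF == 7 { print $1, $6 }' sample-cookies.txt > cookie-names.txt
cat cookie-names.txt
```

Run the same awk line against your real cookies.txt. If it prints nothing, the login probably did not work and the download step would only fetch the public pages.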


To download all files:
Code:
wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 3 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" https://newleftreview.org
What this does:
Spoiler:

--trust-server-names makes wget use the filename of a redirect when saving.
--content-disposition makes wget use the Content-Disposition header to determine the filename to save as.
Only one of these is necessary, but depending on the website and their download mechanism, it might be either one.

--load-cookies will make wget read the saved login info from the cookies.txt file we saved in the last step.
--recursive means we follow all links and download those too.
--level 3 makes wget follow links up to three levels deep. The first level downloads each magazine issue, the second level downloads the articles linked from each issue, and the third level downloads the PDF linked from each article.
--wait 5 tells wget to take a five-second break in between downloads.
--random-wait makes wget vary that wait at random between 0.5 and 1.5 times its value, i.e. between 2.5 and 7.5 seconds.
--user-agent makes wget pretend to be Firefox 40, running on Windows 7


Notice the wait. This is because it is really impolite to hit a website with hundreds of requests while you try to download every page on it. Also, they have your login information, so they may ban you.
The wait is random so that the requests look less like an automated tool. Additionally, wget pretends to be Firefox.


I just want to point out that this will by necessity take some time, but at least you don't have to pay attention or do a lot of clicking. Just run wget overnight, or while you go out for the day, or just run it and pay attention to something else.
Also, websites are not usually thrilled about being hit in an automated way. I have included the configuration options which make wget go slowly and pretend not to be an automated tool, which I consider a wise move: it greatly lessens the chance that a smart webmaster will notice you, cancel your account, IP-block you, and send their SWAT team after you, your family, and everyone you know (just kidding).


This may or may not be worth it to you, but at the very least, it should be instructive.






EDIT: In order to save a bit on the recursion counts, I downloaded the front page, and did some magic with an old CLI favorite, "sed". Basically, I regexed the hell out of it to assemble a list of URLs to each magazine issue.
I'm afraid that, since every page has a full list of issues, wget will still have to follow 328 links on each page, then follow them again. But at least it only costs time.
RAM usage might get ridiculous as it tries to keep track of all those recursive URLs. Or it might not, since I assume the wget developers had that concern in mind when they created a recursive mode.
I have never tried anything this large, so I wouldn't know.
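
To get a feel for "it only costs time", here is a rough back-of-the-envelope sketch. It counts only the pauses for the 328 issue links, using the --wait 5 and --random-wait settings from the commands above, and ignores transfer time and every article and PDF request, so the real run will be considerably longer.

```shell
requests=328   # number of issue links mentioned above
wait=5         # value passed to --wait

# --random-wait varies each pause between 0.5x and 1.5x the --wait value.
shortest=$(( requests * wait / 2 ))
average=$(( requests * wait ))
longest=$(( requests * wait * 3 / 2 ))

echo "pauses alone: ${shortest}s to ${longest}s, about ${average}s on average"
```

On average that is a little under half an hour of waiting for the issue pages alone, before any articles are fetched.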


Download the attached index.txt and open a command prompt in the same directory as the index.txt -- then use this command:

Code:
wget --trust-server-names --content-disposition --load-cookies cookies.txt --recursive --level 2 --accept pdf,PDF --wait 5 --random-wait --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0" --input-file index.txt
I dropped --level to 2, and used --input-file to make wget start from the list of webpages in the index.txt file.
Attached Files
File Type: txt index.txt (10.2 KB, 119 views)

Last edited by eschwartz; 08-27-2015 at 02:45 AM.
Old 08-29-2015, 04:21 AM   #20
bobodude
Connoisseur
mitrodhar,

Yes, I am also using Zotero, and it is great for certain websites (for example, it renames academic PDFs), but I can't get it to work on ProQuest, which is a shame, as it would give me an option to batch download PDFs easily ...

eschwartz,

WOW, thanks for the detailed (and noob-friendly) reply. I will try to get it to work for me, thanks for all the great tips !!!

I will tell you if I get it to work ...

OMG, I'm using your commands on another website, and it's working like magic !!
I owe you a big one, 1000 thanks !!

One question: do you think it would be risky to use a university VPN connection to access this (or another) site using the above commands?

And maybe one more question, if you have the time and it's not too much bother: could you give some details on how you "regexed the hell out of it to assemble a list of URLs to each magazine issue"?
I am trying to do this on another site (for free content). Would this work in Windows?

Last edited by bobodude; 08-29-2015 at 05:03 AM.
Old 08-30-2015, 01:22 AM   #21
eschwartz
Ex-Helpdesk Junkie
University VPN connections: I can't imagine why they would have a problem, unless they have a rule against using too much bandwidth in a short span of time.
Assuming you tell wget to wait in order to protect yourself from the wrath of the website's webmaster, you should also be protected from the wrath of a university VPN admin.
The basic idea is to always pretend to be the Firefox browser... nice and innocuous... and to balance the need for immediacy against the need to stagger your downloads down to a human-compatible "click" rate. With the two together, in the event that someone dislikes people using wget (which is not actually a given), it's "better safe than sorry": they won't know anyway.

As for how I regexed the hell out of it, it will be different for every site. The basic idea is to learn how to write regexes. This website is the one that taught me, and I like their noob-friendly explanations: http://regular-expressions.info
They can explain better than I can.
Regular expressions (regexes) are a powerful tool for cutting text apart and putting it back together, and you'd be surprised at how useful they can be in general, e.g. LibreOffice allows regular expressions in Find and Replace.

There are various programs that can apply regexes. The tools I happen to use are sed and vim, which probably won't help you much.
A quick Google search turns up several applications and sites that can regex text files or copy-pasted info. The makers of the above tutorial also have a regex program.

In this specific case, I took a look at the HTML of the front page, found a loooooooooong block of text that had links to the separate issues, deleted everything above and below the block, and then ran a couple regexes that I didn't bother to remember, which progressively cleaned it until there was one plain URL per line.
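
The regexes actually used were not recorded, so the following is only a sketch of that kind of cleanup, not the original commands. The sample HTML, the /issues/ link pattern, and the file names frontpage.html and issue-urls.txt are all assumptions for illustration; the real front page will need its own pattern. On Windows, grep and sed are available through environments such as Cygwin or Git Bash.

```shell
# Hypothetical stand-in for the saved front page; the real markup will differ.
cat > frontpage.html <<'EOF'
<ul>
<li><a href="/issues/II94">NLR 94</a></li>
<li><a href="/issues/II93">NLR 93</a></li>
<li><a href="/about">About</a></li>
</ul>
EOF

# 1. grep -o prints each href="/issues/..." match on its own line.
# 2. sed strips the href=" prefix and the closing quote,
#    then prepends the site address to make a full URL.
grep -o 'href="/issues/[^"]*"' frontpage.html \
  | sed -e 's/^href="//' -e 's/"$//' -e 's|^|https://newleftreview.org|' \
  > issue-urls.txt

cat issue-urls.txt
```

The result is one plain URL per line, which is the shape --input-file expects.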

Last edited by eschwartz; 08-30-2015 at 01:37 AM.
Old 08-31-2015, 05:52 PM   #22
bobodude
Connoisseur
Thanks for the additional info, I'll look into it !!!

And thanks again for all the tips you posted, I really learnt a lot !!!

These are things I've been wanting to know for a while ...
Old 08-31-2015, 05:59 PM   #23
eschwartz
Ex-Helpdesk Junkie
Happy to be of service.

And of course feel free to ask any questions you might have as you figure things out.
Old 09-01-2015, 11:54 AM   #24
bobodude
Connoisseur
As a matter of fact, there was one more thing (I wasn't sure if I'd be asking for too much info ...).

To download articles through wget from the newleftreview, I need to be identified as being connected through a university internet connection.

I was thinking of doing this through a wget proxy connection. I googled this and found a couple of wget commands, but haven't had any luck so far.

So if there is a command you know of that works for you, that would be great,
or if you know of another way ...

I have seen that one can alter the wgetrc file to configure it to use a proxy server. However, as mentioned earlier, I can't seem to find this file ...

thanks again !!!
Old 09-01-2015, 05:07 PM   #25
eschwartz
Ex-Helpdesk Junkie
Quote:
to download articles through wget from the newleftreview, I need to be identified as being connected through a university internet connection,
What do you mean by this? Does the website allow people to pull articles via wget for educational purposes? Good news!
Although I already detailed how to masquerade as Firefox, so they would never know.

But simply using the university proxy would be easier.

The standard way to use a proxy is to set an environment variable:

Code:
set https_proxy=https://proxy.server.com
Or set https_proxy permanently through the Environment Variables dialog, which you can find via the Start Menu search box, or with the far more usable Rapid Environment Editor.


Many programs know how to obey this environment variable, including wget.


You can also specify a wgetrc file using
Code:
wget --config C:\path\to\config\file   [more options and websites and stuff]
You can use the WGETRC environment variable to specify the location of your wgetrc file (on Linux, wget automatically looks in $HOME/.wgetrc, but I cannot find any comparable location for Windows).
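
For what it's worth, a wgetrc that covers the proxy and the polite-download settings from this thread might look roughly like the sketch below. The proxy address proxy.example.edu:8080 and the file name my-wgetrc are placeholders, not real values; ask your university for the actual proxy host and port. The example uses Unix shell syntax. In a Windows command prompt, create the file in a text editor and use "set WGETRC=C:\path\to\my-wgetrc" instead of export.

```shell
# Write a small wgetrc; every value here is a placeholder to adapt.
cat > my-wgetrc <<'EOF'
# Route requests through the (hypothetical) university proxy.
use_proxy = on
http_proxy = http://proxy.example.edu:8080/
https_proxy = http://proxy.example.edu:8080/
# Same politeness settings as --wait 5 --random-wait on the command line.
wait = 5
random_wait = on
user_agent = Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0
EOF

# Point wget at this file for the rest of the shell session.
export WGETRC="$PWD/my-wgetrc"
echo "wget will read settings from: $WGETRC"
```

With this in place, the --wait, --random-wait and --user-agent options can be left off the wget command line, since wget reads them from the file.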

Last edited by eschwartz; 09-01-2015 at 05:12 PM.
Old 09-04-2015, 06:32 AM   #26
bobodude
Connoisseur
Thanks again !!!

(no more questions for a while, promise ...)
All times are GMT -4.