MobileRead Forums - View Single Post

eschwartz · 08-30-2015, 01:22 AM

University VPN connections -- I can't imagine why they would have a problem, unless they have a rule against using too much bandwidth in a short span of time?
Assuming you tell wget to wait between requests, to protect yourself from the wrath of the website's webmaster, you should also be protected from the wrath of a university VPN admin.

The basic idea is to always pretend to be the Firefox browser... nice and innocuous... and balance the need for immediacy with the need to stagger your downloads down to a human-compatible "click" rate. The two together mean that if someone does dislike people using wget (which is not actually a given), then it's "better safe than sorry" -- they won't know anyway.
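To make that concrete, here is a minimal Python sketch of the same two ideas: send a Firefox-looking User-Agent and pause a slightly random amount between requests. It is not the wget command I actually ran. The User-Agent string, the 5-second base delay, and the names build_request, human_delay, and fetch_all are all assumptions made up for illustration.

```python
import random
import time
import urllib.request

# Assumed example value; any current Firefox User-Agent string would do.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that identifies itself as Firefox."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def human_delay(base_seconds=5.0, jitter=0.5):
    """Pick a wait between base*(1-jitter) and base*(1+jitter) seconds."""
    low = base_seconds * (1 - jitter)
    high = base_seconds * (1 + jitter)
    return random.uniform(low, high)


def fetch_all(urls, base_seconds=5.0):
    """Download each URL in turn, pausing before every request after the first."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(human_delay(base_seconds))
        with urllib.request.urlopen(build_request(url)) as response:
            pages[url] = response.read()
    return pages
```

Calling fetch_all on a list of URLs returns a dict mapping each URL to the bytes it downloaded. Raise base_seconds if you want to be gentler on the server.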

As for how I regexed the hell out of it: it will be different for every site, so the basic idea is to learn how to write regexes. This website is the one that taught me; I like their noob-friendly explanations:

http://regular-expressions.info
They can explain better than I can.
Regular expressions (regexes) are a powerful tool for cutting apart and putting back together text, and you'd be surprised at how useful they can be in general. For example, LibreOffice allows regular expressions in Find and Replace.
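As a tiny illustration of "cutting apart and putting back together", here is a sketch in Python (not a tool I mentioned using; the function name swap_name and the sample input are made up). The parentheses capture two pieces of the line, and the replacement puts them back in the opposite order.

```python
import re


def swap_name(line):
    """Turn 'Last, First' into 'First Last' using two capture groups."""
    # (\w+) captures each word; \2 and \1 re-insert them in swapped order.
    return re.sub(r"^(\w+),\s*(\w+)$", r"\2 \1", line)


print(swap_name("Doe, Jane"))
```

A line that does not match the pattern is returned unchanged, which makes this kind of replacement fairly safe to run over a whole file.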

There are various programs that can run regexes. The tools I happen to use are sed, or vim's own search and replace -- which probably won't help you much.

A quick Google search turns up several applications and sites that can run regexes on text files or copy-pasted text. The makers of the above tutorial also have a regex program.

In this specific case, I took a look at the HTML of the front page, found a loooooooooong block of text that had links to the separate issues, deleted everything above and below the block, and then ran a couple of regexes that I didn't bother to remember, which progressively cleaned it up until there was one plain URL per line.
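Since I didn't keep the actual regexes, here is a rough Python sketch of the same workflow rather than what I really typed into sed/vim. The sample HTML, the start and end markers, and the function names are all hypothetical; a real page will need its own markers and probably a different pattern.

```python
import re

# Hypothetical stand-in for the front page; the real markup will differ.
SAMPLE_HTML = """
<div id="header">Site title</div>
<ul class="issues">
<li><a href="http://example.com/issues/001.html">Issue 1</a></li>
<li><a href="http://example.com/issues/002.html">Issue 2</a></li>
</ul>
<div id="footer">Contact</div>
"""


def extract_block(html, start_marker, end_marker):
    """Keep only the text from start_marker through end_marker."""
    start = html.index(start_marker)
    end = html.index(end_marker, start) + len(end_marker)
    return html[start:end]


def links_one_per_line(block):
    """Pull every href value out of the block, one URL per line."""
    urls = re.findall(r'href="([^"]+)"', block)
    return "\n".join(urls)


block = extract_block(SAMPLE_HTML, '<ul class="issues">', "</ul>")
print(links_one_per_line(block))
```

Save the printed output to a text file and you have a list you can feed to wget or any other downloader.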

Last edited by eschwartz; 08-30-2015 at 01:37 AM.