Quote:
Originally Posted by Starson17
I started with HTTrack, but found some things it wouldn't do well for me. It's been a long time, and perhaps it has been updated, but I switched to wget and have been happy with it. I use it daily/hourly/weekly to automatically grab certain files for my wife on those sites that want you to come back each day/hour/week for something free.
I think that's a fair assessment when the website is "troublesome", i.e. it doesn't stay on the same URL path and/or goes off-domain with its hyperlinks.
The aforementioned MIT Press website book was very well constructed and "behaved nicely" when being spidered, so I didn't have much to worry about when using HTTrack. wget should handle it without any issues as well.
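For a well-behaved site like that one, a basic recursive wget mirror is usually all it takes. A minimal sketch (the URL is just a placeholder, and the exact flags depend on the site):

    wget --mirror --convert-links --adjust-extension --page-requisites --no-parent --wait=1 https://example.com/book/

--convert-links rewrites the hyperlinks so the local copy browses offline, --page-requisites pulls in the CSS and images an ebook conversion would otherwise miss, and --no-parent keeps the crawl from wandering above the book's directory.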
There are some tricks/techniques I employ when dealing with a "poorly linked website" for spidering purposes, but they usually get used ONCE and then the spidering project is over.
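One common trick of that sort (shown here only as a sketch, with placeholder domains) is to let wget follow links onto other hosts, but only onto domains you explicitly allow, so a site that keeps its images or chapters on a second host still mirrors cleanly without the crawl escaping onto the whole web:

    wget --recursive --level=3 --span-hosts --domains=example.com,cdn.example.org --convert-links --page-requisites https://example.com/start.html

--span-hosts permits off-domain links, while --domains restricts them to the listed hosts; --level caps the recursion depth so a badly linked site can't drag the spider too far.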
For some websites that I've spidered and converted to ebooks in the past, see the bottom of this thread.