10-07-2010, 04:44 PM   #10
nrapallo
GuteBook/Mobi2IMP Creator
nrapallo ought to be getting tired of karma fortunes by now.
 
 
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200, EBW1150, T1, NSTG, iLiad_v2, NC, Asus_TF, Next1, WPDN
Quote:
Originally Posted by Starson17
I started with HTTrack, but found some things it wouldn't do well for me. It's been a long time, and perhaps it has been updated, but I switched to wget and have been happy with it. I use it daily/hourly/weekly to automatically grab certain files for my wife on those sites that want you to come back each day/hour/week for something free.
I think that's a fair assessment when the website is "troublesome", i.e. it doesn't stay on the same URL path and/or goes off-domain with its hyperlinks.

The aforementioned book on the MIT Press website was very well constructed and "behaved nicely" when being spidered, so I didn't have much to worry about when using HTTrack. Using wget should not have had any issues either.
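
Just as a rough sketch of what I mean (the URL below is only a placeholder, not the actual MIT Press page), the wget equivalent of a basic well-behaved mirror would be something along these lines:

Code:
# -m = mirror recursively, -k = convert links for offline reading,
# -p = grab page requisites (images/CSS), -E = add .html extensions,
# --no-parent keeps the spider from wandering above the book's directory
wget -m -k -p -E --no-parent --wait=1 http://example.com/book/index.html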

There are some tricks/techniques I employ when dealing with a "poorly linked website" for spidering purposes, but they usually get used ONCE and then the project is spidered and done with.
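
For what it's worth, the sort of options I mean look roughly like this (the host and directory names are made up, purely to illustrate):

Code:
# follow off-site links, but only to the hosts explicitly listed,
# and only within the directories that actually belong to the book
wget -r -l 3 -k -p --span-hosts \
     --domains=example.com,images.example.net \
     --include-directories=/book,/covers \
     --reject "*.zip,*.exe" \
     http://example.com/book/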

For some examples of websites that I've spidered and converted to ebooks in the past, see the bottom of this thread.