Automating offline web content collection


travis
06-14-2007, 03:43 AM
I have been trying to find the best way to automate gathering information off the web for later offline reading (on a Nokia N800 internet tablet equipped with FBReader).

I will be traveling extensively for a long period of time with only occasional internet access for my Nokia N800. What I would like to do is automatically gather content from several web sites that I enjoy, while I have internet access, for later offline reading. For instance, a couple of newspapers and the Wall Street Journal (I am a subscriber). This would also reduce the amount of time I need to spend at internet cafes and let me catch up on reading during down time when I am traveling by bus or plane, during days when I am resting, on the beach, etc. (I will have a spare battery and an external charger.) Paper books are not even an option, because they are too heavy and bulky for a trip of this length, and I will not be in any English-speaking countries where I could purchase more along the way.

I have searched here and found several possible solutions, but I am not sure which is the right one, and they all involve a considerable learning curve just to determine what they can do. I have briefly looked at Dapper.net, Sitescooper, Website Puller, etc., but I don't really know what is possible with them. Reading RSS feeds offline really is not a solution for me, since the feeds are so incomplete (although that does get me part of the way with a few blogs that I read).

My dream setup would be something that automatically emails me, daily, an HTML file for each of the web sites I am interested in, in a form I can read offline later. Then, whenever I pull down my email on my Nokia N800, I would basically have lots of offline content to read, and I wouldn't miss anything, since the content is being archived daily.
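
To make that concrete, the kind of daily job I am imagining is a small script along these lines. This is only a rough sketch: the page URLs and mail settings are made up, and a subscriber site like the WSJ would need some kind of login handling on top of it.

import smtplib
import urllib.request
from email.message import EmailMessage

# Pages I would want collected each day (placeholder URLs).
PAGES = {
    "WSJ front page": "http://online.wsj.com/",
    "Local paper": "http://example.com/news",
}

def fetch(url):
    # Grab the raw HTML for one page.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def mail_html(subject, html):
    # Mail the page to myself as an HTML message
    # (addresses and SMTP server are placeholders).
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "me@example.com"
    msg["To"] = "me@example.com"
    msg.set_content(html, subtype="html")
    with smtplib.SMTP("smtp.example.com") as server:
        server.send_message(msg)

for name, url in PAGES.items():
    mail_html(name, fetch(url))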

Can anyone suggest a direction for me to research? I am mostly looking for high-level direction. I don't even know where to start, or whether what I am looking for is possible (within reason).

Thanks!
Travis

ashkulz
06-14-2007, 10:11 AM
If you know Python programming, you might want to use Scrape 'N' Feed (http://www.crummy.com/software/ScrapeNFeed/) to generate an RSS file from a website. There are many solutions to convert from RSS => HTML.
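
The RSS => HTML step, for instance, only takes a few lines with the feedparser library; here is a minimal sketch (the feed URL is a placeholder):

import feedparser

def feed_to_html(feed_url):
    # Parse the feed and render every entry into one HTML page.
    feed = feedparser.parse(feed_url)
    parts = ["<html><body><h1>%s</h1>" % feed.feed.get("title", feed_url)]
    for entry in feed.entries:
        parts.append("<h2>%s</h2>" % entry.get("title", "(untitled)"))
        # Full-content feeds carry the article body in entry.content;
        # summary-only feeds provide just a snippet in entry.summary.
        if "content" in entry:
            parts.append(entry.content[0].value)
        else:
            parts.append(entry.get("summary", ""))
    parts.append("</body></html>")
    return "\n".join(parts)

with open("daily.html", "w", encoding="utf-8") as f:
    f.write(feed_to_html("http://example.com/feed.rss"))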

I tend to read mostly blogs subscribed via Bloglines, so I use a customized version (http://puggy.symonds.net/~ashish/downloads/) of bloglines2html (http://fucoder.com/code/bloglines2html/). That serves my needs.

travis
06-16-2007, 01:09 AM
Thanks for the suggestion. Wow, this solution looks pretty complex too. I don't know Python, although I am a programmer.

And unfortunately I don't have a machine on which to run cron jobs during my long trip, although I am sure I could impose on one of my friends.
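
(If a friend does run it for me, the scheduling side at least looks trivial; from what I can tell the crontab entry would just be a one-liner, with a made-up script path:

0 6 * * * /usr/bin/python /home/travis/fetch_news.py

i.e. run the collection script every morning at 6:00.)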

It seems like there is a large learning curve and some custom work involved in any approach to this problem (which is something I wanted to confirm with this post, to see whether I was missing something).

I am figuring that I can do most of my blog reading just by using my RSS reader on the N800 offline and refreshing the feeds whenever I have bandwidth. The blog writers seem to provide complete feeds, although I miss out on the useful comments.

Travis