Full Newspaper - The Christian Science Monitor

heavyB · 11-01-2006, 08:37 PM

I love rss2book that geekraver put together, but I kept longing to read whole articles, not just the RSS feeds. The idea of downloading a full newspaper and reading it over a cup of coffee is the ultimate Sunday morning for me. I've found a way to download one of my favorite papers online, highly readable, with a table of contents, and ready in 2 minutes.

After searching for RSS feeds that deliver full news articles (I found none), I found that one of my favorite newspapers offers a "text version" of its site. This is my roundabout way to get the whole content of the online version of The Christian Science Monitor onto my Sony Reader. Feel free to offer suggestions; I am by no means a programmer, just a hack. Of course this is all for naught if everyone out there but me knows of sites that serve full news text in RSS.

I'm using Windows XP, Firefox (I'm using 2.0), HTMLdoc (see this post) and TextPad (free @ http://www.textpad.com )

Download the attached "getcs.bat" file and place it in your directory where you have HTMLdoc installed (usually c:\program files\HTMLdoc).

Open Firefox and browse to: http://www.csmonitor.com/cgi-bin/red...pl?textEdition

Right-click on the newly loaded page and select "View Page Info". This pops up a dialog box with tabs along the top; select the "Links" tab, which displays all the links on the page. Drag this window a bit bigger so you can see what you've got going on in here. You'll notice where the category links end and the stories begin. The articles all have a year in the URL, like this: http://www.csmonitor.com/2006/1102/p13s01-lign.htm . I select the first article link by clicking on it, then scroll to the end of the article link list and, while holding the [Shift] key, click the last article. You should now have all the article links selected. Right-click on the selection and left-click "Copy".
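If you'd rather not pick the article links out by eye, the "year in the URL" rule can be sketched in a few lines of Python. This is only an illustration, not part of the original method: the pattern is my guess based on the single example URL above, the category link in the sample is hypothetical, and `pick_article_links` is a made-up helper name.

```python
import re

# Assumption: article URLs look like the example in the post, i.e. a
# four-digit year followed by a four-digit month/day segment.
ARTICLE_URL = re.compile(r"^http://www\.csmonitor\.com/\d{4}/\d{4}/\S+\.html?$")


def pick_article_links(links):
    """Keep links that look like articles, in order, without duplicates."""
    seen = set()
    articles = []
    for link in links:
        link = link.strip()
        if ARTICLE_URL.match(link) and link not in seen:
            seen.add(link)
            articles.append(link)
    return articles


if __name__ == "__main__":
    sample = [
        "http://www.csmonitor.com/world/index.html",  # hypothetical category link
        "http://www.csmonitor.com/2006/1102/p13s01-lign.htm",
    ]
    print(pick_article_links(sample))
```

The input would be the list of links copied from the "Links" tab, one per line.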

Open the "getcs.bat" file you downloaded from the link below with TextPad. You'll see "[PASTE CSMONITOR LINKS HERE]" in the text. Delete this, leaving the space after the text "http://www.csmonitor.com/cgi-bin/redirect.pl?textEdition". Here is where we paste the links from Firefox, by right-clicking in TextPad and left-clicking "Paste".

Right-click again and select "Reformat". This is important: it strips the return characters from the Firefox link paste. (You may need to have word wrap on in TextPad to see what you're doing, which is fine; word wrap has no effect on the saved file.)
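For anyone without TextPad, the effect of that step can be approximated in Python. This is a sketch under my own assumption that the goal is one line of links separated by single spaces; `join_links` is a hypothetical helper, not something TextPad or the .bat file provides.

```python
def join_links(pasted_text):
    """Collapse a multi-line paste into one space-separated line.

    str.split() with no arguments splits on any run of whitespace,
    so carriage returns, newlines and blank lines all disappear.
    """
    return " ".join(pasted_text.split())


if __name__ == "__main__":
    pasted = "http://example.com/a.htm\r\nhttp://example.com/b.htm\r\n"
    print(join_links(pasted))
```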

Save this edited file by clicking "File" and then "Save" at the top left of TextPad. If you've ever seen the config file for rss2book, you'll see that I borrow heavily from geekraver for my HTMLdoc settings.

To run, double-click "getcs.bat". A quick warning regarding .bat files, by the way: malicious folks can put nasty things in these files, and you should never run one without viewing it first. You can see that in this .bat file the only program being run is HTMLdoc. It should create a "csmonitor.pdf" file in the same directory HTMLdoc is installed in.
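The attached .bat file isn't reproduced in this post, so the following is not its contents. It is a minimal Python sketch of the same idea: hand HTMLdoc the text-edition index plus the article links and ask for one PDF. I'm assuming `htmldoc` is on the PATH and using only its `--webpage` and `-f` options; the real file carries more settings (the ones borrowed from rss2book), and `build_htmldoc_command` is a hypothetical helper.

```python
INDEX_URL = "http://www.csmonitor.com/cgi-bin/redirect.pl?textEdition"


def build_htmldoc_command(article_links, output="csmonitor.pdf"):
    """Return the argument list for one HTMLdoc run.

    Assumption: --webpage (treat inputs as plain web pages) and
    -f (output file) are enough for a basic PDF; the original
    getcs.bat uses additional settings not shown here.
    """
    return ["htmldoc", "--webpage", "-f", output, INDEX_URL, *article_links]


if __name__ == "__main__":
    links = ["http://www.csmonitor.com/2006/1102/p13s01-lign.htm"]
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(build_htmldoc_command(links)))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)`. Printing first follows the same advice as for .bat files: look at what will be executed before executing it.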

That should do it. Excuse me if I rambled, was too simplistic, or didn't explain enough.

I've attached the csget.bat file, a csgetSample.bat you can copy and run right away, and a sample csmonitor.pdf.

geekraver · 11-01-2006, 09:21 PM

Actually, I've been thinking about doing something vaguely like this. I recently saw a set of scripts someone put together to pull Wikipedia content together (kind of like web spidering, starting with a small set of articles) and put it on an iPod. I thought that would be cool to do, and ultimately it should be generalized. One approach with respect to RSS would be to flag each feed as being complete or partial and, in the case of partial ones, follow the links to the full text. This might require some additional configuration to know how to extract the signal from the noise of the full pages, but in many cases this could be done just by scanning for appropriate DIV tag class attributes.
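To make the DIV-scanning idea concrete, here is a minimal sketch using only Python's standard `html.parser`. It is not how rss2book works; the class name `story` and the helper names are hypothetical, and a real site would need its own class name configured per feed.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside <div> elements carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0    # div nesting depth inside the matched div (0 = outside)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1    # nested div inside the article body
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_article(html, target_class):
    """Return the text found inside divs whose class list contains target_class."""
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = '<div class="nav">Menu</div><div class="story">Full text here</div>'
    print(extract_article(page, "story"))
```

The depth counter is what lets nested DIVs inside the article body stay part of the extracted text while navigation and footer DIVs are skipped.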

neilm2 · 11-02-2006, 12:31 AM

You rock, Heavy B! Now I'm searching around for other newspapers that offer text-only versions that work with this .bat file.

neilm2 · 11-02-2006, 12:37 AM

BBC News works pretty well...
http://news.bbc.co.uk/2/low/default.stm

heavyB · 11-02-2006, 01:47 AM

Thanks, Neilm2, and nice find on that BBC link. They're few and far between. The New York Times has a nice print-only format, but only after loading the full story page (no index).

Geekraver, I fully agree with what you're saying here: basically a scraper app or scraping service with individual profiles for different web sites. Heh, an online scraping service wouldn't last long. But if we had a combo app, with an online service offering updated web site profiles that work with an app like your rss2book (scrape2book?), there wouldn't be much trouble with getting shut down by the ad mongers (the real reason sites don't serve full-text RSS or offer text-only services).

I'm not too shabby at web app dev (mostly CFML) and parsing, but I have little to no standalone app dev experience. This of course should be moved to the dev subcategory in the forum. I'd be interested in discussing it further.

geekraver · 11-02-2006, 04:20 AM

Okay, I'm about to post an updated version of rss2book. It certainly doesn't do everything but it has enough added functionality that you can make a nice PDF of BBC news.

neilm2 · 11-02-2006, 12:21 PM

That's great, Geekraver! I'm looking forward to it.

geekraver · 11-04-2006, 04:52 AM

It's not the full paper; just the main world news, but import the XML file below into rss2book release 7 and you're on your way!


