View Full Version : retrieving printer-friendly versions...
goducks 08-24-2006, 12:14 PM I've been trying to figure this out for a week to no avail so if anyone can help, I'd be really grateful. There are several writers on Sports Illustrated online you can get RSS feeds for, but SI.com spreads the articles out over 3 or 4 pages, so I'd like to nab just the printer-friendly version, but I can't seem to configure Sunrise XP to do it correctly.
For example, Dr Z's feed is at http://rss.cnn.com/rss/si_dr_z.rss
All of the articles off it start with http://sportsillustrated.cnn.com/2006/writers/dr_z*
and on the first page of each is a link to a single-page printer friendly version that begins http://si.printthis.clickability.com/*
How in the world do I get Sunrise to get that version?
I think you're out of luck on this one. They've gone to extreme lengths to make it impossible--as far as I can see--to identify the link to the printable version.
If you open the "printable" window, right-click and look at the page properties, you'll see that its URL is very long and very complex. It includes a phrase that is not part of the original page and also includes an eight-digit number that is not found anywhere in the source on the original page. If that information isn't there, then there is no way Sunrise is going to find it.
But your problem is even worse. You need to be able to construct the "printable" link not from the information in the main article page, but rather from just the information on the RSS page you're starting with. That means that the information that uniquely identifies the printable version must be in the link you start with. It's just not there. Sorry.
goducks 08-25-2006, 10:14 AM Thanks DTM. I was starting to guess that myself. I hadn't realized that Sunrise had to reconstruct the link from the RSS feed address, though now that makes sense. I had thought we were somehow training it to find the printer-friendly version off of the articles themselves; the Wiki training article didn't make that clear.
PippoPippini 09-21-2006, 11:05 AM I think you're out of luck on this one. They've gone to extreme lengths to make it impossible--as far as I can see--to identify the link to the printable version.
If you open the "printable" window, right-click and look at the page properties, you'll see that its URL is very long and very complex. It includes a phrase that is not part of the original page and also includes an eight-digit number that is not found anywhere in the source on the original page. If that information isn't there, then there is no way Sunrise is going to find it.
But your problem is even worse. You need to be able to construct the "printable" link not from the information in the main article page, but rather from just the information on the RSS page you're starting with. That means that the information that uniquely identifies the printable version must be in the link you start with. It's just not there. Sorry.
Hi.
I think I have a similar problem with the RSS feeds from Reuters.
The articles linked from the RSS feed are split across multiple pages. There is a link to a printable version, but it opens in a pop-up, with a syntax that uses a string of text taken from the article's URL.
Analyzing Bloomberg RSS feeds, I think that probably it's possible to link easily the printable page, because the printable link has only a "#" at the end.
I also analyzed the feed of Washington Post.
In the RSS feed links are like this:
http://www.washingtonpost.com/wp-dyn/content/article/2006/09/20/AR2006092001064.html?nav=rss_business
The printable one is:
http://www.washingtonpost.com/wp-dyn/content/article/2006/09/20/AR2006092001064_pf.html
The printable link adds "_pf" just before the ".html" of the main article URL and drops the "?nav=..." part.
If anyone is interested in these feeds, could you help me write a regular expression for these two?
I also download the feed of one of the major Italian newspapers, Corriere della Sera. Their printable link is the same URL without the "s" in the final ".shtml" extension. If I can learn how to rewrite links properly ...
Bye
Pippo
PippoPippini 09-29-2006, 10:10 AM ...
Analyzing Bloomberg RSS feeds, I think that probably it's possible to link easily the printable page, because the printable link has only a "#" at the end.
...
HI.
I tried rewriting the links of Bloomberg's feed.
The link filter is http://www\.bloomberg\.com(.*), while the rewrite rule I wrote is http://www.bloomberg.com$1#
But it doesn't work. What's wrong ?
G.
I haven't forgotten you!
This one's a bit tougher than most, but I think I have it. I just want to do some more testing and will then post the answer.
(They use some code numbers that I'm afraid might change from day to day, so I don't want to give you a "solution" that will fail tomorrow.)
I tried rewriting the links of Bloomberg's feed.
The link filter is http://www\.bloomberg\.com(.*), while the rewrite rule I wrote is http://www.bloomberg.com$1#
But it doesn't work. What's wrong ?
Got it!
Your rule doesn't work because the printable page is not just the original URL with a # appended. That's what you see when you hover on the link, but when you actually click the link, it runs a JavaScript routine that constructs the real URL.
This is a good example to use to show how to crack the more advanced problems. To find the real URL for the printable version, click the link for the printer version. When the printer-friendly box opens, right-click on it and select Properties from Internet Explorer or View Page Info from Firefox. That will give you the actual URL for the page.
The example I used was the Economy RSS feed. One of the article pages was:
http://www.bloomberg.com/apps/news?pid=20601068&sid=avAOrwcRZaAU&refer=economy
The printable version was:
http://www.bloomberg.com/apps/news?pid=20670001&refer=economy&sid=avAOrwcRZaAU
Comparing the two, we see that there are two changes: the pid number is different and the "sid" section has swapped places with the "refer" section.
We are fortunate because it turns out that all of the printable pages for all of the feeds have a pid of 20670001. Our rewrite rule, then, must change the pid number and move the sid segment to the end.
The filter looks like this:
http://www\.bloomberg\.com/apps/news\?pid=.*&sid=(.*)&refer=economy
Notice that the first .* is not in parentheses, because we don't need to save the value; the number changes in the rewrite. The second is needed, however. Notice also that you must backslash periods and question marks.
The rewritten expression looks like this:
http://www.bloomberg.com/apps/news?pid=20670001&refer=economy&sid=$1
It is identical up to the pid, where we substitute the new number 20670001, then we follow that with the "refer" segment and end with the "sid" segment. The $1 inserts the sid code that we captured from the original link.
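A minimal sketch of the rule above, using Python's re module as a stand-in for Sunrise's regex engine (an assumption; Sunrise may differ in details such as whether the filter must match the whole link). Sunrise's $1 is written \1 in Python, and to_printable is just an illustrative name. The article URL is taken in the form the filter expects, with "news?pid=".

```python
import re

# Filter from the post, copied unchanged.
FILTER = r"http://www\.bloomberg\.com/apps/news\?pid=.*&sid=(.*)&refer=economy"
# Rewrite from the post; Sunrise's $1 becomes \1 in Python.
REWRITE = r"http://www.bloomberg.com/apps/news?pid=20670001&refer=economy&sid=\1"


def to_printable(url: str) -> str:
    """Return the printable URL, or the input unchanged if the filter does not match."""
    match = re.fullmatch(FILTER, url)
    return match.expand(REWRITE) if match else url


article = "http://www.bloomberg.com/apps/news?pid=20601068&sid=avAOrwcRZaAU&refer=economy"
print(to_printable(article))
```

With the example article from the post, this prints the printable URL quoted earlier: the pid is replaced and the sid segment moves to the end.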
I checked only a couple of the feeds, but this appears to work for all of them. The only variation from one feed to another is that the "refer" part changes from "economy" to "politics", etc.
Enjoy!
----------------Edit------------------
It occurred to me that you may want to use a variation on the above, as follows:
Filter:
http://www\.bloomberg\.com/apps/news\?pid=.*&sid=(.*)&refer=(.*)
Rewrite:
http://www.bloomberg.com/apps/news?pid=20670001&refer=$2&sid=$1
This allows you to use exactly the same rule for all of the feeds, making it easier to copy and modify the Sunrise documents for each feed.
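Here is a rough sketch of the two-group variant, again with Python's re module standing in for Sunrise (an assumption), so $1 and $2 become \1 and \2. The article_url helper and the reuse of the same sid for a "politics" link are hypothetical, only there to show one rule serving several feeds.

```python
import re

# Generic filter and rewrite from the post; $1/$2 become \1/\2 in Python.
FILTER = re.compile(r"http://www\.bloomberg\.com/apps/news\?pid=.*&sid=(.*)&refer=(.*)")
REWRITE = r"http://www.bloomberg.com/apps/news?pid=20670001&refer=\2&sid=\1"


def to_printable(url: str) -> str:
    """Rewrite a Bloomberg article link; leave anything else untouched."""
    match = FILTER.fullmatch(url)
    return match.expand(REWRITE) if match else url


def article_url(section: str) -> str:
    """Hypothetical helper: build an article link for a given feed section."""
    return ("http://www.bloomberg.com/apps/news?pid=20601068"
            "&sid=avAOrwcRZaAU&refer=" + section)


for section in ("economy", "politics"):
    print(to_printable(article_url(section)))
```

The same pair of strings handles both sections because the "refer" value is captured and copied instead of being hard-coded.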
PippoPippini 10-10-2006, 09:17 AM Got it!
...
GREAT !!
It works !
I don't know how to thank your efforts.
I'm now trying to elaborate other newspaper sites: it's the only way I know to make my contribution.
Bye
Gaetano
PippoPippini 10-10-2006, 11:50 AM Ok, here 2 new rewrite rules for other newspapers.
Washington Post:
Pattern: http://www\.washingtonpost\.com(.*)\.html(.*)
Rewrite rule: http://www.washingtonpost.com$1_pf.html
Corriere della Sera (Italian newspaper)
Pattern: http://www\.corriere\.it(.*)shtml
Rewrite rule: http://www.corriere.it$1html
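A quick way to check these two rules outside Sunrise, sketched with Python's re module (assumed to behave like Sunrise's engine for such simple patterns; $1 is written \1). The Washington Post link is the one quoted earlier in the thread; the Corriere link is a made-up example.

```python
import re

# (filter, rewrite) pairs from the post; Sunrise's $1 is \1 in Python.
RULES = [
    (r"http://www\.washingtonpost\.com(.*)\.html(.*)",
     r"http://www.washingtonpost.com\1_pf.html"),
    (r"http://www\.corriere\.it(.*)shtml",
     r"http://www.corriere.it\1html"),
]


def to_printable(url: str) -> str:
    """Apply the first rule whose filter matches the whole URL."""
    for pattern, rewrite in RULES:
        match = re.fullmatch(pattern, url)
        if match:
            return match.expand(rewrite)
    return url


wapo = ("http://www.washingtonpost.com/wp-dyn/content/article/"
        "2006/09/20/AR2006092001064.html?nav=rss_business")
corriere = "http://www.corriere.it/esempio/articolo.shtml"  # hypothetical link

print(to_printable(wapo))
print(to_printable(corriere))
```

The second group in the Washington Post filter swallows the "?nav=rss_business" part, and since the rewrite never uses $2 it is simply dropped.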
Now I'm analyzing International Herald Tribune and Reuters
With IHT I tried this rule:
Pattern: http://www\.iht\.com(.*)
Rewrite rule: http://www.iht.com/bin/print_ipub.php?file=$1
But it doesn't work: articles downloaded have only title, author and date.
See later
Gaetano
doctorow 10-10-2006, 12:02 PM You should probably use http://www\.iht\.com/articles/(.*) as pattern, and then rewrite with http://www.iht.com/bin/print_ipub.php?file=/articles/$1
PippoPippini 10-11-2006, 02:55 AM You should probably use http://www\.iht\.com/articles/(.*) as pattern, and then rewrite with http://www.iht.com/bin/print_ipub.php?file=/articles/$1
I tried, but still doesn't work.
The "/articles/" pattern doesn't change behaviour of the filter.
I'll continue to investigate.
Someone is interested in Reuters feed ?
This is the pattern rule:
http://today\.reuters\.com(.*)type=businessNews&story(.*)
and this is the rewrite rule: http://today.reuters.com/misc/PrinterFriendlyPopup.aspx?type=businessNews&story$2
Enjoy
Gaetano
Laurens 10-11-2006, 03:08 AM With IHT I tried this rule:
Pattern: http://www\.iht\.com(.*)
Rewrite rule: http://www.iht.com/bin/print_ipub.php?file=$1
But it doesn't work: articles downloaded have only title, author and date.
This pattern is correct, but unfortunately the Tidy HTML parser that Sunrise XP uses can't handle this page. It has trouble dealing with inline JavaScript sometimes.
As a fallback, you can use http://mobile.iht.com/
PippoPippini 10-11-2006, 08:06 AM As a fallback, you can use http://mobile.iht.com/
Well, it works.
I rewrote the pattern rule like this: http://www\.iht\.com(.*)php
The rewrite rule is: http://mobile.iht.com/$1xhtml
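A small sketch of what this rule produces, with Python's re module standing in for Sunrise (an assumption) and a made-up article link. One thing it makes visible: the captured part already starts with "/", so the rewritten link contains "//" after the host. The rule is reported to work as posted, so the site evidently tolerates that; the variant without the extra slash is untested against the site.

```python
import re

FILTER = r"http://www\.iht\.com(.*)php"
REWRITE = r"http://mobile.iht.com/\1xhtml"           # as posted ($1 -> \1)
REWRITE_SINGLE_SLASH = r"http://mobile.iht.com\1xhtml"  # hypothetical variant


def rewrite_url(url: str, template: str) -> str:
    """Expand the template if the filter matches the whole URL."""
    match = re.fullmatch(FILTER, url)
    return match.expand(template) if match else url


article = "http://www.iht.com/articles/2006/10/11/business/example.php"  # hypothetical

print(rewrite_url(article, REWRITE))
print(rewrite_url(article, REWRITE_SINGLE_SLASH))
```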
I keep using the links in the RSS feed and translate them to the corresponding pages on the mobile version of the site.
Linking directly to http://mobile.iht.com/ and downloading the articles from there is not useful if you want articles from sections that are not listed on the mobile home page and appear only on the second page.
Thanks for your help Laurens.
Gaetano
The Collector 10-14-2006, 10:09 AM Been trying to crunch out NYT by brute force, but I'm being rather unsuccessful. First post dolt here. :/
(http://www\.nytimes\.com)/(2006)/(..)/(..)/(world)/(africa|americas|asia|europe|middleeast)/(.*\.html?ref=world)
Rewrite:
$1/$2/$3/$4/$5/$6/$7&pagewanted=print
With my luck it's something simple. I'm pretty sure that many of the rewrites are redundant, and I should attempt to clean it up, but it would seem futile if I can't even get it to work properly in the first place.
Laurens 10-14-2006, 10:13 AM For NYT feeds and rewriting patterns, look in the Showcase (http://www.sunrisexp.com/download/)
Pattern: http://.*\.?nytimes\.com/.*\.html?.*
Rewrite: $&&pagewanted=print
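To see why the brute-force pattern fails while the Showcase one works, here is a sketch using Python's re module as a stand-in for Sunrise (an assumption). The article link is made up. In Python the whole match, Sunrise's $&, is written \g<0>. The unescaped "?" in "html?ref=world" makes the "l" optional instead of matching a literal question mark, so the link never matches.

```python
import re

url = "http://www.nytimes.com/2006/10/14/world/africa/14example.html?ref=world"  # hypothetical

# Brute-force pattern as posted: the "?" after "html" is a metacharacter here.
brute_force = (r"(http://www\.nytimes\.com)/(2006)/(..)/(..)/(world)/"
               r"(africa|americas|asia|europe|middleeast)/(.*\.html?ref=world)")
print(re.fullmatch(brute_force, url))  # no match

# Same pattern with the question mark escaped.
fixed = brute_force.replace("html?ref", r"html\?ref")
print(re.fullmatch(fixed, url) is not None)

# Showcase rule: match the whole link and append to it ($& -> \g<0>).
showcase = r"http://.*\.?nytimes\.com/.*\.html?.*"
match = re.fullmatch(showcase, url)
printable = match.expand(r"\g<0>&pagewanted=print")
print(printable)
```

The Showcase pattern gets away with its own unescaped "?" because the trailing ".*" absorbs whatever follows ".htm" or ".html", including the query string.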
PippoPippini 10-16-2006, 10:16 AM Hi
Here is another rule for Financial Times:
Pattern: http://www\.ft\.com(.*)_i_rssPage=(.*)
Rewrite rule: http://www.ft.com$1print=yes.html
Enjoy the non-subscription articles.
Now I'm working on the Economist feed.
I entered these rules.
Pattern: http://www\.economist\.com/printedition/displayStory.cfm?(.*)&fsrc=RSS
Rewrite rule: http://www.economist.com/printedition/PrinterFriendly.cfm?$1
But it doesn't work. Is the "&" a metacharacter ? Any hints ?
Bye
Gaetano
The ? is a metacharacter. You must use \? to get a literal ?
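A short sketch of the difference, using Python's re module in place of Sunrise (an assumption) and a made-up article link whose "story_id" parameter is only illustrative. With the unescaped "?", the "m" of "cfm" becomes optional and the literal question mark ends up inside the captured group, so the rewritten link gets "??". The "&" needs no escaping; it is an ordinary character in a regular expression. This only shows the regex behaviour, not whether the site serves the page.

```python
import re

url = ("http://www.economist.com/printedition/displayStory.cfm"
       "?story_id=1234567&fsrc=RSS")  # hypothetical link

REWRITE = r"http://www.economist.com/printedition/PrinterFriendly.cfm?\1"  # $1 -> \1

unescaped = r"http://www\.economist\.com/printedition/displayStory.cfm?(.*)&fsrc=RSS"
escaped = r"http://www\.economist\.com/printedition/displayStory.cfm\?(.*)&fsrc=RSS"


def rewrite_url(pattern: str, link: str):
    """Return the rewritten link, or None if the pattern does not match."""
    match = re.fullmatch(pattern, link)
    return match.expand(REWRITE) if match else None


print(rewrite_url(unescaped, url))  # note the doubled "??"
print(rewrite_url(escaped, url))
```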
PippoPippini 10-17-2006, 06:27 AM The ? is a metacharacter. You must use \? to get a literal ?
HI DTM.
I modified the pattern so:
Pattern: http://www\.economist\.com/printedition/displayStory.cfm\?(.*)&fsrc=RSS
Rewrite rule: http://www.economist.com/printedition/PrinterFriendly.cfm?$1
but it doesn't work
I also tried removing the "?", but it still doesn't work.
I can't view the XML content of the feed directly, because The Economist uses Pheedo to rewrite the links. So I suspect that the links in the feed have a different format from the article pages that Pheedo points to.
Can someone analyze the Economist feed ?
http://www.economist.com/rss/printedition/economist_printedition.xml
Here is another newspaper: Il Sole 24 Ore (the major Italian financial newspaper)
Pattern: http://www\.ilsole24ore\.com(.*).shtml?(.*)
Rewrite rule: http://www.ilsole24ore.com$1_PRN.shtml
Bye
Gaetano
Laurens 10-17-2006, 06:58 AM The Economist print edition gives "403 Economist.com automatic downloading forbidden" (see Update report), so it won't ever work.
PippoPippini 10-17-2006, 07:28 AM The Economist print edition gives "403 Economist.com automatic downloading forbidden" (see Update report), so it won't ever work.
Hi Laurens.
I'm a subscriber of the Economist, so probably due to the cookie, I can download the articles. But I always download the "normal" articles instead of printer friendly version.
Now I try canceling my cookie and trying with a "clean" situation.
See after.
Gaetano
PippoPippini 10-17-2006, 07:47 AM Now I try canceling my cookie and trying with a "clean" situation.
Hi Laurens.
I tried with a clean situation, wiping my cookie.
I can still download all the articles, but for most of them there is only the title and the highlight, with a message that invites me to log in if I'm a subscriber. The articles are still in the "normal" version instead of the printer-friendly version.
Previously (with my credentials in the cookie) I downloaded all articles.
I never received the "403 Economist.com automatic downloading forbidden" message.
Another thing: I can't see my update report. When I select a document and request to view the report, I receive the message "Could not open update report file. (0x5)". Would a reinstallation fix the settings ?
Regards
Gaetano
igorsk 10-17-2006, 07:53 AM My guess is that they're checking User-Agent header and/or Referer.
PippoPippini 10-19-2006, 09:42 AM Someone is interested in Reuters feed ?
This is the pattern rule:
http://today\.reuters\.com(.*)type=businessNews&story(.*)
and this is the rewrite rule: http://today.reuters.com/misc/PrinterFriendlyPopup.aspx?type=businessNews&story$2
Hi all
I made a mistake in the pattern and rewrite rule for Reuters.
This is the right one:
Pattern rule: http://today\.reuters\.com(.*)type=(.*)
Rewrite rule: http://today.reuters.com/misc/PrinterFriendlyPopup.aspx?type=$2
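A sketch of the corrected rule with Python's re module standing in for Sunrise (an assumption; $2 becomes \2). The article links are made up: the "/news/articlenews.aspx" path, the "storyID" value and the "topNews" type are only illustrative. Capturing everything after "type=" lets one rule cover feeds other than businessNews.

```python
import re

FILTER = r"http://today\.reuters\.com(.*)type=(.*)"
REWRITE = r"http://today.reuters.com/misc/PrinterFriendlyPopup.aspx?type=\2"


def to_printable(url: str) -> str:
    """Rewrite a today.reuters.com article link; leave anything else untouched."""
    match = re.fullmatch(FILTER, url)
    return match.expand(REWRITE) if match else url


def article_url(feed_type: str) -> str:
    """Hypothetical helper: build an article link for a given feed type."""
    return ("http://today.reuters.com/news/articlenews.aspx"
            "?type=" + feed_type + "&storyID=EXAMPLE")


for feed_type in ("businessNews", "topNews"):
    print(to_printable(article_url(feed_type)))
```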
Bye
Gaetano
PippoPippini 11-13-2006, 04:07 AM Hi all.
At the beginning of November, Bloomberg stopped its RSS feed service.
Now the articles can no longer be downloaded automatically via Sunrise.
I hope they change this.
Regards.
Gaetano
sangahm 12-07-2006, 07:19 PM I've been trying to figure this out for a week to no avail so if anyone can help, I'd be really grateful. There are several writers on Sports Illustrated online you can get RSS feeds for, but SI.com spreads the articles out over 3 or 4 pages, so I'd like to nab just the printer-friendly version, but I can't seem to configure Sunrise XP to do it correctly.
For example, Dr Z's feed is at http://rss.cnn.com/rss/si_dr_z.rss
All of the articles off it start with http://sportsillustrated.cnn.com/2006/writers/dr_z*
and on the first page of each is a link to a single-page printer friendly version that begins http://si.printthis.clickability.com/*
How in the world do I get Sunrise to get that version?
Has anyone been able to solve this one yet? I'm trying to get the Austin American Statesman (statesman.com) and it seems to have the same clickability type of redirection.
I've been able to figure out everything except how to get the urlID (e.g. &urlID=20433337) which is probably embedded somewhere in the original source page.
Alternatively, if the clickability issue cannot be solved, how can I move all of the banner stuff from the top of the page, before the article, to the bottom? When reading in Plucker on the Palm, I have to scroll down several pages of links to other pages before I can actually read the article.
PippoPippini 08-02-2007, 09:35 AM Someone is interested in Reuters feed ?
This is the pattern rule:
http://today\.reuters\.com(.*)type=businessNews&story(.*)
and this is the rewrite rule: http://today.reuters.com/misc/PrinterFriendlyPopup.aspx?type=businessNews&story$2
Hi
Reuters' feed changed some time ago.
Now I tried with this pattern rule
http://www\.reuters\.com(.*)id(.*)\?feedType=RSS
and this rewrite rule
http://www.reuters.com/articlePrint?articleId=$2
But it doesn't work.
Any hint ?
Bye
Gaetano
HeffeD 08-02-2007, 12:17 PM I only grab the "Oddly Enough" feed, but it still works using:
http://today\.reuters\.com(.*)type=(.*)
With the rewrite:
http://today.reuters.com/misc/PrinterFriendlyPopup.aspx?type=$2
I can't imagine why they wouldn't switch all their feeds.
PippoPippini 08-03-2007, 08:20 AM I only grab the "Oddly Enough" feed, but it still works ...
Are you sure ?:blink:
From the RSS page, it seems that also the "Oddly Enough" feed has changed with the new format.
Anyway, I'm interested in the "normal" news feed like "Top News" or "Business".
Regards
Gaetano
HeffeD 08-03-2007, 11:54 AM Are you sure ?:blink:
From the RSS page, it seems that also the "Oddly Enough" feed has changed with the new format.
Yes, I read it every day. Unless Sunrise has the ability to make articles up, it still functions as it has been.
HeffeD 08-03-2007, 02:15 PM Well, I'm at a loss... Does the Top News category also follow this new format? Just as a quick test, I grabbed the Top News RSS using the previous mentioned setup and it downloads just fine. Maybe you have something else going on?
PippoPippini 08-06-2007, 04:59 AM Well, I'm at a loss... Does the Top News category also follow this new format? Just as a quick test, I grabbed the Top News RSS using the previous mentioned setup and it downloads just fine. Maybe you have something else going on?
HI.
Still doesn't work on my Palm.
I tried with "Business News" and "Top news" feeds.
I can't understand how it can work for you.
When I look at the links on the feeds page of the Reuters site, all of them lead to articles of the form
http://www.reuters.com/article/"Feed name"/id"id number"?feedType=RSS
The print form link is http://www.reuters.com/articlePrint?articleId="id number"
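For what it's worth, here is a sketch that tries the www.reuters.com rule from earlier in the thread against a link of this shape, using Python's re module in place of Sunrise (an assumption; $2 becomes \2). The feed name and ID are made up. On such a link the posted rule does produce the print form, which suggests the regex itself is not the problem, though that is only a guess. The stricter filter is hypothetical; it avoids a wrong split if an ID ever contains the letters "id".

```python
import re

# Rule as posted earlier in the thread.
FILTER = r"http://www\.reuters\.com(.*)id(.*)\?feedType=RSS"
REWRITE = r"http://www.reuters.com/articlePrint?articleId=\2"

# Hypothetical stricter filter: the ID is everything after "/id" up to the "?".
STRICT_FILTER = r"http://www\.reuters\.com/article/[^/]+/id([^/?]+)\?feedType=RSS"
STRICT_REWRITE = r"http://www.reuters.com/articlePrint?articleId=\1"


def rewrite_url(pattern: str, template: str, url: str) -> str:
    """Expand the template if the pattern matches the whole URL."""
    match = re.fullmatch(pattern, url)
    return match.expand(template) if match else url


plain = "http://www.reuters.com/article/businessNews/idEXAMPLE12345?feedType=RSS"  # hypothetical
tricky = "http://www.reuters.com/article/businessNews/idABCid123?feedType=RSS"     # hypothetical

print(rewrite_url(FILTER, REWRITE, plain))
print(rewrite_url(FILTER, REWRITE, tricky))                # greedy (.*) splits at the last "id"
print(rewrite_url(STRICT_FILTER, STRICT_REWRITE, tricky))
```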
There are no more links like "today.reuters.com" in RSS feeds page.
This is the reason for :blink: in my previous post.
I'll continue investigating.
Regards
Gaetano
HeffeD 08-06-2007, 11:23 AM What is the URL of the RSS feed you're on? I'm using http://www.microsite.reuters.com/rss/Topnews/default.aspx. (This one is Top News obviously)
The individual links to the articles are indeed "today.reuters.com".
PippoPippini 08-07-2007, 02:03 AM What is the URL of the RSS feed you're on? I'm using http://www.microsite.reuters.com/rss/Topnews/default.aspx. (This one is Top News obviously)
The individual links to the articles are indeed "today.reuters.com".
I'm using this link http://www.reuters.com/tools/rss which is referring to different RSS feeds. This link came from the bottom of the home page.
With your link, there is only the "Top news" feed.
Now I understand what you were talking about ;)
Regards
Gaetano
HeffeD 08-07-2007, 11:47 AM With your link, there is only the "Top news" feed.
No, they're all there. (At least the 5 I tested were. I didn't check the whole list) The "default.aspx" in the Top News link I posted is actually unnecessary.
In the RSS link you posted, follow the articles link (which loads a Podcast page?) to find out what the actual heading of the RSS feed is in the URL. For example, the page you linked to shows an "International" heading, yet if you follow the link, the URL actually calls it "worldNews". Similarly, "US News" is actually "domesticNews". If you put "domesticNews" in the microsite URL, it grabs the same articles, but in a different URL format. (today.reuters.com)
So a URL of http://www.microsite.reuters.com/rss/domesticNews/ will grab the "US News" feed listed on the page you linked to.
I can't find the link on the Reuters page that references the microsite URLs, but I originally got it from the Reuters page.
So anyway, you can currently still get your articles on the microsite URL using your old link rewrite format. Hopefully they continue to update microsite.