View Full Version : Help writing profile to get RSS feed


Deputy-Dawg
01-19-2008, 05:21 PM
I am in the throes of learning to program in Python. I have very nearly completed a profile to capture the RSS feed of my local newspaper. I am having a problem returning the print versions of the feeds. I know that there is a corresponding print format for each article.

Each article has the format:

http://www.nwaonline.net/articles/2008/01/19/news/011908arboozmanxna.txt

The corresponding article has the format:

http://www.nwaonline.net/articles/2008/01/19/news/011908arboozmanxna.prt

i.e., I need only replace the extension .txt with the extension .prt.

But try as I may I just can't seem to do it. Clearly I have a blind spot. Can anyone please help?

kovidgoyal
01-19-2008, 06:24 PM
url = original_url.rpartition('.')[0] + '.prt'
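
A quick way to try that one-liner outside a profile, as a minimal sketch: `str.rpartition('.')` splits on the last dot only, so the dots in the host name are left alone. The function name `print_url` is only an illustrative choice, not part of libprs500, and the sketch assumes the URL really does end in an extension.

```python
def print_url(original_url):
    """Swap whatever follows the last '.' in the URL for 'prt'."""
    # rpartition returns (head, separator, tail); head is everything
    # before the final '.', so only the extension is replaced.
    head, _sep, _ext = original_url.rpartition('.')
    return head + '.prt'


article = 'http://www.nwaonline.net/articles/2008/01/19/news/011908arboozmanxna.txt'
print(print_url(article))
```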

Deputy-Dawg
01-20-2008, 03:25 PM
Thanks, that was the leg up I needed. I have a bit more to do on the profile. When I am done, is there any way that I can integrate it into the GUI? I am so darn clumsy in typing! Being 74 with Parkinson's does make life a bit more complicated.

On the other hand the 'need' to learn yet another language is stimulating.

kovidgoyal
01-20-2008, 03:28 PM
Not at the moment, it's on my TODO list. And wow, I hope I'm capable of learning a new language at 74!

In the meantime, if you post the profile here, I'll add it to the GUI so that it will be available in the next release of libprs500.

Deputy-Dawg
01-20-2008, 04:25 PM
I've attached the one that I have working currently. There are still a couple of gotchas, including how to add some of their other feeds (aside from hard coding, that is) and what the optimum number of files to download is.

That being said, I am now trying to write the code for the other major newspaper in the area, "The Arkansas Democrat Gazette". They use one strange site for their RSS feed. When you access it from their RSS information page

http://www.nwanews.com/feeds/

by clicking on the link 'NWAnews.com (all daily "News" sections)', it takes you to

feed://feeds.feedburner.com/nwanewsall

and, of course, web2lrf does not recognize a URL beginning with 'feed:'. If you manually enter the address in the address window of Safari you get there, and if you enter

http://feeds.feedburner.com/nwanewsall

you are redirected. But neither approach seems to work with web2lrf.

kovidgoyal
01-20-2008, 04:29 PM
You can just have the get_feeds function return the feed URL like this


def get_feeds(self):
    return [('NWANews', 'http://feeds.feedburner.com/nwanewsall')]
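
If a link copied from a browser starts with `feed://`, it can be rewritten before it goes into the list returned above. This is only a sketch: the helper `normalize_feed_url` is hypothetical, and it assumes the same feed is served over plain `http://`, as it is for the FeedBurner address in this thread.

```python
def normalize_feed_url(url):
    """Turn a browser-style feed:// link into an ordinary http:// URL."""
    prefix = 'feed://'
    if url.startswith(prefix):
        # Keep everything after the scheme and put http:// in front of it.
        return 'http://' + url[len(prefix):]
    return url  # anything else is passed through untouched


print(normalize_feed_url('feed://feeds.feedburner.com/nwanewsall'))
```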

Deputy-Dawg
01-21-2008, 01:51 PM
Thanks, again...
I am appending a newer version of the profile to get the Morning News. Much to my surprise, a number of the print files contain references to images, which web2lrf was resolving, making a bit of a mess of the files. I have added a line of code which seems to have fixed the problem.

The profile for the Democrat Gazette is another thing. The call to the file (the one that would be displayed on your monitor with all the ads and other BS - the url in the "href=" statement) is in the form of:

http://feeds.feedburner.com/~r/nwanewsall/~3/219845886/

which is somewhere resolved to:

http://www.nwanews.com/adg/News/214246/

and I of course want:

http://www.nwanews.com/adg/News/214246/print/

but if you append 'print/' to the originally called url giving you:

http://feeds.feedburner.com/~r/nwanewsall/~3/219845886/print/

it too is resolved to:

http://www.nwanews.com/adg/News/214246/

and although the desired URL is embedded in the first called file, I have yet to come up with code that will extract it without harming the print file. (This is because the print file and the web file are structurally identical in the area in which we are interested.)

If you have a moment, take a look and see if you can suggest an approach. Also, I should note that even to begin to attempt to extract and use the URL from the display file it is necessary to increase the amount of recursion to 3, which introduces its own set of difficulties.

Sigh!!!! Programming is such fun

Deputy-Dawg
01-21-2008, 07:48 PM
No need to respond to the last question. I found a source for the desired URLs in the document. Sometimes you really do have to read the code quite literally. In any event, here is a profile for the Arkansas Democrat Gazette and several wholly owned subsidiaries.

Again, thanks. Once I got a feel for the syntax being used it made climbing on to that new bike a bit easier. Now I have to learn to deal with the editor, or get a new one (I am using BBEdit 8.7). Sometimes, indeed more often than not, Python will complain about an indent error even when there is none by visual inspection of the code or by BBEdit's format checker. The only fix seems to be to delete the offending code and re-enter it. I am sure this can be automated; I just have not figured it out as yet.

Deputy-Dawg
01-22-2008, 10:00 PM
I am working on another profile and am running into a rather different problem, or at least think I am. The url that I need returned is:

http://www.fides.org/aree/news/newsdet.php?idnews=11302&lan=eng

When I invoke the profile I get the following message:

Macintosh-3:books billc$ web2lrf --verbose --user-profile Agenzia_Fides.py
[ERROR] __init__.pyo:210: Error parsing article:
<item rdf:about="http://www.fides.org/aree/news/newsdet.php?idnews=11302&amp;lan=eng">
<dc:format>text/html</dc:format>
<dc:date>2008-01-21T14:00:00+01:00</dc:date>
<dc:source>http://www.fides.org</dc:source>
<dc:creator>Fides Service</dc:creator>
<title>VATICAN - The Pope's Angelus: &#x201C;The Church's evangelising mission is part of her ecumenical path&#x201D;; &#x201C;I am bound to the university world by love for the quest for truth, for discussion, frank dialogue, respectful of reciprocal positions. All this is also part of the Church's mission &#x201D;</title>
<link>http://www.fides.org/aree/news/newsdet.php?idnews=11302&amp;lan=eng</link>
<description>&lt;b&gt;VATICAN - The Pope's Angelus: &#x201C;The Church's evangelising mission is part of her ecumenical path&#x201D;; &#x201C;I am bound to the university world by love for the quest for truth, for discussion, frank dialogue, respectful of reciprocal positions. All this is also part of the Church's mission &#x201D;&lt;/b&gt;&lt;br&gt;&lt;br&gt;
Vatican City (Agenzia Fides) - On Sunday 20 January the Holy Father Pope Benedict XVI dedicated his midday Angelus reflection to the issue of ecumenism, this being the Week of Prayer for Christian Unity, and to his planned and then cancelled visit...</description>
</item>
Traceback (most recent call last):
File "libprs500/ebooks/lrf/web/profiles/__init__.pyo", line 197, in parse_feeds
File "libprs500/ebooks/lrf/web/profiles/__init__.pyo", line 269, in strptime
KeyError: u'2008-01-21T14:00:0'
[ERROR] __init__.pyo:210: Error parsing article:
<item rdf:about="http://www.fides.org/aree/news/newsdet.php?idnews=11303&amp;lan=eng">
<dc:format>text/html</dc:format>
<dc:date>2008-01-21T14:00:00+01:00</dc:date>
<dc:source>http://www.fides.org</dc:source>
<dc:creator>Fides Service</dc:creator>
<title>VATICAN - Pope Benedict XVI visits Capranica College: &#x201C;Without friendship with Jesus it is impossible for a Christian, and even more so for a priest, to bring to completion the mission entrusted by the Lord &#x201D;</title>
<link>http://www.fides.org/aree/news/newsdet.php?idnews=11303&amp;lan=eng</link>
<description>&lt;b&gt;VATICAN - Pope Benedict XVI visits Capranica College: &#x201C;Without friendship with Jesus it is impossible for a Christian, and even more so for a priest, to bring to completion the mission entrusted by the Lord &#x201D;&lt;/b&gt;&lt;br&gt;&lt;br&gt;
Vatican City (Agenzia Fides) - &#x201C;Under various circumstances I have reminded seminarians and priests of the urgency of nurturing a profound interior life, personal and continual contact with Christ in prayer and contemplation, and genuine striving for...</description>
</item>


The only line in the source file that contains anything that resembles the URL is:

<a href="http://www.fides.org/aree/news/newsdet.php?idnews=11302&amp;lan=eng">

which, if I am reading the error message correctly, web2lrf cannot parse. I suspect that the problem is in the '&amp;' representation of the '&' in the URL, and if that is the case I see no way that I can code anything in the profile to deal with it.
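
For reference, `&amp;` is simply the standard XML/HTML escape for `&`, and it can be decoded by hand if that ever proves necessary. A small sketch using the standard library's `html.unescape` (Python 3); whether web2lrf does anything like this internally is not something this sketch claims.

```python
from html import unescape

# The link exactly as it appears in the feed source, with the escaped ampersand.
raw_link = 'http://www.fides.org/aree/news/newsdet.php?idnews=11302&amp;lan=eng'

# unescape() converts character references such as &amp; back to plain text.
link = unescape(raw_link)
print(link)
```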

kovidgoyal
01-22-2008, 10:59 PM
No, the problem is the weird date format:

2008-01-21T14:00:00+01:00


The simple way to fix it is to set

use_pubdate = False


The more correct way to fix it is to override the strptime function



def strptime(self, raw):
    return calendar.timegm(time.strptime('%Y-%m-%dT%H:%M:%S+01:00', raw))-3600



You might have to play with the above strptime to get it to parse the date correctly.

Deputy-Dawg
01-23-2008, 11:36 PM
I have added the following to my profile:

import calendar
import time

def strptime(self, raw):
    return calendar.timegm(time.strptime('%Y-%m-%dT%H:&M:%S+01:00', raw))-3600



When I run the profile in web2lrf I get the following error message:

Traceback (most recent call last):
File "libprs500/ebooks/lrf/web/profiles/__init__.pyo", line 197, in parse_feeds
File "/Users/billc/Desktop/Books/ag.py", line 34, in strptime
return calendar.timegm(time.strptime('%Y-%m-%dT%H:&M:%S+01:00', raw))-3600
File "_strptime.pyo", line 331, in strptime
ValueError: time data did not match format: data=%Y-%m-%dT%H:&M:%S+01:00 fmt=2008-01-21T14:00:00+01:00


To validate the code I inserted it into a profile (nwa2.py) which I knew worked and ran it and, of course, it failed with a similar error message (i.e., about the formats not matching). I then altered the string to match the one given, using the symbols from Python's documentation, and lo...... it works.

Finally I added

use_pubdate = False

and that too works. There is an error in the string, but I sure don't see it! Is there any debug code that would permit me to look at the parameters and data that are being passed? As I read the code, the string should match

%Y = Decimal year with century prepended
%m = Decimal month
%d = Decimal day
%H = Decimal Hour (24 hour notation)
%M = Decimal Minutes
%S = Decimal Seconds

The remaining characters (i.e., those within the quotes) "-", ":", "T", "1", "0", "2", "4", "8" represent themselves.

But it does not.

BTW the only way to get the profile Dem_Gaz.py to run is to use use_pubdate = False because, apparently, the files have no publication date, or that is what the error message says.

Got to go to bed. Work on it some more tomorrow.

kovidgoyal
01-24-2008, 01:56 AM
&M should be %M in the format string

Incidentally the next release of libprs500 will have the ability to add user created profiles to the GUI (it's already implemented in svn).

Deputy-Dawg
01-24-2008, 08:58 AM
Yes, it should be. And it was in the original file. I retyped it and made a typo. That being said, when the correct string is used (I hope I typed it correctly this morning) I still get the following error message:

[ERROR] __init__.pyo:210: Error parsing article:
<item rdf:about="http://www.fides.org/aree/news/newsdet.php?idnews=11338&amp;lan=eng">
<dc:format>text/html</dc:format>
<dc:date>2008-01-22T14:00:00+01:00</dc:date>
<dc:source>http://www.fides.org</dc:source>
<dc:creator>Fides Service</dc:creator>
<title>ASIA/HOLY LAND - Caritas Jerusalem: calls for an end to humanitarian crisis in Gaza and assistance for Palestinian children</title>
<link>http://www.fides.org/aree/news/newsdet.php?idnews=11338&amp;lan=eng</link>
<description>&lt;b&gt;ASIA/HOLY LAND - Caritas Jerusalem: calls for an end to humanitarian crisis in Gaza and assistance for Palestinian children&lt;/b&gt;&lt;br&gt;&lt;br&gt;
Jerusalem (Agenzia Fides) - Caritas Jerusalem has called for the block of persons and goods which is causing the humanitarian crisis in Gaza to be lifted. It joined major international humanitarian organisations in warning of a serious human and soci...</description>
</item>
Traceback (most recent call last):
File "libprs500/ebooks/lrf/web/profiles/__init__.pyo", line 197, in parse_feeds
File "/Users/billc/Desktop/Books/ag.py", line 34, in strptime
return calendar.timegm(time.strptime('%Y-%m-%dT%H:%M:%S+01:00', raw))-3600
File "_strptime.pyo", line 331, in strptime
ValueError: time data did not match format: data=%Y-%m-%dT%H:%M:%S+01:00 fmt=2008-01-22T14:00:00+01:00


I have examined the value in the line:

<dc:date>2008-01-22T14:00:00+01:00</dc:date>

in a hex editor to see if there were any 'strange' characters in it. There are none. I assume that this is the value that is being passed to strptime. If that is the case, I don't understand what is not being matched.

kovidgoyal
01-24-2008, 11:18 AM
Oops, my mistake. The arguments were swapped; it should be


time.strptime(raw, '%Y-%m-%dT%H:%M:%S+01:00')
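
Put together as a standalone sketch: `time.strptime` takes the data string first and the format string second. The function below is a hypothetical helper rather than the profile method itself, and it handles any `+HH:MM` or `-HH:MM` offset instead of hard-coding `+01:00`; that generalization is an assumption beyond what this feed shows.

```python
import calendar
import time


def parse_feed_date(raw):
    """Convert e.g. '2008-01-21T14:00:00+01:00' to seconds since the Unix epoch (UTC)."""
    # The first 19 characters are the date and time; then a sign; then HH:MM.
    stamp, sign, offset = raw[:19], raw[19], raw[20:]
    parsed = time.strptime(stamp, '%Y-%m-%dT%H:%M:%S')  # data first, format second
    hours, minutes = offset.split(':')
    delta = int(hours) * 3600 + int(minutes) * 60
    if sign == '-':
        delta = -delta
    # timegm treats the parsed fields as UTC, so subtract the zone offset.
    return calendar.timegm(parsed) - delta


print(parse_feed_date('2008-01-21T14:00:00+01:00'))
```

For the `+01:00` dates in this feed the result is the same as subtracting 3600 seconds by hand.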

The Old Man
01-25-2008, 08:49 AM
Well, I have been reading this thread and I have learned one thing.
I will never be able to learn how to add feeds. - My fault, not yours.

Any chance of adding a feed from the Jerusalem Post http://www.jpost.com/
to the next version of libprs500?
Thanks

kovidgoyal
01-25-2008, 09:46 AM
All feed requests should go here
https://libprs500.kovidgoyal.net/ticket/405

The Old Man
01-25-2008, 12:45 PM
If I knew how to use TrackTickets I probably wouldn't have to.:blink:

kovidgoyal
01-25-2008, 12:46 PM
Just register an account at https://libprs500.kovidgoyal.net/register then login and go to the ticket site, it will let you add a comment. Add a comment with your request.

The Old Man
01-25-2008, 01:03 PM
Well, I did it. Not sure what I did - but I did it.:chinscratch:

kovidgoyal
01-25-2008, 01:28 PM
Now you just have to wait for some kindly soul to write the profile for you :)

The Old Man
01-25-2008, 03:01 PM
Yes. I wonder who? :xmas:

kovidgoyal
01-25-2008, 03:21 PM
It isn't going to be me :) I prefer to work on the infrastructure of libprs500 and only add feeds if I want to use them. But there have been several people who have expressed an interest in writing feeds, so hopefully one of them is interested in Middle East news. :fingersx:

Deputy-Dawg
01-25-2008, 03:54 PM
The Old Man,
You didn't have to wait long; attached is a quick and dirty profile that will download the first 10 articles in the following Jerusalem Post feeds:

Front Page
Israel News
International News
Middle East News
Editorials

kovidgoyal,
The last bit of code fixed up the problem with pubdate in the profile for Agenzia Fides.
I am still having some problems with how the summary is being displayed (cosmetic but ugly: various HTML tags are being displayed, most notably <b></b> and <br>).

Meanwhile I have started on one for the Christian Science Monitor. And they have one wild way of directing you to the files. The href (and, later on, a <link></link>) points to:

http://rss.csmonitor.com/~r/feeds/top/~3/222417173/p04s01-woaf.html

which resolves to

http://www.csmonitor.com/2008/0124/p04s01-woaf.html

with the print version being at

http://www.csmonitor.com/2008/0124/p04s01-woaf.htm

The rub is that if you change the original address to

http://rss.csmonitor.com/~r/feeds/top/~3/222417173/p04s01-woaf.htm

it too resolves to the .html file.

At first I thought this was going to be an easy one: the date is in the number 222417173, so all we have to do is convert it to an ASCII date, format it as '/%Y/%m%d/' to get /2008/0124/, and build the required address string. Doesn't work: the number resolves to 1977 01 18. I can fix it by adding 2001 01 07 as an offset (that may have to be 06). Is that likely to be legitimate? Have I overlooked something?

The Christian Science Monitor also does not return a valid pubdate, and unless you set use_pubdate = False you go nowhere. However, in examining the source for the feed there always seem to be two date entries for each article

articlesortdate="0222880260.000000"
articlelocaldate="0222885964.644872"

which seem to be the epoch dates of the files. Would it not be possible to capture either or both? Can I get at them in my profiles? I am a bit unsure what declarations would have to be made.
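
One possible reading of those two attributes, offered as an assumption rather than a confirmed fact: they may count seconds from 1 January 2001 (UTC), a reference date that Apple software commonly uses, instead of from the Unix epoch of 1970. Under that assumption the conversion can be sketched as below; `APPLE_EPOCH` and `apple_timestamp_to_utc` are names invented for the illustration.

```python
from datetime import datetime, timedelta

# Assumed reference date (UTC). This is a guess about the attribute's meaning.
APPLE_EPOCH = datetime(2001, 1, 1)


def apple_timestamp_to_utc(value):
    """Interpret a string such as '0222880260.000000' as seconds after APPLE_EPOCH."""
    return APPLE_EPOCH + timedelta(seconds=float(value))


print(apple_timestamp_to_utc('0222880260.000000'))
```

With the articlesortdate value quoted above this gives 2008-01-24 15:11:00, which at least falls on a plausible day for the article.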

The Old Man
01-25-2008, 04:22 PM
Thank you. Now I will attempt to use it. Wish me luck.:thumbsup:

kovidgoyal
01-25-2008, 04:43 PM
@Deputy-Dawg

Why not let the Christian Science Monitor's servers figure out the date mapping for you? Here's some code that should do just that:


def print_version(self, url):
    resolved_url = self.browser.open(url).geturl()
    return resolved_url.strip()[:-1]


It's a little slow as it involves going out to the network, but it's reliable.
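
The `[:-1]` slice simply drops the final character, which turns `.html` into `.htm`. A slightly more defensive variant of that last step, as an offline sketch: the helper name `to_print_url` is hypothetical, and the network call that resolves the redirect is deliberately left out.

```python
def to_print_url(resolved_url):
    """Map an already-resolved article URL to its print version."""
    url = resolved_url.strip()
    # Only shorten URLs that really end in .html; leave anything else alone.
    if url.endswith('.html'):
        return url[:-1]
    return url


print(to_print_url('http://www.csmonitor.com/2008/0124/p04s01-woaf.html'))
```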

As for article date, I'm afraid there isn't any way to access that short of re-implementing the parse_feeds function.

Deputy-Dawg
01-25-2008, 10:27 PM
I am attaching a copy of the profile for the Christian Science Monitor. I am having a problem that you may have to see to understand. For reference, every article in the feed has a structure like this:

<div class="apple-rss-article apple-rss-read" onclick="javascript:handleArticleClick(this)" showSeparator="true"
articlesortdate="0223013377.017225" articlesorttitle="gaza busts out of its blockade" articlesortsource="" sourceindex="0" articlesortid="00000000000000000010" articlelocaldate="0223013377.017225" articleid="a91c09df43f4cf6a33ffed73cecf111efe81204a">
<div class="apple-rss-article-footer"></div>

<div class="apple-rss-article-head" >
<div class="apple-rss-unread-dot"><img src="file://localhost/System/Library/Frameworks/PubSub.framework/Versions/A/Resources/PubSubAgent.app/Contents/Resources/unread.tif" width="9" height="9" /></div>
<div class="apple-rss-subject" title="Gaza busts out of its blockade"><a href="http://rss.csmonitor.com/~r/feeds/top/~3/222417168/p01s04-wome.html">Gaza busts out of its blockade</a></a></div>


<div class="apple-rss-summary" >A new hole opens in the Arab-Israeli peace strategy of isolating Hamas.</div>
<div class="apple-rss-date" title="Today, 10:09 PM">Today, 10:09 PM</div>
</div>

<div class="apple-rss-article-body-container">
<div class="apple-rss-article-body">
A new hole opens in the Arab-Israeli peace strategy of isolating Hamas.
<p><a href="http://rss.csmonitor.com/~a/feeds/top?a=rt0NVe"><img src="http://rss.csmonitor.com/~a/feeds/top?i=rt0NVe" border="0" /></a></p>
<div class="feedflare"><a href="http://rss.csmonitor.com/~f/feeds/top?a=7LSTtWD"><img src="http://rss.csmonitor.com/~f/feeds/top?i=7LSTtWD" border="0" /></a> <a href="http://rss.csmonitor.com/~f/feeds/top?a=bYiAxtD"><img src="http://rss.csmonitor.com/~f/feeds/top?i=bYiAxtD" border="0" /></a> <a href="http://rss.csmonitor.com/~f/feeds/top?a=ISh8dED"><img src="http://rss.csmonitor.com/~f/feeds/top?i=ISh8dED" border="0" /></a> <a href="http://rss.csmonitor.com/~f/feeds/top?a=FL3bvEd"><img src="http://rss.csmonitor.com/~f/feeds/top?i=FL3bvEd" border="0" /></a></div>
<img src="http://rss.csmonitor.com/~r/feeds/top/~4/222417168" height="1" width="1" />



&nbsp;<a class="apple-rss-article-link" href="http://rss.csmonitor.com/~r/feeds/top/~3/222417168/p01s04-wome.html">Read more&hellip;</a>
<!-- end articlebody --></div></div>
<!-- end article --></div>


The entire block:

A new hole opens in the Arab-Israeli peace strategy of isolating Hamas.
<p><a href="http://rss.csmonitor.com/~a/feeds/top?a=rt0NVe"><img src="http://rss.csmonitor.com/~a/feeds/top?i=rt0NVe" border="0" /></a></p>
<div class="feedflare"><a href="http://rss.csmonitor.com/~f/feeds/top?a=7LSTtWD"><img src="http://rss.csmonitor.com/~f/feeds/top?i=7LSTtWD" border="0" /></a> <a href="http://rss.csmonitor.com/~f/feeds/top?a=bYiAxtD"><img src="http://rss.csmonitor.com/~f/feeds/top?i=bYiAxtD" border="0" /></a> <a href="http://rss.csmonitor.com/~f/feeds/top?a=ISh8dED"><img src="http://rss.csmonitor.com/~f/feeds/top?i=ISh8dED" border="0" /></a> <a href="http://rss.csmonitor.com/~f/feeds/top?a=FL3bvEd"><img src="http://rss.csmonitor.com/~f/feeds/top?i=FL3bvEd" border="0" /></a></div>
<img src="http://rss.csmonitor.com/~r/feeds/top/~4/222417168" height="1" width="1" />



is being used as a summary in the contents page. I have tried many various forms in the preprocess_regexps section to no avail. I also tried setting summary_length = 0 (and 100, on the off chance it did not accept 0 as an argument) and again no effect. Of course the profile is usable, but the output is ugly as sin!

Finally, is it possible to embed an HTML option in the profile? Specifically --ignore-tables; again, it is only for cosmetic effect.

kovidgoyal
01-25-2008, 10:34 PM
Set

html_description = True
html2lrf_options = ['--ignore-tables']

Lemoine
01-26-2008, 03:44 PM
Hello,
Using this wonderful program (thanks a lot, Kovid!), I have tried to add support for "Le Monde", a French newspaper. It was working pretty well, but yesterday they changed both their structure and encoding, switching from utf8 to iso-8859-1.
Now, my new profile captures the articles but with weird encoding.

If I add in the regex, for instance,

<head><meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"></head>

my characters are correct, but all the crap is not stripped from the articles.

Here is my profile

I would be very grateful for your help...:)

kovidgoyal
01-26-2008, 03:59 PM
Not sure what you mean. I tried it and it works fine for me. See attached LRF

Lemoine
01-26-2008, 04:17 PM
This file seems to be fine, but some French letters such as "à, ê, ù..." are not correctly displayed.

à for instance becomes r ...


That is my problem, which appears only in the articles, not in the index and the abstracts.
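
A symptom like this usually points to bytes being decoded with the wrong character set. A minimal sketch of the effect in Python 3, assuming the site sends ISO-8859-1 while the reader expects UTF-8; the sample phrase is taken from an article title, and nothing here reflects how libprs500 itself decodes pages.

```python
# Bytes as an ISO-8859-1 server would send them.
raw = 'prêt à oeuvrer'.encode('iso-8859-1')

# Decoding with the wrong charset damages every accented letter...
wrong = raw.decode('utf-8', errors='replace')
# ...while decoding with the declared charset recovers the text.
right = raw.decode('iso-8859-1')

print(repr(wrong))
print(repr(right))
```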

kovidgoyal
01-26-2008, 04:18 PM
Ah ok I didn't see that because I don't know French. Can you give me some easy to recognize sentence I can use to test things?

Lemoine
01-26-2008, 04:32 PM
In the international section

Article:
"Londres est prêt à oeuvrer à un désarmement nucléraire total"

The correct title is:

"Lemonde.fr:Londres est prêt à oeuvrer à un désarmement nucléaire total - Europe"

Thanks a lot for your help!

kovidgoyal
01-26-2008, 04:55 PM
Try the attached

Lemoine
01-26-2008, 05:09 PM
Awesome!

Thanks a lot for the program and for the help!

Deputy-Dawg
01-26-2008, 10:41 PM
From DefaultProfile

timefmt = ' [%a %d %b %Y]' # The format of the date shown on the first page
url_search_order = ['guid', 'link'] # The order of elements to search for a URL when parsing the RSS feed
pubdate_fmt = None # The format string used to parse the


Which would imply that only the classes 'link' and 'guid' are searched for the link. This is borne out by the fact that when you process the feed from the Denver Post with

use_pubdate = False

you get the error message

Skipping article as it does not have a link url

In the source for the feed, the following code appears for each article:

<li class="regularitem" xmlns:dc="http://purl.org/dc/elements/1.1/">
<h4 class="itemtitle">
<a href="http://www.denverpost.com/ci_8088727">
Man hit in crosswalk, killed
</a>
</h4>
<h5 class="itemposttime">
<span>Posted: </span>
Sat, 26 Jan 2008 20:09:37 -0700
</h5>
<div class="itemcontent" name="decodeable">
A 22-year-old Denver resident was killed in Aurora Saturday when a 71-year-old man driving a pickup ran a red light on South Parker Road, then veered into a crosswalk.
</div>
</li>

The URL for the article is only contained in the class itemtitle.

Similarly, in the feeds from Izvestia the URL is only contained in the classes

mainnewstime and mainnewsnotice

and even then only the variable part of the link, in the form:

/world/asia/20080127/97803220.html

which has to be concatenated with http://www.rian.ru to obtain the fully qualified address.
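
Joining a site-relative path to its host is a job for `urljoin` from the standard library (Python 3 module path shown). This is only a sketch of the string handling; `BASE` and `absolute_link` are illustrative names, and where such a step would hook into a profile is left open.

```python
from urllib.parse import urljoin

BASE = 'http://www.rian.ru'


def absolute_link(relative):
    """Build the fully qualified address from the variable part of the link."""
    # urljoin copes with a leading '/', so no manual slash handling is needed.
    return urljoin(BASE, relative)


print(absolute_link('/world/asia/20080127/97803220.html'))
```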

Is it possible to handle either of these cases in web2lrf?

BTW a profile runs much faster in the Terminal than when embedded in libprs500. Also, I have found that if I attempt to run more than about 3 profiles sequentially, libprs500 crashes. I can get around the problem by quitting and restarting. No need to remove the previously captured feeds.

kovidgoyal
01-27-2008, 12:23 PM
Not sure I follow. The Denver Post, for example, has its links in both <link> and <guid> elements; see for example http://feeds.feedburner.com/dp-news-national?format=xml

The problem is that the links are embedded in a CDATA section. So you should write print_version to handle that.
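
As a rough illustration of the string handling involved, a CDATA wrapper can be peeled off with a regular expression. The helper `strip_cdata` is hypothetical, and the sketch assumes the link text reaches the profile still wrapped in `<![CDATA[...]]>`; it says nothing about what libprs500 actually passes to print_version.

```python
import re

# Matches <![CDATA[ ... ]]> and captures whatever sits inside it.
CDATA = re.compile(r'<!\[CDATA\[(.*?)\]\]>', re.DOTALL)


def strip_cdata(text):
    """Return text with any CDATA wrappers removed and outer whitespace trimmed."""
    return CDATA.sub(lambda match: match.group(1), text).strip()


print(strip_cdata('<![CDATA[http://www.denverpost.com/ci_8088727]]>'))
```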

The GUI crashing should be fixed in the next release.

EDIT: Actually, since all the elements in that feed are CDATA escaped, you're going to have to wait for the next release of libprs500 to create a feed for the Denver Post.

Deputy-Dawg
01-27-2008, 01:19 PM
Sigh! What it means is that the Denver Post is offering what it characterizes as RSS feeds at at least 3 different URLs and in at least two different formats. The first one, the one I was questioning is at:

http://feeds.feedburner.com/dp-news-national


The xml feed which can be accessed from the link on the page I was using, and the one you found:

feed://feeds.feedburner.com/dp-news-national?format=xml

and finally one that you will be sent to if you click on the blue "RSS" link in your browser's address box and follow your nose to the XML format of the RSS feed

feed://feeds.denverpost.com/dp-news-national?format=xml

which is pointing, I think, to the same page as the previous one.

I guess I will have to wait until the new version of DefaultProfile is available to work on it any further. This is a rather interesting one to work on because it has no printer-friendly version available, and to be useful at all it will be necessary to code a preprocess_regexps that will strip out the nasty bits, leaving only the story. Looks like fun. But.....

As for Izvestia, I think it will be necessary to teach DefaultProfile to work with Russian syntax and the Cyrillic alphabet. I could open a ticket, but I suspect it would be a warm day in Siberia before it would happen.

Thanks again for all of the assistance. BTW just as an example of Russian syntax (in English) try this one on for size!

<div class="mainrubric">

or another (in Russian, in Cyrillic)

<!-- /список новостей -->

And I speak every language except Greek! But that is greek to me! :joker:

Deputy-Dawg
01-28-2008, 07:04 PM
Attached is a profile to capture several of the feeds from Reuters. This proved to be fairly interesting to write. First of all, the URL returned by web2lrf on this service only contained the file id. Took me a while to figure it out. Also, they do not put up a file that is printer friendly, so it was necessary to create code that would parse out the text from the display page. It was doable, but it causes the program to be quite slow and apparently is quite CPU intensive. At least the cooling fan in my MacBook Pro runs quite a bit more than is its usual wont.

In any event, enjoy.

kovidgoyal
01-28-2008, 07:38 PM
They do seem to provide a print version. For example, given the id
USN2740109620080129
The print version of the article is at
http://www.reuters.com/articlePrint?articleId=USN2740109620080129
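
Building that address from an article id is plain string concatenation. A minimal sketch, with `reuters_print_url` as a made-up helper name; whether the site actually serves a print page at this address is not something the sketch can confirm.

```python
PRINT_BASE = 'http://www.reuters.com/articlePrint?articleId='


def reuters_print_url(article_id):
    """Return the print-page address for a given Reuters article id."""
    return PRINT_BASE + article_id


print(reuters_print_url('USN2740109620080129'))
```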

Also, since you're writing a lot of feeds, can I ask you to attach them to
https://libprs500.kovidgoyal.net/wiki/UserProfiles so other people can find them easily (I'll pick them up for inclusion from there, when I get the time). You will need to create an account https://libprs500.kovidgoyal.net/register and log in https://libprs500.kovidgoyal.net/login before being able to edit the Wiki page. Thanks.

Deputy-Dawg
01-28-2008, 08:56 PM
Be happy to!! Already have an account, just was not sure where to put them. I am working on one for the AP. It seems to work fine except that the TOC points to the end of the article, not the beginning.

I too thought that Reuters has a print version available, but when you go to the URL you posted you don't in fact go to the print page but back to the display page. I don't know why but...

I am attaching a copy of the AP profile to this message.

Deputy-Dawg
01-29-2008, 04:11 PM
Kovidgoyal,
This is weird. I did a run with my AP profile using the --keep-downloaded-files option. I then took the kept files and moved them into the normal user space while preserving the relative path lengths. I then examined them in BBEdit, GoLive and Safari and did not find anything wrong in the code or in the appearance of the files. Finally I converted them to an LRF using html2lrf, and the TOC points to the end of the stories, not the beginning. What have I done wrong? Or better yet, how do I fix it?

kovidgoyal
01-29-2008, 04:18 PM
Probably a bug in html2lrf, open a bug report and I'll look at it when I get time.

Deputy-Dawg
01-31-2008, 07:42 AM
I have just uploaded new copies of the Reuters profile and the AP profile. The Reuters one corrects a minor typo. The AP profile is a full working version.

Lemoine
02-24-2008, 01:00 PM
Hello,

The description of each article in the RSS feeds disappeared a week ago.

Is it a new feature? Is it a bug?

I miss those descriptions very much and I wonder how I can retrieve them.

Does anyone have an idea?

Thanks in advance

:thanks:

kovidgoyal
02-24-2008, 01:01 PM
In which feed?

Lemoine
02-24-2008, 02:28 PM
In all the feeds I can use...

For instance Newsweek or The New Yorker, but all the feeds are affected by this problem.

Only Title and Pubdate appear, not Description.

kovidgoyal
02-24-2008, 03:32 PM
Well, Newsweek has started using the description tag for ads.

There was a bug affecting the others, which will be fixed in the next release.

Lemoine
02-24-2008, 03:39 PM
Thank you Kovid for all your work....:)

Deputy-Dawg
02-24-2008, 08:10 PM
Kovid,
All that I have checked, even my custom ones. It appears to be a "feature" of web2lrf, because I get the same result when I run it from the terminal.

BTW do you have any experience in installing and using GutenMark on an Intel Mac? I can't get it to run on mine. I am pretty sure that Perl is operative inasmuch as Tidy does work.

kovidgoyal
02-24-2008, 08:17 PM
No I don't use Macs.

Deputy-Dawg
03-09-2008, 02:45 PM
Kovid,
Is it possible to use the web2lrf function to capture feeds like:

http://www.bloomberg.com/news/exclusive/

If so, could you give me another leg up?

kovidgoyal
03-09-2008, 02:51 PM
Yeah, you have to parse the HTML page and create the list of articles manually. See the profile in atlantic.py for an example of how to do this.
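
To give a feel for what "parse the HTML page" can mean, here is a minimal sketch using only the Python 3 standard library. It is not a copy of atlantic.py: the class `LinkCollector`, the function `list_articles`, and the sample markup are all invented for illustration, and a real page would need filtering to skip navigation and advertising links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (title, absolute URL) pairs from every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.articles = []
        self._href = None   # URL of the link currently being read, if any
        self._text = []     # pieces of text seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            # Collapse runs of whitespace in the link text into single spaces.
            title = ' '.join(''.join(self._text).split())
            if title:
                self.articles.append((title, self._href))
            self._href = None


def list_articles(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.articles


sample = '<ul><li><a href="/news/one.html"> First\n story </a></li></ul>'
print(list_articles(sample, 'http://www.example.com/news/exclusive/'))
```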

Deputy-Dawg
03-09-2008, 04:32 PM
Kovid,
Thanks, I'll take a look.

BTW html2lrf is broken in 0.4.42. I call it and it just hangs. All I get is a single
>

at the extreme edge of the terminal screen. All I can do is kill the call with a Command-period. Sigh!

kovidgoyal
03-09-2008, 05:15 PM
I'm assuming this is on OS X? I just re-uploaded a new build of 0.4.42 for OS X. Try it.

Deputy-Dawg
03-09-2008, 06:55 PM
Yes, I am using Mac OS X (actually Leopard 10.5.2 on a dual core MacBook Pro). Just moments ago I downloaded the most current version of libprs500, that being 0.4.42, and still have the problem. If there is anything I can do to help track this gremlin down you have but to ask.

kovidgoyal
03-09-2008, 07:17 PM
Run it on some simple HTML file with the --verbose switch. What happens?

Deputy-Dawg
03-09-2008, 08:19 PM
Kovid,
The results are a tad strange. Allow me to explain. To insure that I always use the same command when I process a book I type it out in a text editor then do a copy and paste into the terminal. With that as a premise, and to be sure that the program hung I copied and pasted the following command line into my terminal and hit a return:

html2lrf --verbose --force-page-break-before-tag='page-break' --blank-after-para --base-font-size=8 creeds-000.html

And the terminal hung as I reported previously.

I then entered the following command, where the file creeds-002.html is Chapter one of the book.

html2lrf --verbose creeds-002.html

and darned if the program didn't run.

so then I entered

html2lrf --verbose --page-break-before-tag='page-break' creeds-002.html

and it worked, so then I tried

html2lrf --verbose --blank-after-para creeds-002.html

and again the program worked so I tried

html2lrf --verbose --page-break-before-tag='page-break' creeds-000.html

and

html2lrf --verbose --blank-after-para creeds-000.html

thinking it might have been something in the original file that was the difficulty. Wasn't so. So I finally re-entered my original command

html2lrf --verbose --force-page-break-before-tag='page-break' --blank-after-para --base-font-size=8 creeds-000.html

And it now works as well. The long and the short of it is that since I used just the base command on a very simple file, I can no longer reproduce the problem.

kovidgoyal
03-09-2008, 08:23 PM
Hmm, chalk it up to a close encounter of the third kind.

balok
03-10-2008, 12:32 PM
Thanks, that was the leg up I needed. I have a bit more to do on the profile. When I am done is there anyway that I can integrate it into the GUI? I am so darn clumsy in typing! Being 74 with Parkinson's does make life a bit more complicate.

On the other hand the 'need' to learn yet another language is stimulating.

Deputy-Dawg, are you really 74? I've never met a person over 50 who can handle a computer beyond pointing and clicking with difficulty. You must have been a professor or an engineer during your working career.

Deputy-Dawg
03-10-2008, 01:42 PM
Deputy-Dawg, are you really 74? I've never met a person over 50 who can handle a computer beyond pointing and clicking with difficulty. You must have been a professor or an engineer during your working career.
Actually I am but 5 weeks from my 75th birthday. My degree is in chemistry (science, not engineering - there are no plumbers in my house!). But that was only because there were no degrees, indeed no courses, in computer science when I went to school. But while I was in college I developed a real love-hate relationship with computers. Being the nerd that I was, at graduation time I had sufficient credits that I could have had my degree in chemistry, which was my stated major; in mathematics, which I am convinced is probably true of the vast majority of people who study the hard sciences; or in philosophy. It was also the first year that student records had been committed to a computer, which decided on the Thursday before graduation (commencements were always on Sunday) that I had not completed my degree requirements. Perhaps I should say "we" had not completed them: because of the ability to choose between three degrees, it made the assumption that there were three of me. My mother was devastated! I was not thrilled either.

But the following Monday I went and saw my faculty advisor to find out what courses I needed to take to get my degree. She, with a devilish grin, told me that late on Sunday they realized that there had to be a bug in the program, in that although it might be barely possible to have three students in a class with the same name, there just was no way that three students, indeed even two, would ever have the same student ID number. She went further to say that one of my math professors, who was long into statistical analysis, was totally incapable of calculating the odds of three students with the same name and the same ID number.

Then she said, cutting to the chase, that I had graduated with the degree in chemistry since that had been my stated major when I had matriculated. Then she told me that she would be happy to enroll me in graduate school.

So, as I said, a love-hate relationship was created. Indeed I threatened on that fateful morn to take a fire axe and cut the computer's power cord into neat 6" lengths. So it continued for some forty years, at the end of which I had become the manager of the engineering computer center of a major US corporation. And there is another story of love and hate!!

Deputy-Dawg
03-10-2008, 08:02 PM
Kovid,
In the attached .zip file is the user profile for one of my local newspapers. It used to work. Now all it gets is the TOC - no articles. What is strange is that the print file addresses are still the same, and the error messages when I run it in the terminal do not contain anything that resembles the URL of the print files. I have enclosed a copy of one such run.

My question is: has the newspaper changed something, or has something changed in libprs500?

kovidgoyal
03-10-2008, 08:09 PM
You need to fix the print_version function; the way the feed links to articles seems to have changed.

Deputy-Dawg
03-10-2008, 08:47 PM
That's what I thought had happened, but the link to the print version of

http://www.nwaonline.net/articles/2008/03/10/news/031108lrcandidatefiling.txt

is

http://www.nwaonline.net/articles/2008/03/10/news/031108lrcandidatefiling.prt

which is what I would expect the function as written to return. The only difference I can see, if it is different (I am a bit hazy on how it behaved before), is that the print version opens in a new window. I don't think that's an issue, inasmuch as I have seen others where the print version opened in a new window. Darned if I can put my hands on one, though.

kovidgoyal
03-10-2008, 08:56 PM
The format of the feed itself has changed. Use:


url_search_order = ['link', 'guid']
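
As an illustration only, the idea behind a search order like this can be sketched as "try each field of the feed item in turn and take the first one that is filled in". The function below is not libprs500's implementation; pick_article_url, the dict representation of a feed item and the example.com values are all assumptions made for the sketch.

```python
def pick_article_url(item, url_search_order=('link', 'guid')):
    """Return the first non-empty field of a feed item, tried in the given order.

    `item` is assumed to be a plain dict of feed fields (hypothetical layout).
    """
    for key in url_search_order:
        value = item.get(key)
        if value:
            return value.strip()
    return None


# Hypothetical feed item whose <guid> is an opaque ID rather than a URL.
item = {
    'guid': 'tag:example.com,2008:1234',
    'link': 'http://www.example.com/articles/story.txt',
}

print(pick_article_url(item))                     # tries 'link' first
print(pick_article_url(item, ('guid', 'link')))   # tries 'guid' first
```

With 'guid' tried first, the opaque ID would be passed on as the article address, which is the kind of mismatch that makes every later fetch fail.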

Deputy-Dawg
03-10-2008, 09:31 PM
Thanks again! That fixed it. But... what sort of landmarks should I have been looking for in the source file if a similar problem occurs again? I guess what I am asking for is a more generalized solution.

kovidgoyal
03-10-2008, 09:40 PM
Well, the log has a bunch of error messages about not being able to fetch .prt URLs. That's your clue: it means either that the print_version function no longer works or that the feed format has changed, causing the URL being fed to print_version to be wrong. You can check that by sticking a print url statement into print_version.
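
A minimal sketch of that debugging step, written as a standalone function so it can be run outside a profile (in a real profile print_version is a method and takes self as its first argument). The body uses the rpartition approach suggested at the start of this thread; the message text is made up.

```python
def print_version(url):
    # Temporary debugging aid: show exactly what the feed handed us.
    print('print_version received:', url)
    # Swap whatever follows the last dot for 'prt'.
    return url.rpartition('.')[0] + '.prt'


article = 'http://www.nwaonline.net/articles/2008/03/10/news/031108lrcandidatefiling.txt'
printable = print_version(article)
print('print_version returned:', printable)
```

If the line printed for the incoming URL is not an article address at all (for example an ID string), the problem is in the feed, not in the replacement logic.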

Deputy-Dawg
03-10-2008, 10:21 PM
Great minds in the same gutter, well almost. What I did was to put

return url

in and checked the error log. A little sloppier, but it works. But by the time I came back to report what I had determined was going on, you had posted the fix. I suppose I should spend a bit of time taking an in-depth look at DefaultProfile to see just what other goodies are there. Again, thanks!

kovidgoyal
03-10-2008, 10:56 PM
You should probably hold off for a bit. I'm in the process of re-writing web2lrf to make it much more powerful.

balok
03-11-2008, 08:00 AM
I'm in the process of re-writing web2lrf to make it much more powerful.

What kind of changes, or new features, should we expect? Will it handle current custom profiles, or will they need to be rewritten?

kovidgoyal
03-11-2008, 10:34 AM
It will handle current profiles, but in any case the old web2lrf code will remain for a long time, so no need to worry.

It will be multithreaded, handle many different feed formats, and have a much more powerful and easy to use preprocessing engine, so you don't have to use regexps unless you want to. Eventually, it should be smart enough that if you give it just the URL to a feed, it will go and fetch a reasonably sanitized version of the articles.

EDIT: Oh and I forgot that it will have links at the end of each article back to the table of contents

balok
03-12-2008, 07:17 AM
It will handle current profiles, but in any case the old web2lrf code will remain for a long time, so no need to worry.

It will be multithreaded, handle many different feed formats, and have a much more powerful and easy to use preprocessing engine, so you don't have to use regexps unless you want to. Eventually, it should be smart enough that if you give it just the URL to a feed, it will go and fetch a reasonably sanitized version of the articles.

EDIT: Oh and I forgot that it will have links at the end of each article back to the table of contents

All of that sounds really cool. A link to the table of contents, in particular, seems like a no brainer, but I never thought of it. It would be nice if the link would bring you to the contents of the current rss feed (and not the first level table of contents). That way if you're reading say international news, you can stay in that section.

kovidgoyal
03-12-2008, 11:30 AM
All of that sounds really cool. A link to the table of contents, in particular, seems like a no brainer, but I never thought of it. It would be nice if the link would bring you to the contents of the current rss feed (and not the first level table of contents). That way if you're reading say international news, you can stay in that section.

There are "up one level", "up two levels", "next" and "previous" links.

DaleDe
03-19-2008, 01:08 PM
Deputy-Dawg, are you really 74? I've never met a person over 50 who can handle a computer beyond pointing and clicking with difficulty. You must have been a professor or an engineer during your working career.

You need to get out more.

dale

Necator
05-02-2008, 02:06 AM
Hi, I am having some difficulty with
1. making libprs500 see the printable version URL correctly
2. removing the tables.
I would appreciate it if you could guide me.

1.
Article URL : http://www.radikal.com.tr/haber.php?haberno=XXXXX
Printable URL: http://www.radikal.com.tr/yazici.php?haberno=XXXXX

I tried using this:
def print_version (self, url):
    return url.replace ('http://www.radikal.com.tr/haber.php?haberno=', 'http://www.radikal.com.tr/yazici.php?haberno=')

However, it still downloads content from the Article URL.

2. The article page has 3 rows of tables and I want the one in the middle.
Here is an example of the article: "http://www.radikal.com.tr/haber.php?haberno=253962"

I copied some lines from The New York Times recipe and added --ignore-tables; unfortunately it did no good:
html_description = True
html2lrf_options = ['--ignore-tables']
remove_tags_before = dict(name='img' , attrs='src')
remove_tags_after = dict(id='footer')
remove_tags = [dict(attrs={'class':['articleTools', 'post-tools', 'side_tool']}),
               dict(id=['footer', 'table', 'navigation', 'archive', 'side_search', 'blog_sidebar', 'side_tool', 'side_index']),
               dict(name=['script', 'noscript'])]

What is it that I am doing wrong? Thanks.

Necator
05-02-2008, 02:26 AM
Hi, although I am a newbie, I have jumped into the Python language in order to read my local newspaper. And as expected, I need some advice :)

1. I failed to point libprs500 at the print version URL, so the content comes from the Article URL.

Article URL: http://www.radikal.com.tr/haber.php?haberno=253962
Print version URL: http://www.radikal.com.tr/yazici.php?haberno=253962

I tried this, which failed:
def print_version (self, url):
    return url.replace ('http://www.radikal.com.tr/haber.php?haberno=', 'http://www.radikal.com.tr/yazici.php?haberno=')

2. So I get the feed from the article page, and to get the main news body from the HTML I removed the tables, but this time I cannot cut the news body from the rest of the page. I copied the recipe from the manual (The New York Times), which again ended in failure:
html_description = True
html2lrf_options = ['--ignore-tables']
remove_tags_before = dict(name='img' , attrs='src')
remove_tags_after = dict(id='footer')
remove_tags = [dict(attrs={'class':['articleTools', 'post-tools', 'side_tool']}),
               dict(id=['footer', 'table', 'navigation', 'archive', 'side_search', 'blog_sidebar', 'side_tool', 'side_index']),
               dict(name=['script', 'noscript'])]

What is it that I am doing wrong? Please guide me. Thanks anyway.

kovidgoyal
05-02-2008, 05:54 AM
To get the print version just use


return url.replace('haber.php', 'yazici.php')
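
A quick way to check a replacement like this outside the profile is to run it on the example URLs from the posts above. The sketch below is a standalone function (in the profile it would be a method taking self). One side benefit, offered as an observation rather than a diagnosis: matching only the script name does not depend on whether the host is written with or without "www.".

```python
def print_version(url):
    # Swap the article script name for the printer-friendly one.
    return url.replace('haber.php', 'yazici.php')


with_www = 'http://www.radikal.com.tr/haber.php?haberno=253962'
without_www = 'http://radikal.com.tr/haber.php?haberno=254668'

print(print_version(with_www))
print(print_version(without_www))
```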

Necator
05-02-2008, 07:29 AM
Sorry, still getting the text from Article URL.
" This article is downloaded by Libprs500 from http://radikal.com.tr/haber.php?haberno=254668"


Here is my full recipe if it helps:
title = u'Radikal Gazetesi'
oldest_article = 1
max_articles_per_feed = 15
no_stylesheets = True
extra_css = 'h1 {font: sans-serif large;}\n.byline {font:monospace;}'
html_description = True
html2lrf_options = ['--ignore-tables']
remove_tags_before = dict(name='img' , attrs='src')
remove_tags_after = dict(id='copy')

feeds = [(u'Kose', u'http://www.radikal.com.tr/radikal_yazar.xml')]

def print_version (self, url):
    return url.replace ('haber.php', 'yazici.php')

Btw, I am not sure if it matters, but the print version URL is:
http://www.radikal.com.tr/yazici.php?haberno=254484&tarih=01/05/2008&yollayan_sayfa='http%3A%2F%2Fwww.radikal.com.tr%2Fhaber.php%3Fhaberno%3D254484'
The Turkish words in it mean:
1. yazici = print
2. tarih = date
3. yollayan_sayfa = sending_page

kovidgoyal
05-02-2008, 07:41 AM
You have to indent print_version so it is a part of the class. See attached.
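
Why the indentation matters can be shown with a small, self-contained sketch. BaseProfile here is a hypothetical stand-in for the real profile base class, not libprs500's code; the point is only that a def written at the left margin is a module-level function that the class never sees, while an indented def overrides the default method.

```python
class BaseProfile:
    """Hypothetical stand-in for the real profile base class."""

    def print_version(self, url):
        return url  # default: use the article URL unchanged

    def fetch_url(self, url):
        # The framework always goes through the method on the class.
        return self.print_version(url)


class Broken(BaseProfile):
    title = 'Radikal Gazetesi'

# Not indented: this ends the class body above and defines a plain
# module-level function, so Broken still uses the default method.
def print_version(self, url):
    return url.replace('haber.php', 'yazici.php')


class Fixed(BaseProfile):
    title = 'Radikal Gazetesi'

    # Indented: part of the class, so it overrides the default.
    def print_version(self, url):
        return url.replace('haber.php', 'yazici.php')


article = 'http://www.radikal.com.tr/haber.php?haberno=254668'
print(Broken().fetch_url(article))
print(Fixed().fetch_url(article))
```

No error is raised in the broken case, which is why the only symptom is that the article URL keeps being downloaded.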

Necator
05-02-2008, 10:44 AM
Sorry, but I can't open the file. I tried opening it with Notepad and DzSoft. What should I do?

kovidgoyal
05-02-2008, 10:45 AM
use notepad++ (google it)

Necator
05-02-2008, 10:49 AM
Yep, just got it. I tried WinRAR and extracted "test.py". WinRAR didn't recognize it automatically. Thank you....

And..
It's alive!! Thank you so much.