View Full Version : Help writing profile to get RSS feed
Deputy-Dawg 01-19-2008, 05:21 PM I am in the throes of learning to program in Python. I have very nearly completed a profile to capture the RSS feed of my local newspaper. I am having a problem returning the print versions of the feeds. I know that there is a corresponding print format for each article.
Each article has the format:
http://www.nwaonline.net/articles/2008/01/19/news/011908arboozmanxna.txt
The corresponding article has the format:
http://www.nwaonline.net/articles/2008/01/19/news/011908arboozmanxna.prt
i.e., I need only replace the extension .txt with the extension .prt.
But try as I may, I just can't seem to do it. Clearly I have a blind spot. Can anyone please help?
kovidgoyal 01-19-2008, 06:24 PM url = original_url.rpartition('.')[0] + '.prt'
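A minimal sketch of that one-liner applied to the URLs from the first post. The function name to_print_url is made up for illustration; in a profile the same expression would presumably go wherever the article URL is rewritten.

```python
def to_print_url(original_url):
    # rpartition('.') splits at the LAST dot: (head, '.', extension)
    head, dot, ext = original_url.rpartition('.')
    # If there is no dot at all, head is empty, so return the URL unchanged
    if not dot:
        return original_url
    return head + '.prt'


url = 'http://www.nwaonline.net/articles/2008/01/19/news/011908arboozmanxna.txt'
print(to_print_url(url))
```

Because the split happens at the last dot, the dots in the host name are left alone.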
Deputy-Dawg 01-20-2008, 03:25 PM Thanks, that was the leg up I needed. I have a bit more to do on the profile. When I am done, is there any way that I can integrate it into the GUI? I am so darn clumsy in typing! Being 74 with Parkinson's does make life a bit more complicated.
On the other hand the 'need' to learn yet another language is stimulating.
kovidgoyal 01-20-2008, 03:28 PM Not at the moment, it's on my TODO list. And wow, I hope I'm capable of learning a new language at 74!
In the meantime, if you post the profile here, I'll add it to the GUI so that it will be available in the next release of libprs500.
Deputy-Dawg 01-20-2008, 04:25 PM I've attached the one that I have working currently. There are still a couple of gotchas, including how to add some of their other feeds (aside from hard coding, that is) and what the optimum number of files to download is.
That being said, I am now trying to create code for the other major newspaper in the area, "The Arkansas Democrat Gazette". They use one strange site for their RSS feed. When you access it from their RSS information page
http://www.nwanews.com/feeds/
by clicking on the link 'NWAnews.com (all daily "News" sections)' it takes you to
feed://feeds.feedburner.com/nwanewsall
and, of course, web2lrf does not recognize a URL beginning with 'feed:'. If you manually enter the address in the address window of Safari you get there, and if you enter
http://feeds.feedburner.com/nwanewsall
you are redirected. But neither approach seems to work with web2lrf.
kovidgoyal 01-20-2008, 04:29 PM You can just have the get_feeds function return the feed URL like this:
def get_feeds(self):
    return [('NWANews', 'http://feeds.feedburner.com/nwanewsall')]
Deputy-Dawg 01-21-2008, 01:51 PM Thanks, again...
I am appending a newer version of the profile to get the Morning News. Much to my surprise, a number of the print files contain references to images, which web2lrf was resolving and making a bit of a mess of the files. I have added a line of code which seems to have fixed the problem.
The profile for the Democrat Gazette is another thing. The call to the file (the one that would be displayed on your monitor with all the ads and other BS - the url in the "href=" statement) is in the form of:
http://feeds.feedburner.com/~r/nwanewsall/~3/219845886/
which is somewhere resolved to:
http://www.nwanews.com/adg/News/214246/
and I of course want:
http://www.nwanews.com/adg/News/214246/print/
but if you append 'print/' to the originally called url giving you:
http://feeds.feedburner.com/~r/nwanewsall/~3/219845886/print/
it too is resolved to:
http://www.nwanews.com/adg/News/214246/
and although the desired URL is embedded in the first called file, I have yet to come up with code that will extract it without harming the print file. (This is because the print file and the web file are structurally identical in the area in which we are interested.)
If you have a moment, take a look and see if you can suggest an approach. Also I should note that to even begin to attempt to extract and use the URL from the display file it is necessary to increase the amount of recursion to 3, which introduces its own set of difficulties.
Sigh!!!! Programming is such fun
Deputy-Dawg 01-21-2008, 07:48 PM No need to respond to the last question. I found a source for the desired URLs in the document. Sometimes you really do have to read the code quite literally. In any event, here is a profile for the Arkansas Democrat Gazette and several wholly owned subsidiaries.
Again, thanks. Once I got a feel for the syntax being used it made climbing on to that new bike a bit easier. Now I have to learn to deal with the editor (or get a new one; I am using BBEdit 8.7). Sometimes, indeed more often than not, Python will complain about an indent error even when there is none by visual inspection of the code and by checking BBEdit's format checker. The only fix seems to be to delete the offending code and re-enter it. I am sure this can be automated. I just have not figured it out as yet.
Deputy-Dawg 01-22-2008, 10:00 PM I am working on another profile and am running into a rather different problem, or at least think I am. The url that I need returned is:
http://www.fides.org/aree/news/newsdet.php?idnews=11302&lan=eng
When I invoke the profile I get the following message:
Macintosh-3:books billc$ web2lrf --verbose --user-profile Agenzia_Fides.py
[ERROR] __init__.pyo:210: Error parsing article:
<item rdf:about="http://www.fides.org/aree/news/newsdet.php?idnews=11302&lan=eng">
<dc:format>text/html</dc:format>
<dc:date>2008-01-21T14:00:00+01:00</dc:date>
<dc:source>http://www.fides.org</dc:source>
<dc:creator>Fides Service</dc:creator>
<title>VATICAN - The Pope's Angelus: “The Church's evangelising mission is part of her ecumenical path”; “I am bound to the university world by love for the quest for truth, for discussion, frank dialogue, respectful of reciprocal positions. All this is also part of the Church's mission ”</title>
<link>http://www.fides.org/aree/news/newsdet.php?idnews=11302&lan=eng</link>
<description><b>VATICAN - The Pope's Angelus: “The Church's evangelising mission is part of her ecumenical path”; “I am bound to the university world by love for the quest for truth, for discussion, frank dialogue, respectful of reciprocal positions. All this is also part of the Church's mission ”</b><br><br>
Vatican City (Agenzia Fides) - On Sunday 20 January the Holy Father Pope Benedict XVI dedicated his midday Angelus reflection to the issue of ecumenism, this being the Week of Prayer for Christian Unity, and to his planned and then cancelled visit...</description>
</item>
Traceback (most recent call last):
File "libprs500/ebooks/lrf/web/profiles/__init__.pyo", line 197, in parse_feeds
File "libprs500/ebooks/lrf/web/profiles/__init__.pyo", line 269, in strptime
KeyError: u'2008-01-21T14:00:0'
[ERROR] __init__.pyo:210: Error parsing article:
<item rdf:about="http://www.fides.org/aree/news/newsdet.php?idnews=11303&lan=eng">
<dc:format>text/html</dc:format>
<dc:date>2008-01-21T14:00:00+01:00</dc:date>
<dc:source>http://www.fides.org</dc:source>
<dc:creator>Fides Service</dc:creator>
<title>VATICAN - Pope Benedict XVI visits Capranica College: “Without friendship with Jesus it is impossible for a Christian, and even more so for a priest, to bring to completion the mission entrusted by the Lord ”</title>
<link>http://www.fides.org/aree/news/newsdet.php?idnews=11303&lan=eng</link>
<description><b>VATICAN - Pope Benedict XVI visits Capranica College: “Without friendship with Jesus it is impossible for a Christian, and even more so for a priest, to bring to completion the mission entrusted by the Lord ”</b><br><br>
Vatican City (Agenzia Fides) - “Under various circumstances I have reminded seminarians and priests of the urgency of nurturing a profound interior life, personal and continual contact with Christ in prayer and contemplation, and genuine striving for...</description>
</item>
The only line in the source file that contains anything that resembles the URL is:
<a href="http://www.fides.org/aree/news/newsdet.php?idnews=11302&amp;lan=eng">
which, if I am reading the error message correctly, web2lrf cannot parse. I suspect that the problem is in the '&amp;' representation of the '&' in the URL, and if that is the case I see no way that I can code anything in the profile to deal with it.
kovidgoyal 01-22-2008, 10:59 PM No, the problem is the weird date format
2008-01-21T14:00:00+01:00
The simple way to fix it is to set
use_pubdate = False
The more correct way to fix it is to override the strptime function:
def strptime(self, raw):
    return calendar.timegm(time.strptime('%Y-%m-%dT%H:%M:%S+01:00', raw))-3600
You might have to play with the above strptime to get it to parse the date correctly.
Deputy-Dawg 01-23-2008, 11:36 PM I have added the following to my profile:
import calendar
import time
def strptime(self, raw):
    return calendar.timegm(time.strptime('%Y-%m-%dT%H:&M:%S+01:00', raw))-3600
When I run the profile in web2lrf I get the following error message:
Traceback (most recent call last):
File "libprs500/ebooks/lrf/web/profiles/__init__.pyo", line 197, in parse_feeds
File "/Users/billc/Desktop/Books/ag.py", line 34, in strptime
return calendar.timegm(time.strptime('%Y-%m-%dT%H:&M:%S+01:00', raw))-3600
File "_strptime.pyo", line 331, in strptime
ValueError: time data did not match format: data=%Y-%m-%dT%H:&M:%S+01:00 fmt=2008-01-21T14:00:00+01:00
To validate the code I inserted it into a profile (nwa2.py) which I knew worked and ran it and, of course, it failed with a similar error message (i.e., about the formats not matching). I then altered the string to match the one given, using the symbols from Python's documentation, and lo...... it works.
Finally I added
use_pubdate = False
and that too works. There is an error in the string, but I sure don't see it! Is there any debug code that would permit me to look at the parameters and data that is being passed? As I read the code the string should match
%Y = Decimal year with century prepended
%m = Decimal month
%d = Decimal day
%H = Decimal Hour (24 hour notation)
%M = Decimal Minutes
%S = Decimal Seconds
The remaining characters (i.e., within the quotes) "-", ":", "T", "1", "0", "2", "4", "8" represent themselves.
But it does not.
BTW the only way to get the profile Dem_Gaz.py to run is to use use_pubdate = False because, apparently, the files have no publication date, or that is what the error message says.
Got to go to bed. Work on it some more tomorrow.
kovidgoyal 01-24-2008, 01:56 AM &M should be %M in the format string
Incidentally the next release of libprs500 will have the ability to add user created profiles to the GUI (it's already implemented in svn).
Deputy-Dawg 01-24-2008, 08:58 AM Yes, it should be. And it was in the original file. I retyped it and made a typo. That being said, when the correct string is used (I hope I typed it correctly this morning) I still get the following error message:
[ERROR] __init__.pyo:210: Error parsing article:
<item rdf:about="http://www.fides.org/aree/news/newsdet.php?idnews=11338&lan=eng">
<dc:format>text/html</dc:format>
<dc:date>2008-01-22T14:00:00+01:00</dc:date>
<dc:source>http://www.fides.org</dc:source>
<dc:creator>Fides Service</dc:creator>
<title>ASIA/HOLY LAND - Caritas Jerusalem: calls for an end to humanitarian crisis in Gaza and assistance for Palestinian children</title>
<link>http://www.fides.org/aree/news/newsdet.php?idnews=11338&lan=eng</link>
<description><b>ASIA/HOLY LAND - Caritas Jerusalem: calls for an end to humanitarian crisis in Gaza and assistance for Palestinian children</b><br><br>
Jerusalem (Agenzia Fides) - Caritas Jerusalem has called for the block of persons and goods which is causing the humanitarian crisis in Gaza to be lifted. It joined major international humanitarian organisations in warning of a serious human and soci...</description>
</item>
Traceback (most recent call last):
File "libprs500/ebooks/lrf/web/profiles/__init__.pyo", line 197, in parse_feeds
File "/Users/billc/Desktop/Books/ag.py", line 34, in strptime
return calendar.timegm(time.strptime('%Y-%m-%dT%H:%M:%S+01:00', raw))-3600
File "_strptime.pyo", line 331, in strptime
ValueError: time data did not match format: data=%Y-%m-%dT%H:%M:%S+01:00 fmt=2008-01-22T14:00:00+01:00
I have examined the value in the line:
<dc:date>2008-01-22T14:00:00+01:00</dc:date>
in a hex editor to see if there were any "strange" characters in it. There are none. I assume that this is the value that is being passed to strptime. If that is the case I don't understand what is not being matched.
kovidgoyal 01-24-2008, 11:18 AM Oops, my mistake. The arguments were swapped; it should be
time.strptime(raw, '%Y-%m-%dT%H:%M:%S+01:00')
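For reference, a standalone sketch of the corrected call with the arguments in the right order (data string first, format second). It goes one step beyond the posts above by reading whatever +HH:MM or -HH:MM offset the stamp carries instead of hard coding +01:00. The name parse_dc_date is made up for illustration and is not part of libprs500; inside a profile the body would presumably go in the overridden strptime method.

```python
import calendar
import time


def parse_dc_date(raw):
    """Return seconds since the Unix epoch (UTC) for e.g. '2008-01-21T14:00:00+01:00'."""
    raw = raw.strip()
    # First 19 characters are the date and time; the rest is the UTC offset
    stamp, sign, offset = raw[:19], raw[19:20], raw[20:]
    # time.strptime takes the data string first and the format second
    parsed = time.strptime(stamp, '%Y-%m-%dT%H:%M:%S')
    seconds = calendar.timegm(parsed)  # interprets the fields as UTC
    if sign in ('+', '-') and offset:
        hours, minutes = offset.split(':')
        delta = int(hours) * 3600 + int(minutes) * 60
        # A '+01:00' stamp is one hour ahead of UTC, so subtract the offset
        seconds -= delta if sign == '+' else -delta
    return seconds


print(parse_dc_date('2008-01-21T14:00:00+01:00'))
```

With the arguments swapped, as in the earlier tracebacks, strptime tries to read the format string as if it were the date, which is why the error message shows the two values the wrong way round.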
The Old Man 01-25-2008, 08:49 AM Well, I have been reading this thread and I have learned one thing.
I will never be able to learn how to add feeds. - My fault, not yours.
Any chance of adding a feed from the Jerusalem Post http://www.jpost.com/
to the next version of libprs500?
Thanks
kovidgoyal 01-25-2008, 09:46 AM All feed requests should go here
https://libprs500.kovidgoyal.net/ticket/405
The Old Man 01-25-2008, 12:45 PM If I knew how to use TrackTickets I probably wouldn't have to.:blink:
kovidgoyal 01-25-2008, 12:46 PM Just register an account at https://libprs500.kovidgoyal.net/register then login and go to the ticket site, it will let you add a comment. Add a comment with your request.
The Old Man 01-25-2008, 01:03 PM Well, I did it. Not sure what I did - but I did it.:chinscratch:
kovidgoyal 01-25-2008, 01:28 PM Now you just have to wait for some kindly soul to write the profile for you :)
The Old Man 01-25-2008, 03:01 PM Yes. I wonder who? :xmas:
kovidgoyal 01-25-2008, 03:21 PM It isn't going to be me :) I prefer to work on the infrastructure of libprs500 and only add feeds if I want to use them. But there have been several people who have expressed an interest in writing feeds, so hopefully one of them is interested in Middle East news. :fingersx:
Deputy-Dawg 01-25-2008, 03:54 PM The Old Man,
You didn't have to wait long; attached is a quick and dirty that will download the first 10 articles in the following Jerusalem Post feed:
Front Page
Israel News
International News
Middle East News
Editorials
kovidgoyal
The last bit of code fixed up the problem with pubdate in the profile for Agenzia Fides.
I still am having some problems with how the summary is being displayed (cosmetic but ugly - various HTML tags are being displayed, most notably <b></b> and <br>).
Meanwhile I have started on one for the Christian Science Monitor. And they have one wild way of directing you to the files. The href (and, later on, a <link></link>) points you to:
http://rss.csmonitor.com/~r/feeds/top/~3/222417173/p04s01-woaf.html
which resolves to
http://www.csmonitor.com/2008/0124/p04s01-woaf.html
with the print version being at
http://www.csmonitor.com/2008/0124/p04s01-woaf.htm
The rub is that if you change the original address to
http://rss.csmonitor.com/~r/feeds/top/~3/222417173/p04s01-woaf.htm
it too resolves to the .html file.
At first I thought this was going to be an easy one: the date is in the number 222417173, so all we have to do is convert it to an ASCII date, parse out the /2008/0124/ as '/%Y/%m%d/' and build the required address string. It doesn't work; the number resolves to 1977 01 18. I can fix it by adding 2001 01 07 as an offset (that may have to be 06). Is that likely to be legitimate? Have I overlooked something?
The Christian Science Monitor also does not return a valid pubdate, and unless you set use_pubdate = False you go nowhere. However, in examining the source for the feed there always seem to be two date entries for each article
articlesortdate="0222880260.000000"
articlelocaldate="0222885964.644872"
which seem to be the epoch dates of the files. Would it not be possible to capture either or both? Can I get at them in my profiles? I am a bit unsure what declarations would have to be made.
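A note on those two values, offered as an assumption rather than something confirmed in this thread: the numbers fit Apple's reference date, which counts seconds from 2001-01-01 00:00:00 UTC rather than from 1970. That would explain why reading them as ordinary Unix time lands in 1977. The helper name apple_date below is made up for illustration, and nothing here implies that a web2lrf profile can see these attributes.

```python
from datetime import datetime, timedelta

# ASSUMPTION: the values count seconds from 2001-01-01 00:00:00 UTC
APPLE_EPOCH = datetime(2001, 1, 1)


def apple_date(value):
    # float() accepts the zero-padded string form used in the page source
    return APPLE_EPOCH + timedelta(seconds=float(value))


print(apple_date("0222880260.000000"))
```

Under that assumption the articlesortdate quoted above falls on 24 January 2008, which matches the /2008/0124/ in the article URLs.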
The Old Man 01-25-2008, 04:22 PM Thank you. Now I will attempt to use it. Wish me luck.:thumbsup:
kovidgoyal 01-25-2008, 04:43 PM @Deputy-Dawg
Why not let the Christian Science Monitor's servers figure out the date mapping for you? Here's some code that should do just that:
def print_version(self, url):
    resolved_url = self.browser.open(url).geturl()
    return resolved_url.strip()[:-1]
It's a little slow as it involves going out to the network, but it's reliable.
As for article date, I'm afraid there isn't any way to access that short of re-implementing the parse_feeds function.
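The slicing step in print_version above can be tried on its own, without the network. This is only a sketch of the string handling: [:-1] drops the final "l" so .html becomes .htm. The function name html_to_htm and the guard for URLs that do not end in .html are additions for illustration, not part of the code posted above.

```python
def html_to_htm(resolved_url):
    resolved_url = resolved_url.strip()
    # Only trim when the URL really ends in .html; otherwise leave it alone
    if resolved_url.endswith('.html'):
        return resolved_url[:-1]
    return resolved_url


print(html_to_htm('http://www.csmonitor.com/2008/0124/p04s01-woaf.html'))
```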
Deputy-Dawg 01-25-2008, 10:27 PM I am attaching a copy of the profile for the Christian Science Monitor. I am having a problem that you may have to see to understand. For reference, every article in the feed has a structure like this:
<div class="apple-rss-article apple-rss-read" onclick="javascript:handleArticleClick(this)" showSeparator="true"
articlesortdate="0223013377.017225" articlesorttitle="gaza busts out of its blockade" articlesortsource="" sourceindex="0" articlesortid="00000000000000000010" articlelocaldate="0223013377.017225" articleid="a91c09df43f4cf6a33ffed73cecf111efe81204a">
<div class="apple-rss-article-footer"></div>
<div class="apple-rss-article-head" >
<div class="apple-rss-unread-dot"><img src="file://localhost/System/Library/Frameworks/PubSub.framework/Versions/A/Resources/PubSubAgent.app/Contents/Resources/unread.tif" width="9" height="9" /></div>
<div class="apple-rss-subject" title="Gaza busts out of its blockade"><a href="http://rss.csmonitor.com/~r/feeds/top/~3/222417168/p01s04-wome.html">Gaza busts out of its blockade</a></a></div>
<div class="apple-rss-summary" >A new hole opens in the Arab-Israeli peace strategy of isolating Hamas.</div>
<div class="apple-rss-date" title="Today, 10:09 PM">Today, 10:09 PM</div>
</div>
<div class="apple-rss-article-body-container">
<div class="apple-rss-article-body">
A new hole opens in the Arab-Israeli peace strategy of isolating Hamas.
<p><a href="http://rss.csmonitor.com/~a/feeds/top?a=rt0NVe"><img src="http://rss.csmonitor.com/~a/feeds/top?i=rt0NVe" border="0" /></a></p>
<div class="feedflare"><a href="http://rss.csmonitor.com/~f/feeds/top?a=7LSTtWD"><img src="http://rss.csmonitor.com/~f/feeds/top?i=7LSTtWD" border="0" /></a> <a href="http://rss.csmonitor.com/~f/feeds/top?a=bYiAxtD"><img src="http://rss.csmonitor.com/~f/feeds/top?i=bYiAxtD" border="0" /></a> <a href="http://rss.csmonitor.com/~f/feeds/top?a=ISh8dED"><img src="http://rss.csmonitor.com/~f/feeds/top?i=ISh8dED" border="0" /></a> <a href="http://rss.csmonitor.com/~f/feeds/top?a=FL3bvEd"><img src="http://rss.csmonitor.com/~f/feeds/top?i=FL3bvEd" border="0" /></a></div>
<img src="http://rss.csmonitor.com/~r/feeds/top/~4/222417168" height="1" width="1" />
<a class="apple-rss-article-link" href="http://rss.csmonitor.com/~r/feeds/top/~3/222417168/p01s04-wome.html">Read more…</a>
<!-- end articlebody --></div></div>
<!-- end article --></div>
The entire block:
A new hole opens in the Arab-Israeli peace strategy of isolating Hamas.
<p><a href="http://rss.csmonitor.com/~a/feeds/top?a=rt0NVe"><img src="http://rss.csmonitor.com/~a/feeds/top?i=rt0NVe" border="0" /></a></p>
<div class="feedflare"><a href="http://rss.csmonitor.com/~f/feeds/top?a=7LSTtWD"><img src="http://rss.csmonitor.com/~f/feeds/top?i=7LSTtWD" border="0" /></a> <a href="http://rss.csmonitor.com/~f/feeds/top?a=bYiAxtD"><img src="http://rss.csmonitor.com/~f/feeds/top?i=bYiAxtD" border="0" /></a> <a href="http://rss.csmonitor.com/~f/feeds/top?a=ISh8dED"><img src="http://rss.csmonitor.com/~f/feeds/top?i=ISh8dED" border="0" /></a> <a href="http://rss.csmonitor.com/~f/feeds/top?a=FL3bvEd"><img src="http://rss.csmonitor.com/~f/feeds/top?i=FL3bvEd" border="0" /></a></div>
<img src="http://rss.csmonitor.com/~r/feeds/top/~4/222417168" height="1" width="1" />
is being used as a summary in the contents page. I have tried many forms in the preprocess_regexps section, to no avail. I also tried setting summary_length = 0 (and 100 on the off chance it did not accept 0 as an argument) and again there was no effect. Of course the profile is usable, but the output is ugly as sin!
Finally is it possible to embed an HTML option in the profile? Specifically the --ignore-tables, again it is only for cosmetic effects.
kovidgoyal 01-25-2008, 10:34 PM Set
html_description = True
html2lrf_options = ['--ignore-tables']
Lemoine 01-26-2008, 03:44 PM Hello,
Using this wonderful program (thanks a lot, Kovid!), I have tried to add support for "Le Monde", a French newspaper. It was working pretty well, but yesterday they changed both their structure and encoding, switching from UTF-8 to ISO-8859-1.
Now my new profile captures the articles, but with weird encoding.
If I add to the regex, for instance,
<head><meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"></head>
my characters are correct, but all the crap is not stripped from the articles.
Here is my profile
I would be very grateful for your help...:)
kovidgoyal 01-26-2008, 03:59 PM Not sure what you mean. I tried it and it works fine for me. See attached LRF
Lemoine 01-26-2008, 04:17 PM This file seems to be fine, but some French letters such as "à, ê, ù..." are not correctly displayed.
à for instance becomes r ...
That is my problem, which appears only in the articles, not in the index and the abstracts.
kovidgoyal 01-26-2008, 04:18 PM Ah ok I didn't see that because I don't know French. Can you give me some easy to recognize sentence I can use to test things?
Lemoine 01-26-2008, 04:32 PM In the international section
Article:
"Londres est prêt à oeuvrer à un désarmement nucléraire total"
The correct title is:
"Lemonde.fr:Londres est prêt à oeuvrer à un désarmement nucléaire total - Europe"
Thanks a lot for your help!
kovidgoyal 01-26-2008, 04:55 PM Try the attached
Lemoine 01-26-2008, 05:09 PM Awesome!
Thanks a lot for the program and for the help!
Deputy-Dawg 01-26-2008, 10:41 PM From DefaultProfile
timefmt = ' [%a %d %b %Y]' # The format of the date shown on the first page
url_search_order = ['guid', 'link'] # THe order of elements to search for a URL when parssing the RSS feed
pubdate_fmt = None # The format string used to parse the
Which would imply that only the elements 'link' and 'guid' are searched for the link. This is borne out by the fact that when you process the feed from the Denver Post with
use_pubdate = False
you get the error message
Skipping article as it does not have a link url
In the source for the feed, the following code appears for each article:
<li class="regularitem" xmlns:dc="http://purl.org/dc/elements/1.1/">
<h4 class="itemtitle">
<a href="http://www.denverpost.com/ci_8088727">
Man hit in crosswalk, killed
</a>
</h4>
<h5 class="itemposttime">
<span>Posted: </span>
Sat, 26 Jan 2008 20:09:37 -0700
</h5>
<div class="itemcontent" name="decodeable">
A 22-year-old Denver resident was killed in Aurora Saturday when a 71-year-old man driving a pickup ran a red light on South Parker Road, then veered into a crosswalk.
</div>
</li>
The URL for the article is only contained in the class itemtitle.
Similarly, in the feeds from Izvestia the URL is only contained in the classes
mainnewstime and mainnewsnotice
and at that only the variable part of the link, in the form:
/world/asia/20080127/97803220.html
which has to be concatenated with http://www.rian.ru to obtain the fully qualified address.
Is it possible to handle either of these cases in web2lrf?
BTW a profile runs much faster in the Terminal than when embedded in libprs500. Also, I have found that if I attempt to run more than about 3 profiles sequentially libprs500 crashes. I can get around the problem by quitting and restarting. No need to remove the previously captured feeds.
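On the concatenation step mentioned in the post above, a small sketch using the standard library rather than plain string addition. It uses the Python 3 module name; the Python 2 of the time had the same function in the urlparse module. The helper name absolute_url is made up for illustration, and where such a call would go inside a profile is left open.

```python
from urllib.parse import urljoin

BASE = 'http://www.rian.ru'


def absolute_url(path):
    # urljoin copes with a leading slash and with links that are already absolute
    return urljoin(BASE, path)


print(absolute_url('/world/asia/20080127/97803220.html'))
```

A link that is already fully qualified passes through unchanged, so the helper is safe to apply to every link found.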
kovidgoyal 01-27-2008, 12:23 PM Not sure I follow. The Denver Post, for example, has its links in both <link> and <guid> elements; see for example http://feeds.feedburner.com/dp-news-national?format=xml
The problem is that the links are embedded in a CDATA section. So you should write print_version to handle that.
The GUI crashing should be fixed in the next release.
EDIT: Actually, since all the elements in that feed are CDATA escaped, you're going to have to wait for the next release of libprs500 to create a feed for the Denver Post.
Deputy-Dawg 01-27-2008, 01:19 PM Sigh! What it means is that the Denver Post is offering what it characterizes as RSS feeds at no fewer than 3 different URLs and in at least two different formats. The first one, the one I was questioning, is at:
http://feeds.feedburner.com/dp-news-national
The xml feed which can be accessed from the link on the page I was using, and the one you found:
feed://feeds.feedburner.com/dp-news-national?format=xml
and finally one that you will be sent to if you click on the blue "RSS" link in your browser's address box and follow your nose to the XML format of the RSS feed
feed://feeds.denverpost.com/dp-news-national?format=xml
which is pointing, I think, to the same page as the previous one.
I guess I will have to wait until the new version of DefaultProfile is available to work on it any further. This is a rather interesting one to work on because it has no printer-friendly version available, and to be useful at all it will be necessary to code a preprocess_regexps that will strip out the nasty bits, leaving only the story. Looks like fun. But.....
As for Izvestia, I think it will be necessary to teach DefaultProfile to work with Russian syntax and the Cyrillic alphabet. I could open a ticket, but I suspect it would be a warm day in Siberia before it would happen.
Thanks again for all of the assistance. BTW just as an example of Russian syntax (in English) try this one on for size!
<div class="mainrubric">
or another (in russian, in cyrillic)
<!-- /список новостей -->
And I speak every language except Greek! But that is greek to me! :joker:
Deputy-Dawg 01-28-2008, 07:04 PM Attached is a profile to capture several of the feeds from Reuters. This proved to be fairly interesting to write. First of all, the URL returned by web2lrf on this service only contained the file id. Took me a while to figure it out. Also, they do not put up a file that is printer friendly, so it was necessary to create code that would parse out the text from the display page. It was doable, but it causes the program to be quite slow and apparently is quite CPU intensive. At least the cooling fan in my MacBook Pro runs quite a bit more than would be its usual wont.
In any event, enjoy.
kovidgoyal 01-28-2008, 07:38 PM They do seem to provide a print version for example given the id
USN2740109620080129
The print version of the article is at
http://www.reuters.com/articlePrint?articleId=USN2740109620080129
Also, since you're writing a lot of feeds, can I ask you to attach them to
https://libprs500.kovidgoyal.net/wiki/UserProfiles so other people can find them easily (I'll pick them up for inclusion from there, when I get the time). You will need to create an account https://libprs500.kovidgoyal.net/register and log in https://libprs500.kovidgoyal.net/login before being able to edit the Wiki page. Thanks.
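As a sketch of the Reuters suggestion above: building the print URL from a bare article id. The function name reuters_print_url is made up for illustration. The snippet only constructs the URL; it makes no claim about what the server actually returns for it.

```python
from urllib.parse import urlencode

PRINT_BASE = 'http://www.reuters.com/articlePrint'


def reuters_print_url(article_id):
    # urlencode escapes the id in case it ever contains unusual characters
    return PRINT_BASE + '?' + urlencode({'articleId': article_id})


print(reuters_print_url('USN2740109620080129'))
```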
Deputy-Dawg 01-28-2008, 08:56 PM Be happy to!! Already have an account, just was not sure where to put them. I am working on one for the AP. It seems to work fine except that the TOC points to the end of the article, not the beginning.
I too thought that Reuters had a print version available, but when you go to the URL you posted you don't in fact go to the print page but back to the display page. I don't know why but...
I am attaching a copy of the AP profile to this message.
Deputy-Dawg 01-29-2008, 04:11 PM Kovidgoyal,
This is weird. I did a run with my AP profile using the --keep-downloaded-files option. I then took the kept files and moved them into the normal user space while preserving the relative path lengths. I then examined them in BBEdit, GoLive and Safari and did not find anything wrong in the code or in the appearance of the files. Finally I converted them to an LRF using the html2lrf function, and the TOC points to the end of the stories, not the beginning. What have I done wrong? Or better yet, how do I fix it?
kovidgoyal 01-29-2008, 04:18 PM Probably a bug in html2lrf, open a bug report and I'll look at it when I get time.
Deputy-Dawg 01-31-2008, 07:42 AM I have just uploaded new copies of the Reuters profile and the AP profile. Reuters to correct a minor typo. The AP profile is a full working version
Lemoine 02-24-2008, 01:00 PM Hello,
The description of each article in the rss feeds disappeared a week ago.
Is it a new feature? is it a bug?
I miss very much those descriptions and i wonder how i can retrieve them.
Anyone has an idea?
Thanks in advance
:thanks:
kovidgoyal 02-24-2008, 01:01 PM In which feed?
Lemoine 02-24-2008, 02:28 PM In all feeds i can use...
For instance Newsweek or The New Yorker, but all the feeds are touched by this problem.
Title and Pubdate simply appear, not Description
kovidgoyal 02-24-2008, 03:32 PM Well Newsweek has started using the description tag for ads
There was a bug affecting the others, which will be fixed in the next release.
Lemoine 02-24-2008, 03:39 PM Thank you Kovid for all your work....:)
Deputy-Dawg 02-24-2008, 08:10 PM Kovid,
All that I have checked, even my custom ones. It appears to be a "feature" of web2lrf, because I get the same result when I run it from the terminal.
BTW do you have any experience in installing and using GutenMark on an Intel Mac? I can't get it to run on mine. I am pretty sure that Perl is operative inasmuch as Tidy does work.
kovidgoyal 02-24-2008, 08:17 PM No I don't use Macs.
Deputy-Dawg 03-09-2008, 02:45 PM Kovid,
Is it possible to use the web2lrf function to capture feeds like:
http://www.bloomberg.com/news/exclusive/
If so, could you give me another leg up?
kovidgoyal 03-09-2008, 02:51 PM yeah you have to parse the HTML page and create the list of articles manually. See the profile in atlantic.py for an example of how to do this.
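To illustrate the general idea of parsing an HTML page and building the article list by hand, here is a standard-library sketch. It is not the approach used in atlantic.py, which is not shown in this thread. The class name TitleLinkParser is made up, and the sample markup is a trimmed copy of the Denver Post snippet quoted earlier; a real page would need its own tag and class names.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (title, url) pairs from <h4 class="itemtitle"><a href=...> blocks."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._in_title = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h4' and attrs.get('class') == 'itemtitle':
            self._in_title = True
        elif tag == 'a' and self._in_title:
            # Start collecting the link text for this article
            self._href = attrs.get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            title = ''.join(self._text).strip()
            self.articles.append((title, self._href))
            self._href = None
        elif tag == 'h4':
            self._in_title = False


sample = '''
<li class="regularitem">
<h4 class="itemtitle">
<a href="http://www.denverpost.com/ci_8088727">
Man hit in crosswalk, killed
</a>
</h4>
</li>
'''

parser = TitleLinkParser()
parser.feed(sample)
print(parser.articles)
```

The resulting list of (title, url) pairs is the raw material; how it is handed to web2lrf depends on the profile API.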
Deputy-Dawg 03-09-2008, 04:32 PM Kovid,
Thanks, I'll take a look.
BTW html2lrf is broken in 4.42. I call it and it just hangs. All I get is a single
>
at the extreme edge of the terminal screen. All I can do is kill the call with a command period. Sigh!
kovidgoyal 03-09-2008, 05:15 PM I'm assuming this is on OS X? I just re-uploaded a new build of 0.4.42 for OS X. Try it.
Deputy-Dawg 03-09-2008, 06:55 PM Yes, I am using Mac OS X (actually Leopard 10.5.2 on a dual core MacBook Pro). Just moments ago I downloaded the most current version of libprs500,
that being 0.4.42, and still have the problem. If there is anything I can do to help track this gremlin down you have but to ask.
kovidgoyal 03-09-2008, 07:17 PM Run it on some simple HTML file with the --verbose switch. What happens?
Deputy-Dawg 03-09-2008, 08:19 PM Kovid,
The results are a tad strange. Allow me to explain. To ensure that I always use the same command when I process a book, I type it out in a text editor and then copy and paste it into the terminal. With that as a premise, and to be sure that the program hung, I copied and pasted the following command line into my terminal and hit return:
html2lrf --verbose --force-page-break-before-tag='page-break' --blank-after-para --base-font-size=8 creeds-000.html
And the terminal hung as I reported previously.
I then entered the following command, where the file creeds-002.html is Chapter one of the book.
html2lrf --verbose creeds-002-html
and darned if the program didn't run.
so then I entered
html2lrf --verbose --page-break-before-tag='page-break' creeds-002.html
and it worked, so then I tried
html2lrf --verbose --blank-after-para creeds-002.html
and again the program worked so I tried
html2lrf --verbose --page-break-before-tag='page-break' creeds-000.html
and
html2lrf --verbose --blank-after-para creeds-000.html
thinking it might have been something in the original file that was the difficulty. Wasn't so. So I finally re-entered my original command
html2lrf --verbose --force-page-break-before-tag='page-break' --blank-after-para --base-font-size=8 creeds-000.html
And it now works as well. The long and the short of it is that since I used just the base command on a very simple file, I can no longer reproduce the problem.
kovidgoyal 03-09-2008, 08:23 PM hmm chalk it up to close encounter of the third kind
balok 03-10-2008, 12:32 PM Thanks, that was the leg up I needed. I have a bit more to do on the profile. When I am done is there any way that I can integrate it into the GUI? I am so darn clumsy in typing! Being 74 with Parkinson's does make life a bit more complicated.
On the other hand the 'need' to learn yet another language is stimulating.
Deputy-Dawg, are you really 74? I've never met a person over 50 who can handle a computer beyond pointing and clicking with difficulty. You must have been a professor or an engineer during your working career.
Deputy-Dawg 03-10-2008, 01:42 PM Deputy-Dawg, are you really 74? I've never met a person over 50 who can handle a computer beyond pointing and clicking with difficulty. You must have been a professor or an engineer during your working career.
Actually I am but 5 weeks from my 75th birthday. My degree is in chemistry (science, not engineering; there are no plumbers in my house!). But that was only because there were no degrees, indeed no courses, in computer science when I went to school. But while I was in college I developed a real love-hate relationship with computers. Being the nerd that I was, at graduation time I had sufficient credits that I could have had my degree in chemistry, which was my stated major; in mathematics, which I am convinced is probably true of the vast majority of people who study the hard sciences; or in philosophy. It was also the first year that student records had been committed to a computer, which decided on the Thursday before graduation (commencements were always on Sunday) that I had not completed my degree requirements. Perhaps I should say we had not completed them: because of the ability to choose between three degrees, it made the assumption that there were three of me. My mother was devastated! I was not thrilled either.
But the following Monday I went and saw my faculty advisor to find out what courses I needed to take to get my degree. She, with a devilish grin, told me that late on Sunday they had realized that there had to be a bug in the program: although it might be barely possible to have three students in a class with the same name, there was just no way that three students, indeed even two, would ever have the same student ID number. She went further to say that one of my math professors, who was long into statistical analysis, was totally incapable of calculating the odds of three students with the same name and the same ID number.
Then she said, cutting to the chase, that I had graduated with the degree in chemistry since that had been my stated major when I had matriculated. Then she told me that she would be happy to enroll me in graduate school.
So, as I said, a love-hate relationship was created. Indeed I threatened on that fateful morn to take a fire axe and cut the computer's power cord into neat 6" lengths. So it continued for some forty years, at the end of which I had become the manager of the engineering computer center of a major US corporation. And there is another story of love and hate!!
Deputy-Dawg 03-10-2008, 08:02 PM Kovid,
In the attached .zip file is the user profile for one of my local newspapers. It used to work. Now all it gets is the TOC, with no articles. What is strange is that the print file addresses are still the same, and the error messages when I run it in the terminal do not contain anything that resembles the URL of the print files. I have enclosed a copy of one such run.
My question is: has the newspaper changed something, or has something changed in libprs500?
kovidgoyal 03-10-2008, 08:09 PM You need to fix the print_version function, the way the feed links to articles seems to have changed.
Deputy-Dawg 03-10-2008, 08:47 PM That's what I thought had happened, but the link to the print version of
http://www.nwaonline.net/articles/2008/03/10/news/031108lrcandidatefiling.txt
is
http://www.nwaonline.net/articles/2008/03/10/news/031108lrcandidatefiling.prt
which is what I would expect the function as written to return. The only difference I can see, if it is different (I am a bit hazy on how it behaved before), is that the print version opens in a new window. I don't think that's an issue, inasmuch as I have seen others where the print version opened in a new window. Darned if I can put my hands on one, though.
kovidgoyal 03-10-2008, 08:56 PM The format of the feed itself has changed. Use
url_search_order = ['link', 'guid']
Deputy-Dawg 03-10-2008, 09:31 PM Thanks again! That fixed it. But... what sort of landmarks should I have been looking for in the source file if a similar problem occurs again? I guess what I am asking for is a more general solution.
kovidgoyal 03-10-2008, 09:40 PM Well, the log has a bunch of error messages about not being able to fetch .prt URLs. That's your clue: it means either that the print_version function no longer works or that the feed format has changed, causing the URL being fed to print_version to be wrong. You can check that by sticking a print url statement into print_version.
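A small sketch of that debugging step, written as a standalone function so it can be tried outside a profile (in a profile it would be a method taking `self` as well). It reuses the `rpartition` approach from earlier in the thread. The check on the extension and the second test URL are illustrative additions, not part of libprs500.

```python
def print_version(url):
    """Map an article URL ending in .txt to its .prt print version."""
    # Debugging aid: show exactly what URL the feed handed us.
    print('print_version received:', url)
    base, dot, ext = url.rpartition('.')
    if not dot or ext != 'txt':
        # Not the shape we expect, so the feed format may have changed.
        print('unexpected URL, returning it unchanged')
        return url
    return base + '.prt'


# An article URL of the form quoted in the thread.
good = print_version(
    'http://www.nwaonline.net/articles/2008/03/10/news/031108lrcandidatefiling.txt')

# A made-up URL standing in for whatever a changed feed might supply.
odd = print_version('http://example.com/feed-item-12345')
```

If the printed URL is not an article address at all, the problem is upstream of `print_version`, which is the case the `url_search_order` setting addresses.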
Deputy-Dawg 03-10-2008, 10:21 PM Great minds in the same gutter, well almost. What I did was to put
return url
in and checked the error log. A little sloppier, but it works. But by the time I came back to report that I had determined what was going on, you had posted the fix. I suppose I should spend a bit of time taking an in-depth look at DefaultProfile to see just what other goodies are there. Again, thanks!
kovidgoyal 03-10-2008, 10:56 PM You should probably hold off for a bit. I'm in the process of re-writing web2lrf to make it much more powerful.
balok 03-11-2008, 08:00 AM I'm in the process of re-writing web2lrf to make it much more powerful.
What kind of changes, or new features, should we expect? Will it handle current custom profiles, or will they need to be rewritten?
kovidgoyal 03-11-2008, 10:34 AM It will handle current profiles, but in any case the old web2lrf code will remain for a long time, so no need to worry.
It will be multithreaded, handle many different feed formats, and have a much more powerful and easy-to-use preprocessing engine, so you don't have to use regexps unless you want to. Eventually, it should be smart enough that if you give it just the URL to a feed, it will go and fetch a reasonably sanitized version of the articles.
EDIT: Oh and I forgot that it will have links at the end of each article back to the table of contents
balok 03-12-2008, 07:17 AM It will handle current profiles, but in any case the old web2lrf code will remain for a long time, so no need to worry.
It will be multithreaded, handle many different feed formats, and have a much more powerful and easy-to-use preprocessing engine, so you don't have to use regexps unless you want to. Eventually, it should be smart enough that if you give it just the URL to a feed, it will go and fetch a reasonably sanitized version of the articles.
EDIT: Oh and I forgot that it will have links at the end of each article back to the table of contents
All of that sounds really cool. A link to the table of contents, in particular, seems like a no brainer, but I never thought of it. It would be nice if the link would bring you to the contents of the current rss feed (and not the first level table of contents). That way if you're reading say international news, you can stay in that section.
kovidgoyal 03-12-2008, 11:30 AM All of that sounds really cool. A link to the table of contents, in particular, seems like a no brainer, but I never thought of it. It would be nice if the link would bring you to the contents of the current rss feed (and not the first level table of contents). That way if you're reading say international news, you can stay in that section.
There's an up one level, up two levels and next and previous links.
DaleDe 03-19-2008, 01:08 PM Deputy-Dawg, are you really 74? I've never met a person over 50 who can handle a computer beyond pointing and clicking with difficulty. You must have been a professor or an engineer during your working career.
You need to get out more.
dale
Necator 05-02-2008, 02:06 AM Hi, i have some difficulties with
1. making libprs500 see the printable_version URL correctly
2. removing the tables.
i would appreciate it if you could guide me.
1.
Article URL : http://www.radikal.com.tr/haber.php?haberno=XXXXX
Printable URL: http://www.radikal.com.tr/yazici.php?haberno=XXXXX
i tried using this:
def print_version(self, url):
    return url.replace('http://www.radikal.com.tr/haber.php?haberno=', 'http://www.radikal.com.tr/yazici.php?haberno=')
however it still downloads content from the Article URL
2. The article page has 3 rows of tables and i want the one in the middle.
Here is an example of the article: "http://www.radikal.com.tr/haber.php?haberno=253962"
i copied some lines from The New York Times recipe and added --ignore-tables; unfortunately it did no good,
html_description = True
html2lrf_options = ['--ignore-tables']
remove_tags_before = dict(name='img' , attrs='src')
remove_tags_after = dict(id='footer')
remove_tags = [dict(attrs={'class':['articleTools', 'post-tools', 'side_tool']}),
dict(id=['footer', 'table', 'navigation', 'archive', 'side_search', 'blog_sidebar', 'side_tool', 'side_index']),
dict(name=['script', 'noscript'])]
what is it that i am doing wrong?? Thanks
Necator 05-02-2008, 02:26 AM Hi, although i am a newbie i happened to jump into the Python language to read my local newspaper. And as expected i need some advice :)
1. i failed to point libprs500 at the print_version URL, so the content comes from the Article URL,
Article URL: http://www.radikal.com.tr/haber.php?haberno=253962
Print_version URL: http://www.radikal.com.tr/yazici.php?haberno=253962
i tried this, which failed:
def print_version(self, url):
    return url.replace('http://www.radikal.com.tr/haber.php?haberno=', 'http://www.radikal.com.tr/yazici.php?haberno=')
2. So i get the feed from the article, and to get the main news body from the HTML i removed the tables, but this time i cannot cut the news body from the rest of the page. i copied the recipe from the manual (The New York Times), which again ended in failure,
html_description = True
html2lrf_options = ['--ignore-tables']
remove_tags_before = dict(name='img' , attrs='src')
remove_tags_after = dict(id='footer')
remove_tags = [dict(attrs={'class':['articleTools', 'post-tools', 'side_tool']}),
dict(id=['footer', 'table', 'navigation', 'archive', 'side_search', 'blog_sidebar', 'side_tool', 'side_index']),
dict(name=['script', 'noscript'])]
what is it that i do wrong? Please lead me, thanks anyway.....
kovidgoyal 05-02-2008, 05:54 AM To get the print version just use
return url.replace('haber.php', 'yazici.php')
Necator 05-02-2008, 07:29 AM Sorry, still getting the text from Article URL.
" This article is downloaded by Libprs500 from http://radikal.com.tr/haber.php?haberno=254668"
Here is my full recipe if it helps:
title = u'Radikal Gazetesi'
oldest_article = 1
max_articles_per_feed = 15
no_stylesheets = True
extra_css = 'h1 {font: sans-serif large;}\n.byline {font:monospace;}'
html_description = True
html2lrf_options = ['--ignore-tables']
remove_tags_before = dict(name='img' , attrs='src')
remove_tags_after = dict(id='copy')
feeds = [(u'Kose', u'http://www.radikal.com.tr/radikal_yazar.xml')]
def print_version (self, url):
return url.replace ('haber.php', 'yazici.php')
Btw, i am not sure if it matters but the print_version URL is:
http://www.radikal.com.tr/yazici.php?haberno=254484&tarih=01/05/2008&yollayan_sayfa='http%3A%2F%2Fwww.radikal.com.tr%2Fhaber.php%3Fhaberno%3D254484'
1. print
2. date
3. sending_page
kovidgoyal 05-02-2008, 07:41 AM You have to indent print_version so it is a part of the class. See attached.
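The attachment is not reproduced in this archive, so here is a rough sketch of why the indentation matters. `DefaultProfile` below is only a stand-in for the real libprs500 base class, and its default `print_version` (returning the URL unchanged) is an assumption made for illustration; the class names `Broken` and `Fixed` are hypothetical.

```python
class DefaultProfile:
    """Stand-in for the real libprs500 base class (hypothetical)."""

    def print_version(self, url):
        # Assumed default behaviour: use the article URL as-is.
        return url


class Broken(DefaultProfile):
    title = 'Radikal Gazetesi'

# Not indented: this is a module-level function, so the class never sees it
# and the inherited default is used instead.
def print_version(self, url):
    return url.replace('haber.php', 'yazici.php')


class Fixed(DefaultProfile):
    title = 'Radikal Gazetesi'

    # Indented: this is now a method that overrides the default.
    def print_version(self, url):
        return url.replace('haber.php', 'yazici.php')


article = 'http://www.radikal.com.tr/haber.php?haberno=253962'
broken_result = Broken().print_version(article)
fixed_result = Fixed().print_version(article)
print(broken_result)
print(fixed_result)
```

With the unindented version Python raises no error, which is why the profile ran but kept fetching the article page instead of the print page.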
Necator 05-02-2008, 10:44 AM Sorry, but i can't open the file. i tried opening it with Notepad and DzSoft. What should i do??
kovidgoyal 05-02-2008, 10:45 AM use notepad++ (google it)
Necator 05-02-2008, 10:49 AM Yep, just got it. i tried WinRAR and extracted "test.py". WinRAR didn't see it automatically. Thank you....
And..
It's alive!! thank you so much.