Help writing profile to get RSS feed - Page 5

Deputy-Dawg · 03-10-2008, 09:02 PM

Kovid,
In the attached .zip file is the user profile for one of my local newspapers. It used to work. Now all it gets is the TOC, no articles. What is strange is that the print file addresses are still the same, and the error messages when I run it in a terminal do not contain anything that resembles the URL of the print files. I have enclosed a copy of one such run.

My question is: has the newspaper changed something, or has something changed in libprs500?

kovidgoyal · 03-10-2008, 09:09 PM

You need to fix the print_version function; the way the feed links to articles seems to have changed.

Deputy-Dawg · 03-10-2008, 09:47 PM

That's what I thought had happened, but the link to the print version of

http://www.nwaonline.net/articles/20...datefiling.txt

is

http://www.nwaonline.net/articles/20...datefiling.prt

which is what I would expect the function as written to return. The only difference I can see, if it is different (I am a bit hazy on how it behaved before), is that the print version opens in a new window. I don't think that's an issue, inasmuch as I have seen others where the print version opened in a new window. Darned if I can put my hands on it though.

kovidgoyal · 03-10-2008, 09:56 PM

The format of the feed itself has changed. Use:

Code:

url_search_order = ['link', 'guid']
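For anyone reading along later, a rough sketch of what a setting like this presumably controls: the names of the feed item elements to try, in order, when looking for an article's URL. That reading is an assumption and is not taken from the libprs500 source. `find_article_url` and the sample item below are hypothetical, and the sketch uses only the Python 3 standard library so it runs without libprs500 installed.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS item: <link> is empty, so the URL has to come from <guid>.
ITEM_XML = """
<item>
  <title>Example article</title>
  <guid>http://example.com/articles/1234.txt</guid>
  <link></link>
</item>
""".strip()


def find_article_url(item, search_order):
    """Return the text of the first non-empty child element named in search_order."""
    for tag in search_order:
        element = item.find(tag)
        if element is not None and element.text and element.text.strip():
            return element.text.strip()
    return None  # none of the listed elements carried a URL


item = ET.fromstring(ITEM_XML)
url = find_article_url(item, ['link', 'guid'])
print(url)
```

If the feed starts putting the article address in a different element, changing the order (or the names) in the list is all that has to change; print_version itself can stay as it was.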

Deputy-Dawg · 03-10-2008, 10:31 PM

Thanks again! That fixed it. But... what sort of landmarks should I have been looking for in the source file if a similar problem occurs again? I guess what I am asking for is a more generalized solution.

kovidgoyal · 03-10-2008, 10:40 PM

Well, the log has a bunch of error messages about not being able to fetch .prt URLs. That's your clue: it means either that the print_version function no longer works or that the feed format has changed, causing the URL being fed to print_version to be wrong. You can check that by sticking a

Code:

print url

into print_version
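A sketch of that check, assuming print_version simply swaps a `.txt` suffix for `.prt` (inferred from the URLs quoted earlier in this thread, not from the attached profile). The class name and the sample URL are made up, and it is written as plain Python 3 so it runs on its own; inside a profile of that era the debug line would be the Python 2 form `print url`.

```python
class NewspaperProfileSketch:
    """Stand-in for a user profile; not the real DefaultProfile subclass."""

    def print_version(self, url):
        # Show exactly what the feed handed over before it is rewritten.
        print('print_version received:', url)
        if url.endswith('.txt'):
            # Swap the article suffix for the printable one.
            return url[:-len('.txt')] + '.prt'
        # Anything unexpected is passed through unchanged.
        return url


profile = NewspaperProfileSketch()
printable = profile.print_version('http://example.com/articles/2008/03/10/story.txt')
print('print_version returned:', printable)
```

If the "received" line shows something that is not an article address at all, the feed format is the likelier culprit; if it shows the right address but the returned one is wrong, the rewrite is.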

Deputy-Dawg · 03-10-2008, 11:21 PM

Great minds in the same gutter, well almost. What I did was to put

Code:

return url

in and checked the error log. A little sloppier, but it works. But by the time I came back to report what I had determined was going on, you had posted the fix. I suppose I should spend a bit of time taking an in-depth look at DefaultProfile and see just what more goodies are there. Again, thanks!

kovidgoyal · 03-10-2008, 11:56 PM

You should probably hold off for a bit. I'm in the process of re-writing web2lrf to make it much more powerful.

balok · 03-11-2008, 09:00 AM

Quote:

Originally Posted by kovidgoyal

I'm in the process of re-writing web2lrf to make it much more powerful.

What kind of changes, or new features, should we expect? Will it handle current custom profiles, or will they need to be rewritten?

kovidgoyal · 03-11-2008, 11:34 AM

It will handle current profiles, but in any case the old web2lrf code will remain for a long time, so no need to worry.

It will be multithreaded, handle many different feed formats, and have a much more powerful and easy-to-use preprocessing engine, so you don't have to use regexps unless you want to. Eventually, it should be smart enough that if you give it just the URL to a feed, it will go and fetch a reasonably sanitized version of the articles.

EDIT: Oh and I forgot that it will have links at the end of each article back to the table of contents

balok · 03-12-2008, 08:17 AM

Quote:

Originally Posted by kovidgoyal

It will handle current profiles, but in any case the old web2lrf code will remain for a long time, so no need to worry.

It will be multithreaded, handle many different feed formats, and have a much more powerful and easy-to-use preprocessing engine, so you don't have to use regexps unless you want to. Eventually, it should be smart enough that if you give it just the URL to a feed, it will go and fetch a reasonably sanitized version of the articles.

EDIT: Oh and I forgot that it will have links at the end of each article back to the table of contents

All of that sounds really cool. A link to the table of contents, in particular, seems like a no-brainer, but I never thought of it. It would be nice if the link would bring you to the contents of the current RSS feed (and not the first-level table of contents). That way, if you're reading, say, international news, you can stay in that section.

kovidgoyal · 03-12-2008, 12:30 PM

Quote:

Originally Posted by balok

All of that sounds really cool. A link to the table of contents, in particular, seems like a no-brainer, but I never thought of it. It would be nice if the link would bring you to the contents of the current RSS feed (and not the first-level table of contents). That way, if you're reading, say, international news, you can stay in that section.

There are "up one level", "up two levels", "next" and "previous" links.

DaleDe · 03-19-2008, 02:08 PM

Quote:

Originally Posted by balok

Deputy-Dawg, are you really 74? I've never met a person over 50 who can handle a computer beyond pointing and clicking with difficulty. You must have been a professor or an engineer during your working career.

You need to get out more.

dale

Necator · 05-02-2008, 03:06 AM

Hi, I am having some difficulty with
1. making libprs500 see the printable-version URL correctly
2. removing the tables.
I would appreciate it if you could guide me.

1.
Article URL: http://www.radikal.com.tr/haber.php?haberno=XXXXX
Printable URL: http://www.radikal.com.tr/yazici.php?haberno=XXXXX

I tried using this:
def print_version(self, url):
    return url.replace('http://www.radikal.com.tr/haber.php?haberno=', 'http://www.radikal.com.tr/yazici.php?haberno=')

However, it still downloads content from the article URL.

2. The article page has 3 rows of tables and I want the one in the middle.
Here is an example of the article: http://www.radikal.com.tr/haber.php?haberno=253962

I copied some lines from The New York Times and added --ignore-tables; unfortunately it did no good:
html_description = True
html2lrf_options = ['--ignore-tables']
remove_tags_before = dict(name='img', attrs='src')
remove_tags_after = dict(id='footer')
remove_tags = [dict(attrs={'class':['articleTools', 'post-tools', 'side_tool']}),
               dict(id=['footer', 'table', 'navigation', 'archive', 'side_search', 'blog_sidebar', 'side_tool', 'side_index']),
               dict(name=['script', 'noscript'])]

What is it that I am doing wrong? Thanks

Necator · 05-02-2008, 03:26 AM

Hi, although I am a newbie, I have jumped into the Python language in order to read my local newspaper. And, as expected, I need some advice.

1. I failed to get libprs500 to use the print_version URL, so the content comes from the article URL.

Article URL: http://www.radikal.com.tr/haber.php?haberno=253962
Print version URL: http://www.radikal.com.tr/yazici.php?haberno=253962

I tried this, which failed:
def print_version(self, url):
    return url.replace('http://www.radikal.com.tr/haber.php?haberno=', 'http://www.radikal.com.tr/yazici.php?haberno=')

2. So I get the feed from the article URL, and to get the main news body from the HTML I removed the tables, but this time I cannot cut the news body from the rest of the page. I copied the recipe from the manual (The New York Times), which again ended in failure:
html_description = True
html2lrf_options = ['--ignore-tables']
remove_tags_before = dict(name='img', attrs='src')
remove_tags_after = dict(id='footer')
remove_tags = [dict(attrs={'class':['articleTools', 'post-tools', 'side_tool']}),
               dict(id=['footer', 'table', 'navigation', 'archive', 'side_search', 'blog_sidebar', 'side_tool', 'side_index']),
               dict(name=['script', 'noscript'])]

What is it that I am doing wrong? Please guide me. Thanks anyway.
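One way to narrow the first problem down is to test the URL rewrite on its own, outside the profile. The sketch below is plain Python 3 with the prefixes copied from the post; the standalone `print_version` function (without `self`) and the constant names are only for illustration.

```python
ARTICLE_PREFIX = 'http://www.radikal.com.tr/haber.php?haberno='
PRINT_PREFIX = 'http://www.radikal.com.tr/yazici.php?haberno='


def print_version(url):
    """Map an article URL to its printable counterpart; other URLs pass through unchanged."""
    return url.replace(ARTICLE_PREFIX, PRINT_PREFIX)


article = 'http://www.radikal.com.tr/haber.php?haberno=253962'
print(print_version(article))
```

If this prints the yazici.php address, the replace itself is fine and the cause lies elsewhere. Two things that might be worth checking (guesses, not a diagnosis): whether the method is indented inside the profile class so that it actually overrides the default, and whether the URLs in the feed really start with exactly that prefix, since `replace` silently returns the original string when nothing matches.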



