web2lrf - Page 22

Ben_B · 04-28-2008, 01:37 AM

Thanks... I wasn't aware that this changed. This may take me a while as I learn how to write "recipes". I tried making some quick changes using the new recipe format (BasicNewsRecipe), but I must be doing something wrong, as I consistently receive the following error...

IndexError: list index out of range
Failed to perform job: Fetch news from The Globe and Mail
Detailed traceback:
Traceback (most recent call last):
File "parallel.py", line 139, in run_job
File "libprs500\ebooks\lrf\feeds\convert_from.pyo", line 40, in main
File "libprs500\web\feeds\main.pyo", line 134, in run_recipe
File "libprs500\web\feeds\news.pyo", line 466, in download
File "libprs500\web\feeds\news.pyo", line 603, in build_index
File "d:\temp\libprs500_0.4.49_r_7fws_recipes\recipe0.py", line 39, in print_version
IndexError: list index out of range
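
The traceback ends inside print_version, which fits a URL that lacks the pieces the function indexes into. A minimal sketch, assuming the function resembles the one-liner Ben_B posts later in this thread; both sample URLs are hypothetical, and returning None for an unexpected URL is an illustrative choice, not calibre's documented behaviour:

```python
PRINT_BASE = 'http://www.globeinvestor.com/servlet/ArticleNews/print/'


def print_version(url):
    """Same string surgery as the recipe's one-liner, split into named steps.

    Returns None instead of raising IndexError when the URL has an
    unexpected shape (an assumption made for this sketch).
    """
    # url.split('/story/', 1)[1] raises IndexError if '/story/' is absent.
    if '/story/' not in url:
        return None
    # url.rsplit('.', 3)[3] raises IndexError if the URL has fewer than 3 dots.
    parts = url.rsplit('.', 3)
    if len(parts) < 4:
        return None
    story_id = url.split('/story/', 1)[1].split('.', 1)[0]
    return PRINT_BASE + story_id + '/' + parts[2] + '/' + parts[3]


# Hypothetical article URL in the shape the one-liner expects.
story_url = ('http://www.theglobeandmail.com/servlet/story/'
             'RTGAM.20080522.wexample/BNStory/National/home')
# Hypothetical redirect-style URL with no '/story/' segment.
redirect_url = 'http://feeds.example.com/~r/front/~3/12345/'

print(print_version(story_url))
print(print_version(redirect_url))
```

With the unguarded one-liner, the second URL would raise the same IndexError shown in the traceback.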

Bubble · 05-03-2008, 11:01 PM

Hope you guys updated to the newest version! Globe n Mail is now supported in calibre. I have not looked at it in detail yet, however, due to other priorities.

Thanks kovidgoyal.

moneytoo · 05-08-2008, 03:46 PM

Code:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 113: ordinal not in range(128)
Failed to perform job: Fetch news from Reuters
Detailed traceback:
Traceback (most recent call last):
  File "parallel.py", line 139, in run_job
  File "calibre\ebooks\lrf\feeds\convert_from.pyo", line 40, in main
  File "calibre\web\feeds\main.pyo", line 128, in run_recipe
  File "calibre\web\feeds\news.pyo", line 810, in __init__
  File "calibre\ebooks\lrf\web\profiles\__init__.pyo", line 174, in __init__
  File "calibre\ebooks\lrf\web\profiles\__init__.pyo", line 225, in build_index
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 113: ordinal not in range(128)
Log:
Fetching feeds...

I cannot convert a single news feed using either the calibre GUI or web2lrf. Every time I get this UnicodeDecodeError, no matter what site it parses.
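
The message above is what Python raises whenever a byte string holding a non-ASCII byte is decoded with the ascii codec. A minimal reproduction in Python 3 follows (calibre at the time ran on Python 2, where this decode could happen implicitly when byte strings and unicode strings were mixed). The sample text, the helper name decode_feed_text, and the list of fallback encodings are assumptions for illustration, not calibre's actual fix:

```python
# Hypothetical text containing "è", which is byte 0xE8 in Latin-1 / cp1252.
raw = "Cr\u00e8me br\u00fbl\u00e9e".encode("latin-1")


def decode_feed_text(data, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in turn; replace bad bytes as a last resort."""
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    return data.decode("ascii", errors="replace")


# Decoding as ASCII fails on the first byte above 0x7F.
try:
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as err:
    message = str(err)
    print(message)

# Decoding with an explicit list of candidate encodings succeeds.
text = decode_feed_text(raw)
print(text)
```

The point of the sketch is only that the byte value and position in the error identify the first non-ASCII byte, not which encoding the data was really in.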

kovidgoyal · 05-08-2008, 05:27 PM

Try the next release, it has a possible fix for this. It should be out in a couple of days.

Rick C · 05-09-2008, 12:48 AM

I have been using v4.51 for a couple of days and the Globe feed is working well for me, although it only retrieves the first page of any given story.

kovidgoyal · 05-09-2008, 01:17 PM

That's probably because it needs a subscription, which I don't have. I actually wrote that recipe as a guide for Bubble, in the hopes he'd improve it and share the result.

Bubble · 05-10-2008, 12:48 AM

I noticed that too, Rick C, when I finally got around to testing it.

The link that I had for the Globe and Mail profile is broken (from a private message). The online help file for web2lrf also points to a broken link when attempting to browse the default profiles. When you have the time, could you please take a look at it, kovidgoyal?

I still have a faint image of the profile from when I first saw it. To be honest, the code is way above my understanding at this point in time. As such, I doubt I can tweak it to perfection... But maybe Ben_B can?

kovidgoyal · 05-10-2008, 01:18 AM

Fixed the links.

Ben_B · 05-22-2008, 01:31 AM

As for the links to the full stories from the Globe and Mail, I was using the following function to retrieve the full stories from the Globe Investor web site in the profile I posted earlier. The Globe Investor produces a very nice printed version without any extra HTML. I was using the function to create printed versions of the news stories from the Globe and Mail RSS feeds (i.e., http://www.theglobeandmail.com/gener...s/BN/Front.xml).

def print_version(self, url):
    return 'http://www.globeinvestor.com/servlet/ArticleNews/print/' + (url.split('/story/',1)[1]).split('.',1)[0] + '/' + url.rsplit('.',3)[2] + '/' + url.rsplit('.',3)[3]

The problem I ran into is that most of the full stories are contained within the tag <feedburner:origLink>. With the old libprs500, I was using url_search_order = ['feedburner:origlink']. This seemed to work; however, this variable no longer seems to exist in Calibre's Basic News Recipe. I can't seem to figure out how to make Calibre follow the links contained within the <feedburner:origLink> tags. I'm guessing I will need to process this somehow through another function?

kovidgoyal · 05-22-2008, 11:44 AM

Yeah

Code:

   def get_article_url(self, article):
        return article.get('feedburner_origlink', None)
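
A standalone sketch of the lookup, assuming the parsed feed entry behaves like a dictionary with a 'link' key and, for FeedBurner feeds, a 'feedburner_origlink' key. Plain dicts stand in for the real entry objects here, and the URLs are hypothetical. It falls back to the ordinary link instead of None when the FeedBurner key is missing:

```python
def get_article_url(article):
    # Prefer the original article link recorded by FeedBurner;
    # fall back to the entry's ordinary link when it is absent.
    return article.get('feedburner_origlink', article.get('link'))


# Hypothetical entries: one from a FeedBurner feed, one from a plain feed.
redirected = {
    'link': 'http://feeds.example.com/~r/front/~3/1/',
    'feedburner_origlink': 'http://www.example.com/story/1',
}
plain = {'link': 'http://www.example.com/story/2'}

print(get_article_url(redirected))
print(get_article_url(plain))
```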

Ben_B · 05-23-2008, 02:41 PM

Here is my personal profile for the Globe and Mail I use for my PRS-505. I'm not a coder so there is probably plenty of room for improvement. The only problem I have is that I cannot change the text size while viewing it on the Reader. When opening the e-book file, the Reader defaults to S sized text. Attempting to change the size to M or L causes my Reader to crash and restart. My firmware is ver. 1.0.00.08130.

Code:

import re

from calibre.web.feeds.news import BasicNewsRecipe

class GlobeMail(BasicNewsRecipe): 

	title = 'The Globe and Mail' 
	html_description = False
	use_pubdate = True
	oldest_article = 7
	use_embedded_content = False
	max_articles_per_feed = 10
	simultaneous_downloads = 1
	no_stylesheets = True
	summary_length = 300
	html2lrf_options = ['--base-font-size', '9'] 

	preprocess_regexps =  [
		
		(re.compile(r'<script.*?</script>', re.IGNORECASE | re.DOTALL), lambda match : ' '),
		(re.compile(r'<style.*?</style>', re.IGNORECASE | re.DOTALL), lambda match : '<style> </style>'),
		(re.compile(r'<body class="subscribe.*?<div id="articleAbstract">', re.IGNORECASE | re.DOTALL), lambda match : '<body><div>'),
		(re.compile(r'<ul class="columnistInfo">.*?</ul>', re.IGNORECASE | re.DOTALL), lambda match : ''),
		(re.compile(r'<p class="note".*?</body>', re.IGNORECASE | re.DOTALL), lambda match : '<br><br>Subscription required to read full story</body>'),
		(re.compile(r'<p class="deck"></p>', re.IGNORECASE | re.DOTALL), lambda match : ' '),
		(re.compile(r'<p class="byline"></p>', re.IGNORECASE | re.DOTALL), lambda match : ' '),
		(re.compile(r'<p class="date"></p>', re.IGNORECASE | re.DOTALL), lambda match : ' '),
		(re.compile(r'<p><a href="http://www.globeinvestor.com/">.*?<h2', re.IGNORECASE | re.DOTALL), lambda match : '<h2'),
		(re.compile(r'<h1 class="keyline">.*?</h1>', re.IGNORECASE | re.DOTALL), lambda match : ' '),
		(re.compile(r'<p class="date">.*?<(\S+)>', re.IGNORECASE | re.DOTALL), lambda match : match.group().replace(match.group(1), '/p><br') ),
		(re.compile(r'<a href.*? target="offsite">', re.IGNORECASE | re.DOTALL), lambda match : '<a name="#">'),
		(re.compile(r'<tr>', re.IGNORECASE | re.DOTALL), lambda match : '<br>'),
		(re.compile(r'<td>', re.IGNORECASE | re.DOTALL), lambda match : ' '),
		(re.compile(r'</tr>', re.IGNORECASE | re.DOTALL), lambda match : ' '),
		(re.compile(r'</td>', re.IGNORECASE | re.DOTALL), lambda match : '  '),
		(re.compile(r'<hr>', re.IGNORECASE | re.DOTALL), lambda match : ' '),
		(re.compile(r'<!-- /frag.../copyright begins -->', re.IGNORECASE | re.DOTALL), lambda match : '<br><!-- /frag.../copyright begins --><br>'),
		]

	def get_article_url(self, article):
		return article.get('feedburner_origlink', article.link)

	def print_version(self, url): 
		return 'http://www.globeinvestor.com/servlet/ArticleNews/print/' + (url.split('/story/',1)[1]).split('.',1)[0] + '/' + url.rsplit('.',3)[2] + '/' + url.rsplit('.',3)[3]

	def get_feeds(self):
		return [
		('  A. Front Page', 'http://www.theglobeandmail.com/generated/rss/BN/Front.xml'),
		('  B. British Columbia', 'http://www.theglobeandmail.com/generated/rss/BN/HYBritishColumbia.xml'),
		('  C. National', 'http://www.theglobeandmail.com/generated/rss/BN/National.xml'),
		('  D. World', 'http://www.theglobeandmail.com/generated/rss/BN/International.xml'),
		('  E. Americas', 'http://www.theglobeandmail.com/generated/rss/BN/HYAmerica.xml'),
		('  F. Report on Business', 'http://www.theglobeandmail.com/generated/rss/BN/Business.xml'),
		('  G. Energy News', 'http://www.theglobeandmail.com/generated/rss/BN/energy.xml'),
		('  H. Your Money', 'http://www.theglobeandmail.com/generated/rss/BN/SpecialEvents2.xml'),
		('  I. Sports', 'http://www.theglobeandmail.com/generated/rss/BN/Sports.xml'),
		('  J. The Arts', 'http://www.theglobeandmail.com/generated/rss/BN/Entertainment.xml'),
		('  K. Movies', 'http://www.theglobeandmail.com/generated/rss/BN/HYMovies.xml'),
		('  L. Music', 'http://www.theglobeandmail.com/generated/rss/BN/HYMusic.xml'),
		('  M. Technology', 'http://www.theglobeandmail.com/generated/rss/BN/Technology.xml'),
		('  N. Science', 'http://www.theglobeandmail.com/generated/rss/BN/Science.xml'),
		('  O. Life', 'http://www.theglobeandmail.com/generated/rss/BN/lifeMain.xml'),
		('  P. Food & Wine', 'http://www.theglobeandmail.com/generated/rss/BN/lifeFoodWine.xml'),
		('  Q. Travel', 'http://www.theglobeandmail.com/generated/rss/BN/specialTravel.xml'),
		('  R. Health', 'http://www.theglobeandmail.com/generated/rss/BN/specialScienceandHealth.xml'),
		]
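
For readers unfamiliar with the preprocess_regexps format: each entry pairs a compiled pattern with a function that returns the replacement text for a match. A minimal sketch of how such a list can be applied to downloaded HTML, assuming the rules run in order, each on the output of the previous one; the helper name apply_rules and the sample HTML are hypothetical, and this is not calibre's actual implementation:

```python
import re

# Three of the rules from the recipe above, copied verbatim.
rules = [
    (re.compile(r'<script.*?</script>', re.IGNORECASE | re.DOTALL), lambda match : ' '),
    (re.compile(r'<tr>', re.IGNORECASE | re.DOTALL), lambda match : '<br>'),
    (re.compile(r'<td>', re.IGNORECASE | re.DOTALL), lambda match : ' '),
]


def apply_rules(html, rules):
    """Run every (pattern, replacement function) pair over the HTML in order."""
    for pattern, replacement in rules:
        html = pattern.sub(replacement, html)
    return html


# Hypothetical page fragment; DOTALL lets the script rule span the newline.
sample = '<SCRIPT>var x = 1;\n</SCRIPT><tr><td>Headline</td></tr>'
cleaned = apply_rules(sample, rules)
print(repr(cleaned))
```

Because each rule sees the output of the one before it, order matters: a rule that rewrites a tag can prevent a later rule from matching it.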

kovidgoyal · 05-23-2008, 02:50 PM

Yeah, the font size thing is a bug in Sony's firmware, which hopefully they will fix. Are the articles the full-length ones? Or do you need a subscription for that?

Ben_B · 05-23-2008, 03:19 PM

I'd say at least 90% of the articles are full-length. Most of the subscription articles are movie or restaurant reviews. I did a quick review of the articles I downloaded this morning...

A Front Page = 9/9 are full length
B British Columbia = 8/10 full length
C National = 10/10 full length
D World = 10/10 full length
E Americas = 10/10 full length

I didn't go through the rest, but I do recall seeing a couple more subscription articles under Movies.

moneytoo · 05-30-2008, 08:18 AM

I waited a few weeks and downloaded the latest version of calibre today. I just tried fetching a few feeds, but most of them just don't work...

Code:

Associated Press              UnicodeDecodeError
The Atlantic                  OK
The BBC                       OK
Business Week                 URLError
CNN                           UnicodeDecodeError
Christian Science Monitor     UnicodeDecodeError
Die Zeit Nachrichten          UnicodeDecodeError
The Economist                 OK
FAZ NET                       UnicodeDecodeError
Globe and Mail                OK
Jerusalem Post                UnicodeDecodeError
Jutarnji                      UnicodeDecodeError
NASA                          UnicodeDecodeError
New York Review of Books      UnicodeDecodeError
The New Yorker                UnicodeDecodeError
Newsweek                      OK
Outlook India                 OK
Portfolio                     OK
Reuters                       UnicodeDecodeError
Spiegel Online                UnicodeDecodeError
Sydney Morning Herald         OK
USA Today                     OK
United Press International    UnicodeDecodeError
Washington Post               UnicodeDecodeError
Wired.com                     OK

Unfortunately I still have difficulties converting sites using web2lrf...

Code:

c:\Program Files\calibre>web2lrf -u http://www.mobilmania.mobi -r 1 default
Downloading
. . .Could not fetch stylesheet http://klub.zive.cz/passport/ /Client.StyleSheet
s/common.css
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . .

http://www.mobilmania.mobi saved to c:\docume~1\marcel~1\locals~1\temp\calibre_w
seyry_web2lrf\index.html
Traceback (most recent call last):
  File "convert_from.py", line 182, in <module>
  File "convert_from.py", line 176, in main
  File "convert_from.py", line 146, in process_profile
  File "ntpath.pyo", line 102, in join
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 19: ordinal
 not in range(128)

kovidgoyal · 05-30-2008, 11:34 AM

I assume you're using a localized (non-English) version of Windows?


