Thread: web2lrf
Old 11-25-2007, 08:55 PM   #70
DaveNB
Connoisseur
Posts: 86
Karma: 399
Join Date: Jun 2007
Device: Nook, Sony PRS-500, Nokia 770 (FBReader)
Got Wired.com RSS feeds working....

Try this script. Copy the text below the ------- line and save it in a file called "wired.py". Running it through web2lrf will produce a file named something like:
Wired [2007Nov25  1720].lrf (for example).

I think it's producing pretty clean text for reading off-line (most ads, links, banners, comments, and other cruft are removed), but there are still some formatting issues: some fonts are too big, others way too small. Maybe I need to kill all CSS info in the <head> sections completely?
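
On the CSS question, here is a rough sketch of what stripping the styling might look like, using only the standard `re` module. The `strip_css` helper, the `CSS_RULES` name, and the sample HTML are all made up for illustration, and none of this has been tried against real Wired pages. The rules are written in the same (pattern, lambda) shape as the preprocess_regexps entries, so in principle they could be added to that list.

```python
import re

# Hypothetical extra rules (not part of the profile), in the same
# (pattern, replacement function) shape as preprocess_regexps.
CSS_RULES = [(re.compile(p, re.IGNORECASE | re.DOTALL), f) for p, f in [
    # Drop whole <style>...</style> blocks.
    (r'<style\b.*?</style>', lambda match: ''),
    # Drop <link ... stylesheet ...> tags that pull in external CSS.
    (r'<link\b[^>]*stylesheet[^>]*>', lambda match: ''),
    # Drop inline style="..." attributes.
    (r'\s+style\s*=\s*"[^"]*"', lambda match: ''),
]]

def strip_css(html):
    """Apply each CSS-removal rule in order and return the cleaned HTML."""
    for pattern, func in CSS_RULES:
        html = pattern.sub(func, html)
    return html

# Made-up sample page, only to exercise the three rules.
sample = ('<html><head><style type="text/css">p {font-size: 40px}</style>'
          '<link rel="stylesheet" href="a.css"></head>'
          '<body><p style="font-size: 6px">Hello</p></body></html>')
print(strip_css(sample))
```

The inline-style rule only handles double-quoted attributes, so pages that use single quotes would need another pattern.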

BTW, if you change the wired.py user profile, delete the previously generated wired.pyc file before running the web2lrf command again, or your changes won't be reflected (I think).
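
If deleting the .pyc by hand gets tedious, something like this small helper could do it before each run. It's only a sketch: `remove_stale_pyc` is a name I made up, and it assumes the .pyc sits next to wired.py in the same directory.

```python
import os

def remove_stale_pyc(profile_path):
    """Delete the compiled .pyc next to profile_path, if there is one.

    Returns True if a file was removed, False otherwise.
    """
    pyc_path = os.path.splitext(profile_path)[0] + '.pyc'
    if os.path.exists(pyc_path):
        os.remove(pyc_path)
        return True
    return False

# Assumes wired.py lives in the current directory.
remove_stale_pyc('wired.py')
```

Run it from the directory holding wired.py, then run the web2lrf command as usual.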

Any suggestions for cleaning up the text formatting? Give it a try.

Dave

-------
Code:
# coding: ISO-8859-1
##    Copyright (C) 2007 David Chen SonyReader<at>DaveChen<dot>org
##
##    This program is free software; you can redistribute it and/or modify
##    it under the terms of the GNU General Public License as published by
##    the Free Software Foundation; either version 2 of the License, or
##    (at your option) any later version.
##
##    This program is distributed in the hope that it will be useful,
##    but WITHOUT ANY WARRANTY; without even the implied warranty of
##    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
##    GNU General Public License for more details.
##
##	Version 0.6-2007-11-27
##	Based on newsweek.py, bbc.py, nytimes.py by Kovid Goyal
##	https://libprs500.kovidgoyal.net/wiki/UserProfiles
##
##	Usage:
##	>web2lrf --user-profile wired.py
##	Comment out the RSS feeds you don't want in the last section below
##
##	Output:
##	Wired [YearMonthDate Time].lrf
##
'''
Profile to download RSS News Feeds and Articles from Wired.com
'''

import re

from libprs500.ebooks.lrf.web.profiles import DefaultProfile 
	
class wired(DefaultProfile):
   
	title = 'Wired'
	max_recursions = 2
	timefmt  = ' [%Y%b%d  %H%M]'
	html_description = True
	no_stylesheets = True
	
	## Don't grab articles more than 7 days old
	oldest_article = 7
  
	preprocess_regexps = [(re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in 
		[

		## Remove any banners/links/ads/cruft before the body of the article.
		(r'<body.*?((<div id="article_body">)|(<div id="st-page-maincontent">)|(<div id="containermain">)|(<p class="ap-story-p">)|(<!-- img_nav -->))', lambda match: '<body><div>'),

		## Remove any links/ads/comments/cruft from the end of the body of the article.
		(r'((<!-- end article content -->)|(<div id="st-custom-afterpagecontent">)|(<p class="ap-story-p">&copy;)|(<div class="entry-footer">)|(<div id="see_also">)|(<p>Via <a href=)|(<div id="ss_nav">)).*?</html>', lambda match : '</div></body></html>'),

		## Correctly embed in-line images
		(r'<a.*?onclick.*?>.*?(<img .*?>)', lambda match: match.group(1)),

		## Correct the apostrophe character so it renders well in LRF
		(r'’', lambda match: "'"),
		]
	]

## Use the single-page Print version of a page when available.
## Not all RSS entries have Print versions, i.e. the ones hosted on blog.wired.com URLs

	def print_version(self, url):
		return url.replace('http://www.wired.com/', 'http://www.wired.com/print/')

## Comment out the feeds you don't want retrieved,
## or add any new RSS feed URLs here.

	def get_feeds(self):
		return	[
		('Top News', 'http://feeds.wired.com/wired/index'),
		('Culture', 'http://feeds.wired.com/wired/culture'),
		('Software', 'http://feeds.wired.com/wired/software'),
		('Mac', 'http://feeds.feedburner.com/cultofmac/bFow'),
		('Gadgets', 'http://feeds.wired.com/wired/gadgets'),
		('Cars', 'http://feeds.wired.com/wired/cars'),
		('Entertainment', 'http://feeds.wired.com/wired/entertainment'),
		('Gaming', 'http://feeds.wired.com/wired/gaming'),
		('Science', 'http://feeds.wired.com/wired/science'),
		('Med Tech', 'http://feeds.wired.com/wired/medtech'),
		('Politics', 'http://feeds.wired.com/wired/politics'),
		('Tech Biz', 'http://feeds.wired.com/wired/techbiz'),
		('Commentary', 'http://feeds.wired.com/wired/commentary'),
		]
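
To try out the cleanup rules without running a full download, a standalone check along these lines may help. It is a sketch only: the patterns are trimmed-down copies of three rules from the profile, `clean` and the sample HTML are invented for the test, and `print_version` is rewritten as a plain function so that libprs500 isn't needed.

```python
import re

# Trimmed copies of three profile rules (fewer alternatives per pattern).
RULES = [(re.compile(p, re.IGNORECASE | re.DOTALL), f) for p, f in [
    # Cut everything between <body> and the start of the article.
    (r'<body.*?((<div id="article_body">)|(<p class="ap-story-p">))',
     lambda match: '<body><div>'),
    # Cut everything from the end-of-article marker to </html>.
    (r'((<!-- end article content -->)|(<div id="see_also">)).*?</html>',
     lambda match: '</div></body></html>'),
    # Replace an onclick link plus its image with just the image tag.
    (r'<a.*?onclick.*?>.*?(<img .*?>)', lambda match: match.group(1)),
]]

def clean(html):
    """Apply the rules in order, the way a preprocessing pass would."""
    for pattern, func in RULES:
        html = pattern.sub(func, html)
    return html

def print_version(url):
    """Same URL rewrite as the profile method, as a plain function."""
    return url.replace('http://www.wired.com/', 'http://www.wired.com/print/')

# Made-up page with a nav block, an article, and a footer.
sample = ('<html><body><div id="nav">ads</div><div id="article_body">'
          '<p>Story text.</p><a href="#" onclick="zoom()"><img src="x.jpg"></a>'
          '<!-- end article content --><div id="footer">links</div></body></html>')

print(clean(sample))
print(print_version('http://www.wired.com/gadgets/story'))
print(print_version('http://blog.wired.com/post'))
```

On this sample the image rule keeps the <img> tag but leaves the closing </a> behind, which might be one source of odd link rendering. The blog.wired.com URL comes back unchanged because the replace only matches the www.wired.com prefix.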

Last edited by DaveNB; 11-28-2007 at 02:24 AM. Reason: Updated to v0.6 - Improved font size rendering (by not using the original .css), improved inline image and link name handling