Old 11-22-2007, 11:31 PM   #61
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 43,835
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Yeah, you'd have to work out from the source code which arguments the BBC profile passes to web2disk, and then pass them manually on the command line.
Old 11-23-2007, 10:44 PM   #62
veshman
Member
veshman began at the beginning.
 
Posts: 13
Karma: 10
Join Date: Nov 2007
Device: Sony 505
I'm trying to write a converter for Wired magazine. I am totally new to Python. How can I insert /print/ into the first URL below to get the second?


http://www.wired.com/gadgets/digital...rning_question

http://www.wired.com/print/gadgets/digitalcameras/magazine/test2007/dc_burning_question

I'm thinking something like this might work, but I don't know how to make the latter part of the URL a variable that I can put back into the string.

return url.replace('wired.com/?', 'wired.com/print/?')

thanks,

bhavesh
Old 11-24-2007, 05:18 AM   #63
FixB
Groupie
FixB has a complete set of Star Wars action figures.
Posts: 186
Karma: 499
Join Date: Oct 2007
Location: France, Toulouse
Device: Sony PRS500
Sorry, veshman: I'm having the same difficulties with some French RSS feeds.
I would have thought your suggestion should work. Maybe you don't need the "?", since you are just replacing wired.com with wired.com/print?
By the way, does anyone know how I can keep (and access) the intermediate HTML files when using web2lrf, so that I can see exactly where my regular expressions go wrong?
Old 11-24-2007, 05:27 AM   #64
FixB
Groupie
FixB has a complete set of Star Wars action figures.
Posts: 186
Karma: 499
Join Date: Oct 2007
Location: France, Toulouse
Device: Sony PRS500
I tried it, and it seems that:
Quote:
def print_version(self, url):
    return url.replace('wired.com','wired.com/print')
works correctly.
But strangely, not for all articles. The first one seems OK, but the second one comes out in the normal (non-print) format. Strange.
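
For anyone following along, here is a minimal sketch of the rewrite being discussed, in plain Python outside web2lrf. str.replace matches its first argument literally, so the "?" in the earlier attempt is not a wildcard, and everything after the matched text is kept automatically without needing a variable. The helper name to_print_url is made up for illustration. Anchoring on the full http://www.wired.com/ prefix is just one way to leave other hosts alone, and it is only a guess that other hosts are what trips up some articles.

```python
WIRED_PREFIX = 'http://www.wired.com/'

def to_print_url(url):
    """Insert 'print/' right after the www.wired.com host, if it applies."""
    already_print = url.startswith(WIRED_PREFIX + 'print/')
    if url.startswith(WIRED_PREFIX) and not already_print:
        # The rest of the URL is carried over unchanged by the slice.
        return WIRED_PREFIX + 'print/' + url[len(WIRED_PREFIX):]
    return url  # other hosts (or already-rewritten URLs) are left alone

article = ('http://www.wired.com/gadgets/digitalcameras/magazine/'
           'test2007/dc_burning_question')

# '?' is an ordinary character for str.replace, so nothing matches here
# and the URL comes back unchanged.
unchanged = article.replace('wired.com/?', 'wired.com/print/?')

print(to_print_url(article))
print(unchanged == article)
```

Calling to_print_url twice on the same URL is harmless, because a URL that already starts with the print prefix is returned as-is.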
Old 11-24-2007, 09:29 AM   #65
veshman
Member
veshman began at the beginning.
 
Posts: 13
Karma: 10
Join Date: Nov 2007
Device: Sony 505
perhaps it should be:

return url.replace('wired.com','wired.com/print/')

with a second "/"

I'll give it a try.

Also, any thoughts on how to keep web2lrf from pursuing external links (e.g. ads)?

thanks,

bhavesh
Old 11-24-2007, 09:43 AM   #66
veshman
Member
veshman began at the beginning.
 
Posts: 13
Karma: 10
Join Date: Nov 2007
Device: Sony 505
So the URL now comes out correctly with the url.replace function, but for some reason web2lrf can't process the link.

Quote:
Processing category6.html
Parsing HTML...
Converting to BBeB...
Could not follow link to http://www.wired.com/print/science/d...1/st_alphageek
If I just copy and paste the URL into a web browser, it works fine.

Bhavesh
Old 11-24-2007, 10:22 AM   #67
veshman
Member
veshman began at the beginning.
 
Posts: 13
Karma: 10
Join Date: Nov 2007
Device: Sony 505
Using the url.replace code did work with the addition of the "/", but web2lrf was unable to follow the link, even though it built it correctly.

That is, if I copy the link that web2lrf is trying to fetch and paste it into a browser, it works fine.
Old 11-24-2007, 10:25 AM   #68
veshman
Member
veshman began at the beginning.
 
Posts: 13
Karma: 10
Join Date: Nov 2007
Device: Sony 505
On the link-exclusion front, I tried adding an option to the script, but so far I haven't figured it out.

link-exclude = [^wired]

or
link-exclude = ^w^i^r^e^d
or
link-exclude = *[^wired]*

and a number of other failed attempts that give me a syntax error.
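
A note on why these fail, with a sketch in plain Python. A Python name cannot contain a hyphen, so a line starting with link-exclude = is a syntax error whatever follows it. Separately, [^wired] is a character class: it matches any single character that is not w, i, r, e or d, rather than "anything other than the word wired". One way to express "not a wired.com link" is a negative lookahead, as below. The names EXTERNAL_LINK and is_external are invented here, the sample URLs are placeholders, and how web2lrf expects to receive such a pattern is not shown in this thread, so treat this only as an illustration of the regex itself.

```python
import re

# Matches an http(s) URL whose host is neither wired.com nor a subdomain of it.
EXTERNAL_LINK = re.compile(
    r'^https?://(?!(?:[^/]+\.)?wired\.com(?:[/:]|$))', re.IGNORECASE)

def is_external(url):
    return EXTERNAL_LINK.match(url) is not None

# [^wired] keeps every character except w, i, r, e and d.
leftover = re.findall(r'[^wired]', 'wired.com')

# A pattern cannot start with '*': there is nothing for it to repeat.
try:
    re.compile(r'*[^wired]*')
    star_is_valid = True
except re.error:
    star_is_valid = False

print(is_external('http://ads.example.com/banner'))
print(is_external('http://www.wired.com/print/science/'))
print(leftover, star_is_valid)
```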
Old 11-24-2007, 12:15 PM   #69
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 43,835
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I'm on my Thanksgiving break right now, so I can't help in detail, but you may find this page helpful:

http://docs.python.org/lib/re-syntax.html
Old 11-25-2007, 08:55 PM   #70
DaveNB
Connoisseur
DaveNB has a complete set of Star Wars action figures.
 
Posts: 86
Karma: 399
Join Date: Jun 2007
Device: Nook, Sony PRS-500, Nokia 770 (FBReader)
Got Wired.com RSS feeds working....

Try this script. Copy the text below the ------ line and paste it into a file called "wired.py". Running it produces a file named, for example:
Wired RSS [25 Nov 2007 1720].lrf

I think it produces pretty clean text for offline reading (most ads, links, banners, comments and other cruft are removed), but there are still some formatting issues: some fonts are too big and others way too small. Maybe I need to strip all the CSS info in the <header> sections completely?

BTW, if you make any changes to the user profile wired.py file, before running the web2lrf command, delete the previously generated wired.pyc file or your changes won't be reflected (I think).

Any suggestions for cleaning up the text formatting? Give it a try.

Dave

-------
Code:
# coding: ISO-8859-1
##    Copyright (C) 2007 David Chen SonyReader<at>DaveChen<dot>org
##
##    This program is free software; you can redistribute it and/or modify
##    it under the terms of the GNU General Public License as published by
##    the Free Software Foundation; either version 2 of the License, or
##    (at your option) any later version.
##
##    This program is distributed in the hope that it will be useful,
##    but WITHOUT ANY WARRANTY; without even the implied warranty of
##    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
##    GNU General Public License for more details.
##
##	Version 0.6-2007-11-27
##	Based on newsweek.py, bbc.py, nytimes.py by Kovid Goyal
##	https://libprs500.kovidgoyal.net/wiki/UserProfiles
##
##	Usage:
##	>web2lrf --user-profile wired.py
##	Comment out the RSS feeds you don't want in the last section below
##
##	Output:
##	Wired [YearMonthDate Time].lrf
##
'''
Profile to download RSS News Feeds and Articles from Wired.com
'''

import re

from libprs500.ebooks.lrf.web.profiles import DefaultProfile 
	
class wired(DefaultProfile):
   
	title = 'Wired'
	max_recursions = 2
	timefmt  = ' [%Y%b%d  %H%M]'
	html_description = True
	no_stylesheets = True
	
	## Don't grab articles more than 7 days old
	oldest_article = 7
  
	preprocess_regexps = [(re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in 
		[

		## Remove any banners/links/ads/cruft before the body of the article.
		(r'<body.*?((<div id="article_body">)|(<div id="st-page-maincontent">)|(<div id="containermain">)|(<p class="ap-story-p">)|(<!-- img_nav -->))', lambda match: '<body><div>'),

		## Remove any links/ads/comments/cruft from the end of the body of the article.
		(r'((<!-- end article content -->)|(<div id="st-custom-afterpagecontent">)|(<p class="ap-story-p">&copy;)|(<div class="entry-footer">)|(<div id="see_also">)|(<p>Via <a href=)|(<div id="ss_nav">)).*?</html>', lambda match : '</div></body></html>'),

		## Correctly embed in-line images
		(r'<a.*?onclick.*?>.*?(<img .*?>)', lambda match: match.group(1),),

		## Correct the apostrophe character so it renders well in LRF
		(r'’', lambda match: "'"),
		]
	]

## Use the single-page Print version of a page when available.
## Not all RSS entries have Print versions, e.g. the ones hosted on blog.wired.com URLs

	def print_version(self, url):
		return url.replace('http://www.wired.com/', 'http://www.wired.com/print/')

## Comment out the feeds you don't want retrieved,
## or add any new RSS feed URLs here

	def get_feeds(self):
		return	[
		('Top News', 'http://feeds.wired.com/wired/index'),
		('Culture', 'http://feeds.wired.com/wired/culture'),
		('Software', 'http://feeds.wired.com/wired/software'),
		('Mac', 'http://feeds.feedburner.com/cultofmac/bFow'),
		('Gadgets', 'http://feeds.wired.com/wired/gadgets'),
		('Cars', 'http://feeds.wired.com/wired/cars'),
		('Entertainment', 'http://feeds.wired.com/wired/entertainment'),
		('Gaming', 'http://feeds.wired.com/wired/gaming'),
		('Science', 'http://feeds.wired.com/wired/science'),
		('Med Tech', 'http://feeds.wired.com/wired/medtech'),
		('Politics', 'http://feeds.wired.com/wired/politics'),
		('Tech Biz', 'http://feeds.wired.com/wired/techbiz'),
		('Commentary', 'http://feeds.wired.com/wired/commentary'),
		]
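
To see what the preprocess_regexps entries do to a page, here is a small stand-alone sketch in plain Python. It assumes the rules are applied in list order with a substitution over the whole page, which is how the list reads but is not confirmed in this thread. The preprocess helper and the sample HTML are made up, and only one alternative from each of the first two rules is kept.

```python
import re

# Trimmed copies of the first two rules from the profile, same flags.
rules = [(re.compile(pattern, re.IGNORECASE | re.DOTALL), func)
         for pattern, func in [
    (r'<body.*?<div id="article_body">', lambda match: '<body><div>'),
    (r'<!-- end article content -->.*?</html>',
     lambda match: '</div></body></html>'),
]]

def preprocess(html, rules):
    # Assumption: each rule runs once over the whole page, in list order.
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html

sample = ('<html><body><div id="nav">ads</div><div id="article_body">'
          '<p>Story</p><!-- end article content -->'
          '<div>comments</div></body></html>')

cleaned = preprocess(sample, rules)
print(cleaned)
```

Everything between the opening body tag and the article marker is dropped by the first rule, and everything after the end-of-article comment by the second, so only the story paragraph is left inside a single div.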

Last edited by DaveNB; 11-28-2007 at 02:24 AM. Reason: Updated to v0.6 - Improved font size rendering (by not using the original .css), improved inline image and link name handling
Old 11-26-2007, 10:52 AM   #71
veshman
Member
veshman began at the beginning.
 
Posts: 13
Karma: 10
Join Date: Nov 2007
Device: Sony 505
Dave,

Thanks! I'll give it a try and post my results.

bhavesh
Old 11-26-2007, 10:55 AM   #72
veshman
Member
veshman began at the beginning.
 
Posts: 13
Karma: 10
Join Date: Nov 2007
Device: Sony 505
Kovid,

Thanks for the link, it is very helpful. I'll try out a couple of the expressions.

bhavesh
Old 11-28-2007, 01:18 AM   #73
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 43,835
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Version 0.4.25 finally implements support for The Economist. See the demo attached to the first post.
Old 11-28-2007, 01:31 AM   #74
DaveNB
Connoisseur
DaveNB has a complete set of Star Wars action figures.
 
Posts: 86
Karma: 399
Join Date: Jun 2007
Device: Nook, Sony PRS-500, Nokia 770 (FBReader)
Version 0.6 of Wired.py posted

I edited the previous post to reflect the changes in the source code for the newest wired.py User Profile for web2lrf.

There is a major improvement in the rendering and placement of inline images, and in the display of inline hyperlinked phrases and words.

However, there are still text-encoding issues with apostrophes. Sometimes Wired uses a simple vertical tick, and sometimes it uses the curly apostrophe whose tail curves down to the left; the latter renders strangely as three international characters on the Sony Reader. Version 0.6 attempts to fix this, but so far I can't find the right character/hex sequence for the problematic apostrophe (the right single quote) in order to substitute it out.
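
A possible explanation for the three characters, sketched in Python 3 (the profile itself targets the Python 2 of the time): the right single quote U+2019 is three bytes in UTF-8, and if those bytes are read with a single-byte Western encoding, each one becomes its own character. Windows-1252 is used below as a stand-in for whatever the reader actually does with the bytes, which is an assumption, and the sample text is made up.

```python
right_quote = '\u2019'                      # the curly apostrophe

utf8_bytes = right_quote.encode('utf-8')    # three bytes: E2 80 99
misread = utf8_bytes.decode('cp1252')       # one character per byte

# Reversing the mistake recovers the original character.
repaired = misread.encode('cp1252').decode('utf-8')

# On raw, undecoded page bytes the apostrophe can be matched as a byte sequence.
raw = b"Wired\xe2\x80\x99s feed"
plain = raw.replace(b'\xe2\x80\x99', b"'")

print(utf8_bytes, len(misread), repaired == right_quote, plain)
```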

Give it a try, and let me know if anyone can figure out how to fix the apostrophe problem.

Dave

Last edited by DaveNB; 11-28-2007 at 03:55 AM.
Old 11-28-2007, 01:56 AM   #75
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 43,835
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The problem with Wired is that the files are encoded in UTF-8 but declare their encoding as ISO-8859-1. You can either:
1) Contact Wired
2) Write a preprocess regexp that changes the declared encoding
Code:
(r'<meta http-equiv="Content-Type" content="text/html; charset=(\S+)"',
 lambda match : match.group().replace(match.group(1), 'UTF-8'))
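
As a quick check of what that rule does, here is a stand-alone sketch in plain Python. The sample meta tag is a guess at what a page might contain, and the flags mirror the ones the Wired profile applies to every preprocess rule.

```python
import re

pattern = re.compile(
    r'<meta http-equiv="Content-Type" content="text/html; charset=(\S+)"',
    re.IGNORECASE | re.DOTALL)

def fix_charset(match):
    # Swap only the captured charset name inside the matched tag text.
    return match.group().replace(match.group(1), 'UTF-8')

# Hypothetical tag with the wrong declared encoding.
head = ('<meta http-equiv="Content-Type" '
        'content="text/html; charset=iso-8859-1" />')

fixed = pattern.sub(fix_charset, head)
print(fixed)
```

The capture group stops at the closing quote because \S+ has to give that quote back for the rest of the pattern to match, so only the charset name is replaced.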