#91 | Translating Calibre...
That's it. Thanks.

By the way, how can I prevent articles that have no publication date from being skipped?
#92 | creator of calibre
			I'm not sure what you mean? You want to include articles that don't have a publication date? In that case, the only way to do it is to redefine the parse_feeds function in your profile.
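For reference, a minimal sketch of what such an override might look like, assuming DefaultProfile.parse_feeds can simply be called and its result post-processed; the exact structure it returns (and therefore the filtering step) needs to be checked against the libprs500 source:

Code:
from libprs500.ebooks.lrf.web.profiles import DefaultProfile

class MyProfile(DefaultProfile):

    def parse_feeds(self):
        # Reuse the default feed parsing, then post-process the result so
        # that articles without a publication date are kept rather than
        # skipped. The shape of `feeds` is an assumption here; check the
        # DefaultProfile source for what parse_feeds really returns.
        feeds = DefaultProfile.parse_feeds(self)
        # ... adjust `feeds` as needed ...
        return feeds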
#93 | Translating Calibre...
Kovid, I tried to get spiegelde.py running.

spiegelde.py: Code:
	from libprs500.ebooks.lrf.web.profiles import DefaultProfile
import re
class SpiegelOnline(DefaultProfile): 
    
    title = 'Spiegel Online' 
    timefmt = ' [ %Y-%m-%d %a]'
    max_recursions = 1
    max_articles_per_feed = 40
    html_description = True
    no_stylesheets = True
    
    def get_feeds(self): 
        return [ ('Spiegel Online', 'http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml') ] 
    
    def print_version(self,url):
        tokens = url.split(',') 
        tokens[-2:-1] = ['-druck']
        return ','.join(tokens)
But the spiegel.de RSS feed gives the publication time only in the form "Heute um 20:00 Uhr" (that means "Today at 8 p.m."). See: http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml
#94 | creator of calibre
Then you will have to redefine the function strptime. The function takes a string argument and should return the number of seconds since the epoch (Jan 1 1970) in the GMT time zone.

Something like this: Code:
def strptime(self, src):
    # Some code to convert the string src into a datetime
    # This is a dummy implementation that just returns the current time
    # (the profile also needs `import time` at the top for this to work)
    return time.time()
Last edited by kovidgoyal; 12-02-2007 at 05:00 PM.
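For the "Heute um 20:00 Uhr" style dates mentioned above, a non-dummy strptime might look roughly like the sketch below. The 'Heute'/'Gestern' patterns are assumptions about what the feed actually contains, and the time-zone handling is simplified (time.mktime works in local time rather than GMT):

Code:
import re, time
from datetime import datetime, timedelta

def strptime(self, src):
    # Sketch: turn "Heute um 20:00 Uhr" / "Gestern um 08:15 Uhr" into
    # seconds since the epoch; anything unrecognized falls back to "now".
    m = re.match(r'\s*(Heute|Gestern) um (\d{1,2}):(\d{2})', src)
    if m:
        day = datetime.now()
        if m.group(1) == 'Gestern':
            day -= timedelta(days=1)
        stamp = day.replace(hour=int(m.group(2)), minute=int(m.group(3)),
                            second=0, microsecond=0)
        return time.mktime(stamp.timetuple())
    return time.time()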
#95 | Translating Calibre...
Seems to be hard work; I'll try to configure it in the next few days...

Can't I just tell web2lrf to take all the articles shown? There seem to be only around 40-50 articles at spiegel.de.
#96 | creator of calibre
Just define the dummy strptime function as shown above and that will do it.

#97 | Translating Calibre...
Sorry, I'm getting the same error...

Code:
'''
Fetch Spiegel Online.
'''
from libprs500.ebooks.lrf.web.profiles import DefaultProfile
import re
import time

class SpiegelOnline(DefaultProfile):

    title = 'Spiegel Online'
    timefmt = ' [ %Y-%m-%d %a]'
    max_recursions = 2
    max_articles_per_feed = 40
#    html_description = True
#    no_stylesheets = True

    def get_feeds(self):
        return [ ('Spiegel Online', 'http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml') ]

    def strptime(self, src):
        # Some code to convert the string src into a datetime
        # This is a dummy implementation that just returns the current time
        return time.time()

    def print_version(self, url):
        tokens = url.split(',')
        tokens[-2:-1] = ['-druck']
        return ','.join(tokens)
#98 | creator of calibre
Ah, I see that the feed has no publication date. OK, I've added a use_pubdate variable (in svn). Set it to False to prevent web2lrf from trying to figure out the publication date:

Code:
use_pubdate = False
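In the SpiegelOnline profile above, that is just another class attribute (assuming you are running an svn build that already includes the new variable):

Code:
class SpiegelOnline(DefaultProfile):

    title = 'Spiegel Online'
    use_pubdate = False   # requires an svn build that knows this attribute
    # ... rest of the profile as above ...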
#99 | Groupie

Wall Street Journal
I have a profile set up for WSJ.com. I'm trying to get it configured to work with subscription content (only for those who have a valid paid subscription, of course).

The problem is that WSJ.com does not allow multiple concurrent logins. If it detects them, your account is locked until you call customer service. So the first time I logged in through the web2lrf profile, everything worked and downloaded properly. However, every subsequent time I tried using the profile, the login didn't work (the account was locked), so only non-subscription content was downloaded. To prevent this, I believe one needs to log out of the site before exiting web2lrf. Is there a way to log out of a site using web2lrf? Perhaps the same kind of functionality as the login, but processed at the end of the run instead of at the beginning.

This dilemma also applies to the Barrons.com site (since it is under the same umbrella as WSJ.com). My profile for it only worked a couple of times before I got locked out of the site. Thanks for your help with this. (.txt extension added to facilitate the upload)
#100 | creator of calibre
I've added a cleanup method to the profile that's called after the LRF file has been generated. You can use self.browser to log out in that method.

#101 | Groupie
I'm going to need some help on the proper code to use, though, due to my ignorance of Python. Would adding something like this to my profile work?

Code:
	        def cleanup(self): 
                return  [
                self.browser.open('http://online.barrons.com/logout') 
                ]
One other question for you, if you don't mind: how do you add the --ignore-tables option to the profile, so you don't have to specify it on the command line every time you use the profile? Thanks again.
#102 | creator of calibre
Yeah, that should do it; no need to return anything, though.

For --ignore-tables, use Code:
html2lrf_options = ['--ignore-tables']
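Putting the two suggestions together, the relevant part of a Barrons profile might look like the sketch below; the logout URL is taken from the post above, so treat it as an assumption about the site rather than a verified endpoint:

Code:
from libprs500.ebooks.lrf.web.profiles import DefaultProfile

class Barrons(DefaultProfile):

    title = 'Barrons'
    # extra options passed straight through to html2lrf
    html2lrf_options = ['--ignore-tables']

    def cleanup(self):
        # called after the LRF has been generated; logging out keeps the
        # site from seeing a second concurrent login on the next run
        self.browser.open('http://online.barrons.com/logout')

(As it turns out a few posts later in the thread, the release current at the time had a regression where html2lrf_options was not applied, so that particular option may only take effect in a newer build.)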
#103 | Translating Calibre...
The snippet you gave me replaces the numbers between the second-last comma and the last comma with "druck-". But those numbers should remain, and "druck-" should be added in front of them, right after the second-last comma.

The original link: Code:
http://www.spiegel.de/panorama/justiz/0,1518,521183,00.html

The print version should be: Code:
http://www.spiegel.de/panorama/justiz/0,1518,druck-521183,00.html

But the snippet produces: Code:
http://www.spiegel.de/panorama/justiz/0,1518,druck-,00.html
#104 | Groupie
I added the cleanup method like this: Code:
        def cleanup(self): 
                self.browser.open('http://online.barrons.com/logout')

but the process seems to hang at the end. I also tried restricting which links are followed with match_regexps, matching against the <a> tags, e.g.: Code:
match_regexps = ['<a.*?mod=.*?>']

and Code:
match_regexps = ['<a.*?online.barrons.com.*?>']

but that doesn't seem to work either. Finally, I tried using html2lrf_options before (and again now), and it doesn't seem to give the same output that is generated when specifying --ignore-tables on the command line. Not sure why.
#105 | creator of calibre
@StDo

Oops, sorry. Here you go: Code:
	def print_version(self,url):
    tokens = url.split(',')
    tokens[-2:-2] = ['druck|']
    return ','.join(tokens).replace('|,','-')
match_regexp works on the contents of the href attribute, i.e. the URL itself, not on the <a> tag. As for html2lrf_options, that looks like a regression; they aren't being applied. It will be fixed in the next release. Not sure why the cleanup code should hang; I'll look at that later.
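A quick way to check that the corrected print_version behaves as intended is to run the token manipulation on the example URL from post #103 (this is just a standalone check, not part of the profile):

Code:
url = 'http://www.spiegel.de/panorama/justiz/0,1518,521183,00.html'
tokens = url.split(',')
tokens[-2:-2] = ['druck|']   # insert a marker before the article number
print(','.join(tokens).replace('|,', '-'))
# -> http://www.spiegel.de/panorama/justiz/0,1518,druck-521183,00.html

And since match_regexps is matched against the link URL rather than the <a> tag, a pattern along the lines of ['online.barrons.com'] would be the direction to try for the Barrons profile; how exactly the patterns are applied should be checked against the libprs500 source.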
Tags: libprs500, web2lrf