#91 | Translating Calibre...
That's it. Thanks.

By the way, how can I prevent articles that have no publication date from being skipped?
#92 | creator of calibre
			I'm not sure what you mean? You want to include articles that don't have a publication date? In that case, the only way to do it is to redefine the parse_feeds function in your profile.
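For reference, a minimal sketch of what such an override might look like, assuming DefaultProfile.parse_feeds can simply be called and its result post-processed; the exact structure it returns (and therefore the filtering step) needs to be checked against the libprs500 source:

Code:
from libprs500.ebooks.lrf.web.profiles import DefaultProfile

class MyProfile(DefaultProfile):

    def parse_feeds(self):
        # Reuse the default feed parsing, then post-process the result so
        # that articles without a publication date are kept rather than
        # skipped. The shape of `feeds` is an assumption here; check the
        # DefaultProfile source for what parse_feeds really returns.
        feeds = DefaultProfile.parse_feeds(self)
        # ... adjust `feeds` as needed ...
        return feeds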
#93 | Translating Calibre...
Kovid, I tried to get spiegelde.py running.

spiegelde.py: Code:
	from libprs500.ebooks.lrf.web.profiles import DefaultProfile
import re
class SpiegelOnline(DefaultProfile): 
    
    title = 'Spiegel Online' 
    timefmt = ' [ %Y-%m-%d %a]'
    max_recursions = 1
    max_articles_per_feed = 40
    html_description = True
    no_stylesheets = True
    
    def get_feeds(self): 
        return [ ('Spiegel Online', 'http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml') ] 
    
    def print_version(self,url):
        tokens = url.split(',') 
        tokens[-2:-1] = ['-druck']
        return ','.join(tokens)
But the spiegel.de RSS feed gives the publication time only in the form "Heute um 20:00 Uhr" (that means "Today at 8 p.m."). See: http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml
#94 | creator of calibre
Then you will have to redefine the function strptime. The function takes a string argument and should return the number of seconds since the epoch (Jan 1 1970) in the GMT time zone.

Something like this: Code:
def strptime(self, src):
    # Some code to convert the string src into a datetime
    # This is a dummy implementation that just returns the current time
    # (the profile also needs `import time` at the top for this to work)
    return time.time()
Last edited by kovidgoyal; 12-02-2007 at 05:00 PM.
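For the "Heute um 20:00 Uhr" style dates mentioned above, a non-dummy strptime might look roughly like the sketch below. The 'Heute'/'Gestern' patterns are assumptions about what the feed actually contains, and the time-zone handling is simplified (time.mktime works in local time rather than GMT):

Code:
import re, time
from datetime import datetime, timedelta

def strptime(self, src):
    # Sketch: turn "Heute um 20:00 Uhr" / "Gestern um 08:15 Uhr" into
    # seconds since the epoch; anything unrecognized falls back to "now".
    m = re.match(r'\s*(Heute|Gestern) um (\d{1,2}):(\d{2})', src)
    if m:
        day = datetime.now()
        if m.group(1) == 'Gestern':
            day -= timedelta(days=1)
        stamp = day.replace(hour=int(m.group(2)), minute=int(m.group(3)),
                            second=0, microsecond=0)
        return time.mktime(stamp.timetuple())
    return time.time()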
#95 | Translating Calibre...
Seems to be hard work; I'll try to configure it in the next few days...

Can't I just tell web2lrf to take all the articles shown? There seem to be only around 40-50 articles at spiegel.de.
#96 | creator of calibre
Just define the dummy strptime function as shown above and that will do it.

#97 | Translating Calibre...
Sorry, I'm getting the same error...

Code:
'''
Fetch Spiegel Online.
'''
from libprs500.ebooks.lrf.web.profiles import DefaultProfile
import re
import time

class SpiegelOnline(DefaultProfile):

    title = 'Spiegel Online'
    timefmt = ' [ %Y-%m-%d %a]'
    max_recursions = 2
    max_articles_per_feed = 40
#    html_description = True
#    no_stylesheets = True

    def get_feeds(self):
        return [ ('Spiegel Online', 'http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml') ]

    def strptime(self, src):
        # Some code to convert the string src into a datetime
        # This is a dummy implementation that just returns the current time
        return time.time()

    def print_version(self, url):
        tokens = url.split(',')
        tokens[-2:-1] = ['-druck']
        return ','.join(tokens)
#98 | creator of calibre
Ah, I see that the feed has no publication date. OK, I've added a use_pubdate variable (in svn). Set it to False to prevent web2lrf from trying to figure out the publication date:

Code:
use_pubdate = False
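In the SpiegelOnline profile above, that is just another class attribute (assuming you are running an svn build that already includes the new variable):

Code:
class SpiegelOnline(DefaultProfile):

    title = 'Spiegel Online'
    use_pubdate = False   # requires an svn build that knows this attribute
    # ... rest of the profile as above ...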
#99 | Groupie

Wall Street Journal
I have a profile set up for WSJ.com. I'm trying to get it configured to work with subscription content (only for those who have a valid paid subscription, of course).

The problem is that WSJ.com does not allow multiple concurrent logins. If it detects them, your account is locked until you call customer service. So the first time I logged in through the web2lrf profile, everything worked and downloaded properly. However, every subsequent time I tried using the profile, the login didn't work (the account was locked), so only non-subscription content was downloaded. To prevent this, I believe one needs to log out of the site before exiting web2lrf. Is there a way to log out of a site using web2lrf? Perhaps the same kind of functionality as the login, but processed at the end of the run instead of at the beginning.

This dilemma also applies to the Barrons.com site (since it is under the same umbrella as WSJ.com). My profile for it only worked a couple of times before I got locked out of the site. Thanks for your help with this. (.txt extension added to facilitate the upload)
#100 | creator of calibre
I've added a cleanup method to the profile that's called after the LRF file has been generated. You can use self.browser to log out in that method.

#101 | Groupie
I'm going to need some help on the proper code to use, though, due to my ignorance of Python. Would adding something like this to my profile work?

Code:
	        def cleanup(self): 
                return  [
                self.browser.open('http://online.barrons.com/logout') 
                ]
One other question for you, if you don't mind: how do you add the --ignore-tables option to the profile, so you don't have to specify it on the command line every time you use the profile? Thanks again.
#102 | creator of calibre
Yeah, that should do it; no need to return anything, though.

For --ignore-tables, use Code:
html2lrf_options = ['--ignore-tables']
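Putting the two suggestions together, the relevant part of a Barrons profile might look like the sketch below; the logout URL is taken from the post above, so treat it as an assumption about the site rather than a verified endpoint:

Code:
from libprs500.ebooks.lrf.web.profiles import DefaultProfile

class Barrons(DefaultProfile):

    title = 'Barrons'
    # extra options passed straight through to html2lrf
    html2lrf_options = ['--ignore-tables']

    def cleanup(self):
        # called after the LRF has been generated; logging out keeps the
        # site from seeing a second concurrent login on the next run
        self.browser.open('http://online.barrons.com/logout')

(As it turns out a few posts later in the thread, the release current at the time had a regression where html2lrf_options was not applied, so that particular option may only take effect in a newer build.)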
#103 | Translating Calibre...
The snippet you gave me replaces the numbers between the second-last comma and the last comma with "druck-". But those numbers should remain, and "druck-" should be added in front of them, right after the second-last comma.

The original link: Code:
http://www.spiegel.de/panorama/justiz/0,1518,521183,00.html

The print version should be: Code:
http://www.spiegel.de/panorama/justiz/0,1518,druck-521183,00.html

But the snippet produces: Code:
http://www.spiegel.de/panorama/justiz/0,1518,druck-,00.html
#104 | Groupie
I added the cleanup method like this: Code:
        def cleanup(self): 
                self.browser.open('http://online.barrons.com/logout')

but the process seems to hang at the end. I also tried restricting which links are followed with match_regexps, matching against the <a> tags, e.g.: Code:
match_regexps = ['<a.*?mod=.*?>']

and Code:
match_regexps = ['<a.*?online.barrons.com.*?>']

but that doesn't seem to work either. Finally, I tried using html2lrf_options before (and again now), and it doesn't seem to give the same output that is generated when specifying --ignore-tables on the command line. Not sure why.
#105 | creator of calibre
@StDo

Oops, sorry. Here you go: Code:
	def print_version(self,url):
    tokens = url.split(',')
    tokens[-2:-2] = ['druck|']
    return ','.join(tokens).replace('|,','-')
match_regexp works on the contents of the href attribute, i.e. the URL itself, not on the <a> tag. As for html2lrf_options, that looks like a regression; they aren't being applied. It will be fixed in the next release. Not sure why the cleanup code should hang; I'll look at that later.
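A quick way to check that the corrected print_version behaves as intended is to run the token manipulation on the example URL from post #103 (this is just a standalone check, not part of the profile):

Code:
url = 'http://www.spiegel.de/panorama/justiz/0,1518,521183,00.html'
tokens = url.split(',')
tokens[-2:-2] = ['druck|']   # insert a marker before the article number
print(','.join(tokens).replace('|,', '-'))
# -> http://www.spiegel.de/panorama/justiz/0,1518,druck-521183,00.html

And since match_regexps is matched against the link URL rather than the <a> tag, a pattern along the lines of ['online.barrons.com'] would be the direction to try for the Barrons profile; how exactly the patterns are applied should be checked against the libprs500 source.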
Tags: libprs500, web2lrf