12-02-2007, 02:58 PM | #91 | |
Translating Calibre...
Posts: 657
Karma: 2902
Join Date: Aug 2007
Location: ER.de
Device: [PRS-500], PB360
|
That's it. Thanks.
By the way, how can I handle the skipping of articles that have no publication date?
|
|
12-02-2007, 03:30 PM | #92 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I'm not sure what you mean? You want to include articles that don't have a publication date? In that case, the only way to do it is to redefine the parse_feeds function in your profile.
|
12-02-2007, 03:50 PM | #93 |
Translating Calibre...
Posts: 657
Karma: 2902
Join Date: Aug 2007
Location: ER.de
Device: [PRS-500], PB360
|
Kovid, I tried to get spiegelde.py running.
spiegelde.py: Code:
from libprs500.ebooks.lrf.web.profiles import DefaultProfile
import re

class SpiegelOnline(DefaultProfile):

    title = 'Spiegel Online'
    timefmt = ' [ %Y-%m-%d %a]'
    max_recursions = 1
    max_articles_per_feed = 40
    html_description = True
    no_stylesheets = True

    def get_feeds(self):
        return [ ('Spiegel Online', 'http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml') ]

    def print_version(self,url):
        tokens = url.split(',')
        tokens[-2:-1] = ['-druck']
        return ','.join(tokens)
But the spiegel.de RSS feed shows the time only in the form "Heute um 20:00 Uhr" (that means: "Today at 8 p.m."). See: http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml
12-02-2007, 03:56 PM | #94 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Then you will have to redefine the function strptime. The function takes a string argument and should return the number of seconds since the epoch (Jan 1 1970) in the GMT time zone.
Something like: Code:
# requires "import time" at the top of the profile file
def strptime(self, src):
    # Some code to convert the string src into a datetime
    # This is a dummy implementation that just returns the current time
    return time.time()
Last edited by kovidgoyal; 12-02-2007 at 04:00 PM.
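For a feed that only says "Heute um 20:00 Uhr", a non-dummy version could pull the clock time out of the string and attach today's date. The following is only a rough sketch: the helper name parse_spiegel_time and its today parameter are made up for illustration, and it treats the feed time as GMT even though spiegel.de presumably reports German local time, so the result can be off by an hour or two.

```python
import calendar
import re
import time


def parse_spiegel_time(src, today=None):
    """Convert a string like 'Heute um 20:00 Uhr' to seconds since the epoch.

    Assumption: the clock time is interpreted as GMT on the date given by
    `today` (a time.struct_time, defaulting to the current GMT date).
    """
    match = re.search(r'(\d{1,2}):(\d{2})', src)
    if match is None:
        # No clock time found: fall back to the current time,
        # like the dummy implementation.
        return time.time()
    hour, minute = int(match.group(1)), int(match.group(2))
    if today is None:
        today = time.gmtime()
    stamp = (today.tm_year, today.tm_mon, today.tm_mday, hour, minute, 0)
    # calendar.timegm treats the tuple as GMT and returns epoch seconds.
    return calendar.timegm(stamp)


print(parse_spiegel_time('Heute um 20:00 Uhr'))
```

Inside a profile, strptime(self, src) would then simply return parse_spiegel_time(src).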
12-02-2007, 04:51 PM | #95 |
Translating Calibre...
Posts: 657
Karma: 2902
Join Date: Aug 2007
Location: ER.de
Device: [PRS-500], PB360
|
Seems to be hard work, I will try to configure it in a few days...
Can't I tell web2lrf to simply take all the articles shown? There seem to be only about 40-50 articles at spiegel.de.
12-02-2007, 04:55 PM | #96 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Just define the dummy strptime function as shown above and that will do this.
|
12-02-2007, 05:14 PM | #97 |
Translating Calibre...
Posts: 657
Karma: 2902
Join Date: Aug 2007
Location: ER.de
Device: [PRS-500], PB360
|
Sorry, getting the same error...
Code:
'''
Fetch Spiegel Online.
'''

from libprs500.ebooks.lrf.web.profiles import DefaultProfile
import re
import time  # needed by strptime below

class SpiegelOnline(DefaultProfile):

    title = 'Spiegel Online'
    timefmt = ' [ %Y-%m-%d %a]'
    max_recursions = 2
    max_articles_per_feed = 40
    # html_description = True
    # no_stylesheets = True

    def get_feeds(self):
        return [ ('Spiegel Online', 'http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml') ]

    def strptime(self, src):
        # Some code to convert the string src into a datetime
        # This is a dummy implementation that just returns the current time
        return time.time()

    def print_version(self,url):
        tokens = url.split(',')
        tokens[-2:-1] = ['-druck']
        return ','.join(tokens)
12-02-2007, 05:28 PM | #98 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Ah, I see that the feed has no publication date. OK. I've added a use_pubdate variable (in svn). Set it to False to prevent web2lrf from trying to figure out the publication date.
Code:
use_pubdate = False
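For orientation, here is roughly where that setting would go. This is a sketch under the assumption that use_pubdate is a class-level attribute like timefmt and max_articles_per_feed; the fallback to object is only a stand-in so the snippet can be run without libprs500 installed.

```python
try:
    from libprs500.ebooks.lrf.web.profiles import DefaultProfile
except ImportError:
    # Stand-in base class, used only when libprs500 is not available.
    DefaultProfile = object


class SpiegelOnline(DefaultProfile):

    title = 'Spiegel Online'
    timefmt = ' [ %Y-%m-%d %a]'
    max_recursions = 2
    max_articles_per_feed = 40
    # The feed has no publication date, so tell web2lrf not to look for one.
    use_pubdate = False

    def get_feeds(self):
        return [
            ('Spiegel Online',
             'http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml'),
        ]
```

With the date lookup switched off, the dummy strptime method should no longer be needed.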
12-03-2007, 05:33 AM | #99 |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
Wall Street Journal
I have a profile set up for WSJ.com. I'm trying to get it configured to work with subscription content (only for those that have a valid paid subscription, of course).
The problem is that WSJ.com does not allow multiple concurrent logins. If it detects them, your account is locked until you call customer service. So the first time I logged in through the web2lrf profile, everything worked and downloaded properly. However, every subsequent time I tried using the profile, the login didn't work (the account was locked), so only non-subscription content was downloaded.
In order to prevent this, I believe one needs to log out of the site before exiting web2lrf. Is there a way to log out of a site using web2lrf? Perhaps the same kind of functionality as the login, but processed at the end of the run instead of the beginning.
This dilemma also applies to the Barrons.com site (since it is under the same umbrella as WSJ.com). My profile for it only worked a couple of times before I got locked out of the site.
Thanks for your help with this.
(.txt extension added to facilitate the upload)
12-03-2007, 11:55 AM | #100 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I've added a cleanup method to the profile that's called after the LRF file has been generated. You can use self.browser to logout in that method.
|
12-03-2007, 04:26 PM | #101 | |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
I'm going to need some help on the proper code to use, though, due to my ignorance of Python. Would adding something like this to my profile work? Code:
def cleanup(self):
    return [ self.browser.open('http://online.barrons.com/logout') ]
One other question for you, if you don't mind. How do you add the --ignore-tables option to the profile, so you don't have to specify it on the command line every time you use the profile?
Thanks again.
|
12-03-2007, 05:12 PM | #102 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Yeah that should do it, no need to return anything though.
Use Code:
html2lrf_options = ['--ignore-tables']
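Put together, the two answers might look like this in a profile. It is a sketch only: RecordingBrowser and BarronsSketch are hypothetical names used so the snippet runs on its own. A real profile would subclass DefaultProfile and use the self.browser that web2lrf provides.

```python
class RecordingBrowser(object):
    """Hypothetical stand-in for self.browser; it only records opened URLs."""

    def __init__(self):
        self.opened = []

    def open(self, url):
        self.opened.append(url)


class BarronsSketch(object):
    """Stand-in for a profile class; a real one would subclass DefaultProfile."""

    # Extra options handed to html2lrf, so they need not be typed each time.
    html2lrf_options = ['--ignore-tables']

    def __init__(self, browser):
        self.browser = browser

    def cleanup(self):
        # Called after the LRF file has been generated.
        # Nothing needs to be returned.
        self.browser.open('http://online.barrons.com/logout')


browser = RecordingBrowser()
profile = BarronsSketch(browser)
profile.cleanup()
print(browser.opened)
```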
12-03-2007, 05:42 PM | #103 | |
Translating Calibre...
Posts: 657
Karma: 2902
Join Date: Aug 2007
Location: ER.de
Device: [PRS-500], PB360
|
That snippet you gave me replaces the numbers between the second-to-last comma and the last comma with "druck-". But the numbers should remain, and "druck-" should be inserted in front of them, right after the second-to-last comma. The original link: Code:
http://www.spiegel.de/panorama/justiz/0,1518,521183,00.html
What the print version link should look like: Code:
http://www.spiegel.de/panorama/justiz/0,1518,druck-521183,00.html
What the snippet produces: Code:
http://www.spiegel.de/panorama/justiz/0,1518,druck-,00.html
|
12-03-2007, 07:03 PM | #104 | |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
Code:
def cleanup(self):
    self.browser.open('http://online.barrons.com/logout')
Code:
match_regexps = ['<a.*?mod=.*?>']
Code:
match_regexps = ['<a.*?online.barrons.com.*?>']
Finally, I tried using html2lrf_options before (and again now), and it doesn't seem to give the same output that is generated when specifying --ignore-tables on the command line. Not sure why.
|
12-03-2007, 08:05 PM | #105 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
@StDo
Oops, sorry. Here you go: Code:
def print_version(self,url):
    tokens = url.split(',')
    tokens[-2:-2] = ['druck|']
    return ','.join(tokens).replace('|,','-')
match_regexps works on the contents of the href attribute, i.e. the URL itself, not on the <a> tag. As for html2lrf_options, it looks like a regression: they aren't being applied. That will be fixed in the next release. I'm not sure why the cleanup code should hang; I'll look at that later.
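The same URL rewrite can also be sketched without the temporary "|" placeholder, by prefixing the second-to-last token directly. It is written here as a plain function so it can be tried outside a profile. The guard for URLs without a comma is an added assumption, not something from the thread.

```python
def print_version(url):
    """Prefix the second-to-last comma-separated token of url with 'druck-'."""
    tokens = url.split(',')
    if len(tokens) < 2:
        # Assumption: URLs without commas are returned unchanged.
        return url
    tokens[-2] = 'druck-' + tokens[-2]
    return ','.join(tokens)


original = 'http://www.spiegel.de/panorama/justiz/0,1518,521183,00.html'
print(print_version(original))
```

For the example link this gives the druck-521183 form. In a profile the function would take self as its first argument, as in the version above.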