View Full Version : web2lrf



kovidgoyal
07-13-2007, 08:44 PM
Building on my work with web2disk, here's web2lrf (part of libprs500 (http://libprs500.kovidgoyal.net) v0.3.70).

It directly converts websites into LRF files. More than that, it has support for profiles that allow it to preprocess websites to generate better-looking LRF files. Right now it knows about the New York Times, the BBC, The Economist and Newsweek (see attached demos).

To use it with a profile:

web2lrf profilename


For example, for Newsweek:

web2lrf newsweek


For The New York Times

web2lrf --username myusername --password mypassword nytimes


To create your own profile and use it with web2lrf visit
https://libprs500.kovidgoyal.net/wiki/UserProfiles for instructions and examples.

To use it with an arbitrary website (it won't do any preprocessing):

web2lrf --url http://mywebsite.com default


Enjoy!

kovidgoyal
07-14-2007, 04:52 PM
Released v0.3.62 with a newsweek profile. See the attached demo in the first post. Since I actually read Newsweek, I've taken a little more care over this profile; it has a nice hierarchical TOC.

Ironic since for some reason I haven't been getting my newsweeks for the past month ;-)

RWood
07-14-2007, 06:17 PM
I quickly uninstalled the prior version and installed the new one. Tried it.

"TypeError: option_parser takes() takes no arguments (1 given)"

Even the nytimes demo gave the same result.

kovidgoyal
07-14-2007, 06:26 PM
Oops, typo... released 0.3.73 with the fix. It'll take about 20 minutes to reach the servers.

ddavtian
07-15-2007, 02:03 AM
Kovidgoyal, thanks a lot for all your work.
NYTimes and BBC work fine for me; Newsweek gives an error message.
I'm looking forward to more profiles :-)

C:\Temp>web2lrf newsweek
Fetching feeds... done
Downloading .WARNING: Could not fetch link file://c:\docume~1\davidd~1\locals~1\
temp\libprs500oezp07\index.html

file://c:\docume~1\davidd~1\locals~1\temp\libprs500oezp07\index.html saved to
Traceback (most recent call last):
File "convert_from.py", line 124, in <module>
File "convert_from.py", line 116, in main
File "convert_from.py", line 74, in create_lrf
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 1233, in process_file
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 1431, in get_path
File "libprs500\__init__.pyo", line 74, in extract
Exception: Unknown archive type

kovidgoyal
07-15-2007, 03:48 AM
windows strikes again! fixed in 0.3.74.

As for new profiles, nothing planned as my needs are met. But feel free to contribute :-)

Platapie
07-15-2007, 09:33 AM
You my friend, are a god. I just ordered my PRS-500 (not here yet but I can't wait), and was preemptively searching for ways to do this-- realize I'm jumping the gun here, but thanks so much.

Also great to see it offered for Linux :)

JSWolf
07-15-2007, 09:59 AM
windows strikes again! fixed in 0.3.74.

As for new profiles, nothing planned as my needs are met. But feel free to contribute :-)
Version 0.3.74 is not up on the website. Only 0.3.73.

RWood
07-15-2007, 12:50 PM
Still coming up 0.3.73 for me.

kovidgoyal
07-15-2007, 02:36 PM
Sorry uploading now.

kovidgoyal
07-15-2007, 02:39 PM
You my friend, are a god. I just ordered my PRS-500 (not here yet but I can't wait), and was preemptively searching for ways to do this-- realize I'm jumping the gun here, but thanks so much.

Also great to see it offered for Linux :)

It was developed for Linux; Windows and OS X support came later. Indeed it would not have been possible without all the other great free software that's been developed for Linux.

ddavtian
07-16-2007, 05:17 PM
Kovidgoyal, Newsweek now works for Windows, but the output is a small RSS-type file. Articles are not pulled.

I have a few basic questions; sorry, I couldn't find the answers in the many pages of stickies.
How can I make the font smaller? Do I need to load more (or different) fonts onto the reader, or is it a setting somewhere for web2lrf?

Where can I find the profiles for NYTimes (or BBC or Newsweek)? I'd like to use it as a good template for other sites.

Thanks in advance,
David (not a power user)

kovidgoyal
07-16-2007, 08:33 PM
You can see the profiles here https://libprs500.kovidgoyal.net/browser/trunk/src/libprs500/ebooks/lrf/web?order=name

The newsweek error will be fixed in the next release.

Platapie
07-20-2007, 03:06 PM
It was developed for Linux; Windows and OS X support came later. Indeed it would not have been possible without all the other great free software that's been developed for Linux.

Fair enough. Well, what can I say? I finally got my machine/the time to play with it and it's fantastic. The only issue I had was installing convertlit from source -- according to the README it needs "LIBTOMMATH" to compile, but the address listed in the README is incorrect; instead one needs to get LibTom from http://libtom.org/.

Nonetheless, once that was done I tried out the NYTimes script and it worked like a charm. Thank you so much for the hard work.

kovidgoyal
07-20-2007, 05:53 PM
You only need convertlit if you plan on converting lit files. The rest of libprs500 will work just fine without it.

Platapie
07-20-2007, 09:01 PM
You only need convertlit if you plan on converting lit files. The rest of libprs500 will work just fine without it.

Ah, gotcha. For those who do wish to use convertlit, I forgot to mention that you also need to update the makefile to point it at the appropriate libtom file, or else it won't compile properly.

geekraver
07-27-2007, 04:21 PM
Awesome job, Kovid. I have but one request - can you add an option to ignore font families? I'd like to always use the built-in Sans-Serif font, as it is fast and readable. That would be cool.

kovidgoyal
07-27-2007, 04:42 PM
Thanks, open a ticket and I'll see to it when I get the time.

kovidgoyal
08-17-2007, 12:20 AM
v 0.3.96 has a new improved nytimes feed with support for logging in (see first post)

kovidgoyal
08-18-2007, 02:12 PM
v 0.3.99 has an improved BBC profile with a hierarchical TOC. See sample in first post.

Adrian
08-23-2007, 08:40 PM
What a fantastic app! This is precisely what I wanted to use my Reader for - reading the New York Times. I don't suppose you could make another script for the NYT Sunday Magazine? The articles in the magazine are particularly long, and would really suit being on the reader.

Keep up the good work!

numindast
09-04-2007, 10:43 PM
oh WOW!

I can get my beloved New York Times on my reader! oohh this makes me excited .. not quite excited-over-a-woman but boy oh boy!

Thanks much for this GREAT tool!

I had my expectations set too high -- I thought NYTimes.com would have a way to download their paper in ebook format, boy was I wrong. Still, I hope I can get this scripted out so that I can grab my Reader in the morning, head for the train, and read the news :) (I am a subscriber, albeit just to the Sunday print edition.)

Thanks!!

avh
09-16-2007, 03:15 AM
Hi kovidgoyal,

Thanks very much for this app. The 3 profiles that you provided are very nice. They are now on my daily LRF creation list.

I have a website that I read daily and would like to create an LRF output for it. After executing the command "web2lrf --url http://english.vietnamnet.vn", I got the "Downloading" echo, then the cursor in the command window would just keep blinking.

My thought now is to write a .NET program to download the top-level articles in the Politics and Business sections of this website. The next step would be to create the LRF document with a TOC just like you did for the BBC, NY Times, and Newsweek. This is where I got stuck :blink: . I don't know the best way to assemble the web pages into an LRF with a TOC. Would you please advise on how best to achieve results similar to your outputs? And did I say they look very nice, by the way :)

TIA

kovidgoyal
09-16-2007, 12:09 PM
If you interrupt the LRF creation, the temporary files will be preserved and you can look at them (they'll be in the temp directory, in a subdirectory whose name starts with libprs500).

Also libprs500 contains functions that make this task easy. Take a look at
https://libprs500.kovidgoyal.net/browser/trunk/src/libprs500/ebooks/lrf/web

paying special attention to profiles.py and the individual profile files. As you can see, it takes only a few dozen lines of code to create an individual profile.
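As a rough illustration, a minimal profile looks something like this (the class name and feed URL below are just placeholders, not a real profile from the package):

from libprs500.ebooks.lrf.web.profiles import DefaultProfile

class MySite(DefaultProfile):

    title = 'My Site'          # used as the title of the generated LRF
    max_recursions = 2         # how many links deep to follow from each feed item

    def get_feeds(self):
        # (section title, RSS feed URL) pairs
        return [ ('Front Page', 'http://example.com/rss.xml') ]

    def print_version(self, url):
        # optionally rewrite an article URL to its printer-friendly version
        return url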

avh
09-18-2007, 02:40 AM
Thanks kovidgoyal. I will study the profile files.

kovidgoyal
10-13-2007, 02:27 PM
When I release 0.4.9, it will have a modified nytimes profile that downloads only articles from the previous two days. This should result in a smaller/faster-loading LRF file. Let me know what you think of it.

JSWolf
10-13-2007, 05:16 PM
When I release 0.4.9, it will have a modified nytimes profile that downloads only articles from the previous two days. This should result in a smaller/faster-loading LRF file. Let me know what you think of it.
Can we have a profile to only download the current day's paper?

kovidgoyal
10-13-2007, 05:25 PM
web2lrf uses RSS, and there is no simple mapping from RSS publication dates to a "day's newspaper".

kovidgoyal
10-26-2007, 10:05 PM
version 0.4.15 should hit the servers soon.
It has support for adding custom profiles to web2lrf. See https://libprs500.kovidgoyal.net/wiki/UserProfiles

StDo
10-29-2007, 08:41 AM
Got a German RSS-newsfeed user profile for FAZ.net.

Check out faznet.py zipped as attachment.

Attention: it is an alpha version. :)

There are still some layout problems...

If anybody has some hints on getting the layout a little bit smoother... let me know!

Greetings from Germany,

StDo

dietric
11-01-2007, 09:06 PM
I installed an update from Sony to the Reader software and have been unable to synch files using Web2Book ever since. The error message is: "LoadLibrary failed with error 126. The specified module could not be found." Help would be highly appreciated.

kovidgoyal
11-01-2007, 10:46 PM
Follow the instructions in the Note at

https://libprs500.kovidgoyal.net/download_windows

modsoul
11-01-2007, 11:53 PM
Umm, how long does it usually take? It's been an hour since I started the New York Times conversion and it's still at it.

kovidgoyal
11-02-2007, 12:00 AM
upgrade to 0.4.17

modsoul
11-02-2007, 08:29 AM
Just tried the new version.
Thank you so much. You are awesome.
Without you I doubt the Reader would be even 1/3 as useful.

JTravers
11-04-2007, 04:03 AM
Thanks for the great tool, Kovid.

I downloaded it today and was able to use the Newsweek and BBC profiles flawlessly. NYT generated the lrf fine, but it seems none of the articles that require registration downloaded in their entirety. I did use the username and password options, and I verified that my username and password work at the NYT website.

Any idea why it's not working for me?

I'm on Win XP working from a command prompt.

Thanks

kovidgoyal
11-04-2007, 01:29 PM
Hmm looks like the NYT login process has changed. I'll fix it in the next release.

JTravers
11-05-2007, 03:12 AM
Thanks.

I'm going to try to put together some profiles of other news sites over the next week. If I'm successful, I'll post them up here.

kovidgoyal
11-05-2007, 12:03 PM
Cool at some point I should look into adding support for user profiles into the GUI.

dietric
11-05-2007, 08:44 PM
Follow the instructions in the Note at

https://libprs500.kovidgoyal.net/download_windows

Hmm, I tried that, but it is not LibPRS500 that has a problem detecting the reader, just Web2Book, so that didn't really help...

kovidgoyal
11-09-2007, 06:16 PM
Thanks for the great tool, Kovid.

I downloaded it today and was able to use the Newsweek and BBC profiles flawlessly. NYT generated the lrf fine, but it seems none of the articles that require registration downloaded in their entirety. I did use the username and password options, and I verified that my username and password work at the NYT website.

Any idea why it's not working for me?

I'm on Win XP working from a command prompt.

Thanks

Fixed in svn.

JTravers
11-14-2007, 02:51 PM
Fixed in svn.

Thank you!

toomanybarts
11-14-2007, 06:59 PM
Since installing and using libprs500 I now get the same error message as dietric when trying to use another app: rss2book.

kovidgoyal
11-14-2007, 07:06 PM
Well that's probably because web2book uses the SONY driver to find the reader. You can't use both the SONY driver and libprs500.

Silvayn
11-19-2007, 04:15 PM
Hmm,

OK, I'm trying to get a printed-version URL, but I don't know the required string handling commands for Python.

I need
http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html

to become
http://www.sme.sk/clanok_tlac.asp?cl=3592953


I know of 'replace'...

def print_version(self, url):
    return url.replace('/c/', '/clanok_tlac.asp?cl=')

But how do I get rid of the "/Ceskoslovenska-esej.html" at the end?

kovidgoyal
11-19-2007, 04:22 PM
https://libprs500.kovidgoyal.net/wiki/UserProfiles

DaleDe
11-19-2007, 07:53 PM
https://libprs500.kovidgoyal.net/wiki/UserProfiles

By the way, your wiki reference reminded me that I put a short article about libprs500 in the MobileRead wiki. You may want to flesh it out with more data.

Dale

kovidgoyal
11-19-2007, 08:00 PM
pydoc str


Look for rpartition
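For reference, rpartition splits a string at the last occurrence of the separator and returns a 3-tuple, so element [0] is everything before the final slash:

>>> 'http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html'.rpartition('/')
('http://www.sme.sk/c/3592953', '/', 'Ceskoslovenska-esej.html')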

Silvayn
11-20-2007, 11:48 AM
If I understand it correctly, rpartition divides a string into a 3-member array. This doesn't really help me that much, as I don't "speak" Python and it's different from the languages that I know. So... if I could ask some Python-knowledgeable person to give me the exact command for the string conversion... I assume it would cost you about 5 secs of your life :)

Thank you in advance... in return I offer (rusty) pascal & vbscript support :)

I need
http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html

to become
http://www.sme.sk/clanok_tlac.asp?cl=3592953

replace('/c/', '/clanok_tlac.asp?cl=') is step one... but after that I'm stuck.

kovidgoyal
11-20-2007, 01:17 PM
Ah well, here you go:

url = 'http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html'.rpartition('/')[0].replace('c/', 'clanok_tlac.asp?cl=')

JTravers
11-21-2007, 05:45 AM
Kovid,
I noticed that web2lrf entirely ignores/deletes words that have underlying links. This makes some articles a little hard to understand since key words are sometimes left out.

As an example, in the following article the names "David Beckham," "Adidas," and "Pepsi" are all deleted/ignored when it is converted to an lrf.
http://www.nytimes.com/2007/11/17/business/17interview.html?pagewanted=print

I noticed the same thing happens when downloading the html file and running it through html2lrf. I've attached the lrf I generated as an example.

Is there something about linked text that makes it difficult to parse? Or is this simply a bug that needs to be eliminated?

Thanks a lot for your help.

BTW, still trying to get some profiles made. Not knowing Python is proving to be a rather large stumbling block, however.

kovidgoyal
11-21-2007, 11:31 AM
That's a bug, actually a regression I introduced a few versions back. It will be fixed in the next release.

kovidgoyal
11-21-2007, 05:41 PM
BTW, still trying to get some profiles made. Not knowing Python is proving to be a rather large stumbling block, however.

Here's a link to a python tutorial that may be of some help

http://docs.python.org/tut/tut.html

JTravers
11-21-2007, 08:01 PM
Here's a link to a python tutorial that may be of some help

http://docs.python.org/tut/tut.html

Thanks for the link :2thumbsup

I'm really looking forward to getting some more interesting web content onto my 505.

BTW, does web2lrf only accept RSS feeds as input, or can one give it a regular webpage to process?

kovidgoyal
11-22-2007, 01:58 PM
web2lrf --url http://mypage default

will process a website.

tompe
11-22-2007, 04:47 PM
Can you stop the processing after the html has been cleaned up but before the html file tree is removed? (Or how do you get web2html?)

kovidgoyal
11-22-2007, 06:27 PM
web2disk

tompe
11-22-2007, 07:32 PM
Does web2disk really do the cleanup of the HTML code? If I only want the files, I suppose wget will work also. Or does web2disk do something that wget does not do?

kovidgoyal
11-22-2007, 08:43 PM
It's optimized for downloading websites for conversion to ebooks. Has link filters and recursion level control and a bunch of other features

web2disk --help


Cleanup is done by regexps. I don't remember whether the regexps are passed to web2disk or html2lrf -- I think it is web2disk -- but there may not be a command line interface to it.

tompe
11-22-2007, 09:19 PM
But if you run web2lrf it seems like the cleanup is done just before the conversion to another format. With --debug it says:

[INFO] convert_from.py:330: Processing 7108374.stm
[INFO] convert_from.py:283: Parsing HTML...
[INFO] convert_from.py:318: Written preprocessed HTML to /tmp/html2lrf-verbose.html
[INFO] convert_from.py:333: Converting to BBeB...


But since "web2disk bbc" is not implemented, I have not been able to get the result after the preprocessing, so I have not been able to check how it looks.

kovidgoyal
11-23-2007, 12:31 AM
Yeah you'd have to figure out the arguments to web2disk that the BBC profile uses from the source code and pass them manually using the commandline.

veshman
11-23-2007, 11:44 PM
I'm trying to write a converter for Wired magazine. I am totally new to Python... how can I add /print/ into the following URL?


http://www.wired.com/gadgets/digitalcameras/magazine/test2007/dc_burning_question

http://www.wired.com/print/gadgets/digitalcameras/magazine/test2007/dc_burning_question

I'm thinking something like this might work....but I don't know how to make the latter part of the URL a variable that I can put back into the string.

return url.replace('wired.com/?', 'wired.com/print/?')

thanks,

bhavesh

FixB
11-24-2007, 06:18 AM
Sorry veshman: I'm having the same difficulties here on some French RSS :)
I would have thought your suggestion should work. Maybe you don't need the "?", as you just replace wired.com with wired.com/print?
By the way, does someone know how I can keep (and access) the intermediate HTML files when using web2lrf, so that I could see exactly where my use of regular expressions is faulty?

FixB
11-24-2007, 06:27 AM
I tried it, and it seems that:
def print_version(self, url):
    return url.replace('wired.com','wired.com/print')
works correctly.
But strangely, not for all articles. The first one seems OK, but the second one is in the 'normal' format... strange :)

veshman
11-24-2007, 10:29 AM
perhaps it should be:

return url.replace('wired.com','wired.com/print/')

with a second "/"

I'll give it a try.

Also, any thoughts on how to keep web2lrf from pursuing external links (e.g. ads)?

thanks,

bhavesh

veshman
11-24-2007, 10:43 AM
so I'm getting the URL to appear correctly using the url.replace function, but for some reason, web2lrf can't process the link.

Processing category6.html
Parsing HTML...
Converting to BBeB...
Could not follow link to http://www.wired.com/print/science/discoveries/magazine/15-11/st_alphageek

If I just copy and paste the URL into a web browser, it works fine.

Bhavesh

veshman
11-24-2007, 11:22 AM
Using the url.replace code did work with the addition of the "/", but web2lrf was unable to find the link, even though it created it correctly.

Meaning, if I copy and paste the link that web2lrf is trying to get into a browser, it works fine.

veshman
11-24-2007, 11:25 AM
On the exclude-links front, I tried adding an option to the script, but so far I haven't figured it out.

link-exclude = [^wired]

or
link-exclude = ^w^i^r^e^d
or
link-exclude = *[^wired]*

and a number of other failed attempts that give me a syntax error.

kovidgoyal
11-24-2007, 01:15 PM
I'm on my Thanksgiving break right now, so I can't help in detail, but you may find this page helpful:

http://docs.python.org/lib/re-syntax.html

DaveNB
11-25-2007, 09:55 PM
Try this script. Copy the text below the ------ and save/paste it into a file called "wired.py", it'll produce a file:
Wired RSS [25 Nov 2007 1720].lrf (for example).

I think it's producing pretty clean text (most ads, links, banners, comments, cruft are removed) for reading off-line, but there are still some formatting issues (some fonts too big, others way too small, maybe I need to kill all CSS info in the <header> sections completely?).

BTW, if you make any changes to the user profile wired.py file, before running the web2lrf command, delete the previously generated wired.pyc file or your changes won't be reflected (I think).

Any suggestions for cleaning up the text formatting? Give it a try.

Dave

-------


# coding: ISO-8859-1
## Copyright (C) 2007 David Chen SonyReader<at>DaveChen<dot>org
##
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2 of the License, or
## (at your option) any later version.
##
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
##
## Version 0.6-2007-11-27
## Based on newsweek.py, bbc.py, nytimes.py by Kovid Goyal
## https://libprs500.kovidgoyal.net/wiki/UserProfiles
##
## Usage:
## >web2lrf --user-profile wired.py
## Comment out the RSS feeds you don't want in the last section below
##
## Output:
## Wired [YearMonthDate Time].lrf
##
'''
Profile to download RSS News Feeds and Articles from Wired.com
'''

import re

from libprs500.ebooks.lrf.web.profiles import DefaultProfile

class wired(DefaultProfile):

    title = 'Wired'
    max_recursions = 2
    timefmt = ' [%Y%b%d %H%M]'
    html_description = True
    no_stylesheets = True

    ## Don't grab articles more than 7 days old
    oldest_article = 7

    preprocess_regexps = [(re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in
        [
            ## Remove any banners/links/ads/cruft before the body of the article.
            (r'<body.*?((<div id="article_body">)|(<div id="st-page-maincontent">)|(<div id="containermain">)|(<p class="ap-story-p">)|(<!-- img_nav -->))', lambda match: '<body><div>'),

            ## Remove any links/ads/comments/cruft from the end of the body of the article.
            (r'((<!-- end article content -->)|(<div id="st-custom-afterpagecontent">)|(<p class="ap-story-p">&copy;)|(<div class="entry-footer">)|(<div id="see_also">)|(<p>Via <a href=)|(<div id="ss_nav">)).*?</html>', lambda match : '</div></body></html>'),

            ## Correctly embed in-line images
            (r'<a.*?onclick.*?>.*?(<img .*?>)', lambda match: match.group(1),),

            ## Correct the apostrophe character so it renders well in LRF
            (r'’', lambda match: "'"),
        ]
    ]

    ## Use the single page Print version of a page when available.
    ## Not all RSS entries have Print versions, ie. ones hosted on the blog.wired.com URL's

    def print_version(self, url):
        return url.replace('http://www.wired.com/', 'http://www.wired.com/print/')

    ## Comment out the feeds you don't want retrieved.
    ## Or add any new RSS feed URL's here

    def get_feeds(self):
        return [
            ('Top News', 'http://feeds.wired.com/wired/index'),
            ('Culture', 'http://feeds.wired.com/wired/culture'),
            ('Software', 'http://feeds.wired.com/wired/software'),
            ('Mac', 'http://feeds.feedburner.com/cultofmac/bFow'),
            ('Gadgets', 'http://feeds.wired.com/wired/gadgets'),
            ('Cars', 'http://feeds.wired.com/wired/cars'),
            ('Entertainment', 'http://feeds.wired.com/wired/entertainment'),
            ('Gaming', 'http://feeds.wired.com/wired/gaming'),
            ('Science', 'http://feeds.wired.com/wired/science'),
            ('Med Tech', 'http://feeds.wired.com/wired/medtech'),
            ('Politics', 'http://feeds.wired.com/wired/politics'),
            ('Tech Biz', 'http://feeds.wired.com/wired/techbiz'),
            ('Commentary', 'http://feeds.wired.com/wired/commentary')
        ]

veshman
11-26-2007, 11:52 AM
Dave,

thanks! i'll give it a try and post my results.

bhavesh

veshman
11-26-2007, 11:55 AM
Kovid,

thanks for the link...it is very helpful. I'll try a couple of the expressions out.

bhavesh

kovidgoyal
11-28-2007, 02:18 AM
version 0.4.25 finally implements support for The Economist. See demo attached to first post.

DaveNB
11-28-2007, 02:31 AM
I edited the previous post to reflect the changes in the source code for the newest wired.py User Profile for web2lrf.

There is major improvement in the proper rendering/placement of inline images and proper display of inline hypertext linked phrases/words.

However, there are still some issues with text encoding causing problems with the apostrophes (sometimes Wired uses a simple vertical tick, sometimes an apostrophe whose tail curves down to the left; the latter renders strangely as 3 international characters on the Sony Reader). Version 0.6 attempts to fix this, but so far I can't seem to get the right character/hex sequence for the problematic apostrophe character (right single quote) to substitute it out.

Give it a try and let me know if anyone can figure out how to fix the apostrophe problem.

Dave

kovidgoyal
11-28-2007, 02:56 AM
The problem with wired is that the files are encoded in UTF8 but they specify the encoding as iso8859-1. You can try either
1) Contact wired
2) write a preprocess regexp that changes the specified encoding

(r'<meta http-equiv="Content-Type" content="text/html; charset=(\S+)"',
lambda match : match.group().replace(match.group(1), 'UTF-8'))

DaveNB
11-28-2007, 03:27 AM
The problem with wired is that the files are encoded in UTF8 but they specify the encoding as iso8859-1. You can try either
1) Contact wired
2) write a preprocess regexp that changes the specified encoding

(r'<meta http-equiv="Content-Type" content="text/html; charset=(\S+)"',
lambda match : match.group().replace(match.group(1), 'UTF-8'))


I see. I tried changing wired.py to specify an iso8859-1 encoding, but this didn't fix the problem; the apostrophes are still funny... will keep hacking at it. I also tried searching for the exact hex sequence that is causing trouble and replacing it with a normal apostrophe, without success:

(r'\xE2\x80\x99', lambda match: "'"),



Any ideas?

Dave

kovidgoyal
11-28-2007, 03:49 AM
I'm not sure that regexp is correct; use --keep-downloaded-files to make sure it's actually being applied.

DaveNB
11-28-2007, 04:54 AM
Yeah, I wasn't so sure about that regex either, but your previous suggestion of correcting Wired.com's claimed encoding to UTF-8 worked perfectly; I didn't even have to search for the errant pattern and correct it. Hopefully this will fix all accented characters as well (they were showing up funny after the LRF conversion).

Version 0.7 is now being put up on Kovid's wiki for custom user profiles for web2lrf. It's a lot easier to post the changes in just one place that way.
https://libprs500.kovidgoyal.net/wiki/UserProfiles
Apostrophes fixed.

Dave

FixB
11-28-2007, 06:38 AM
I'm not sure that regexp is correct; use --keep-downloaded-files to make sure it's actually being applied.
That's the command line option I was looking for !!! Thanks !!

DaveNB
11-30-2007, 08:03 AM
I wrote up a HOWTO and posted it to Kovid's libprs500 page
OK, I put up a quick and dirty and hopefully helpful HOWTO here:
https://libprs500.kovidgoyal.net/wiki/UserProfiles

Hope that helps.

Dave

StDo
11-30-2007, 12:29 PM
I wrote up a HOWTO and posted it to Kovid's libprs500 page
OK, I put up a quick and dirty and hopefully helpful HOWTO here:
https://libprs500.kovidgoyal.net/wiki/UserProfiles

Hope that helps.

Dave

Great! Thanks.

Maybe Kovid can create a link to a new page with your HOWTO.

The User Profile page is getting too big... :)

JTravers
11-30-2007, 12:47 PM
I wrote up a HOWTO and posted it to Kovid's libprs500 page
OK, I put up a quick and dirty and hopefully helpful HOWTO here:
https://libprs500.kovidgoyal.net/wiki/UserProfiles

Hope that helps.

Dave

Great stuff. This is so very helpful.

One question I wanted to ask Kovid, you, or anyone else with more experience building profiles: is it possible to set up a profile to clean up a regular web page that links to the content you want? Or are the profiles strictly for use with RSS feeds?

Thanks!

kovidgoyal
11-30-2007, 02:08 PM
@DaveNB
Thanks, I've moved your HOWTO to a separate page that is referenced from UserProfiles https://libprs500.kovidgoyal.net/wiki/UserProfilesHOWTO
That way you can address it directly and the UserProfiles page doesn't become too long.

@JTravers
The behavior of web2lrf is fully customizable.
You would need to re-define the build_index function in your profile to simply return the path to the pre-built index file.
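Something along these lines (a sketch only; the class name and path are placeholders):

class MyPrebuiltIndex(DefaultProfile):

    def build_index(self):
        # Return the path to an index page that already links to the content
        # you want, instead of letting web2lrf build one from RSS feeds.
        return 'C:/profiles/my_prebuilt_index.html'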

veshman
12-01-2007, 12:35 AM
Hey guys, I just wanted to say congrats on getting Wired done!....even though I haven't had a chance to tinker with it all week, I am keeping up with the progress.

I'll try to use the HOWTO Dave posted to work on The Atlantic, www.theatlantic.com, one of my favorite reads. I was having some trouble picking out the links last time I tried, I think, but I'll give it another go.

bhavesh

StDo
12-01-2007, 07:52 PM
Can anybody give me a hint on how to insert "druck-" after the penultimate comma of the following link?

http://www.spiegel.de/sport/sonst/0,1518,520867,00.html

It should look like this afterwards:
http://www.spiegel.de/sport/sonst/0,1518,druck-520867,00.html

So where are the Python specialists? :D

kovidgoyal
12-01-2007, 08:15 PM
tokens = url.split(',')
tokens[-2:-1] = ['-druck']
url = ','.join(tokens)

That's just off the top of my head; you'll almost certainly have to modify it to make it work correctly.

StDo
12-02-2007, 07:23 AM
tokens = url.split(',')
tokens[-2:-1] = ['-druck']
url = ','.join(tokens)


Where do I have to implement that?

Somewhere after:
def get_feeds(self):
    return [ ('Spiegel Online', 'http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml') ]

def print_version(self,url):
    return url.replace

I tried different versions; however, I am getting an error:
SyntaxError: invalid syntax

kovidgoyal
12-02-2007, 02:20 PM
def print_version(self,url):
    tokens = url.split(',')
    tokens[-2:-1] = ['-druck']
    return ','.join(tokens)

StDo
12-02-2007, 03:24 PM
Hmm.


tokens[-2:-1] = ['-druck']
-this-is--a-spaceholder--^
IndentationError: unindent does not match any outer indentation level

He does not like the "]"

:blink:

kovidgoyal
12-02-2007, 03:31 PM
Just retype the function making sure that the indentation is all spaces and equal

StDo
12-02-2007, 03:58 PM
That's it. Thanks.

By the way, how should I handle articles being skipped because they have no publication date?

[DEBUG] __init__.pyo:172: Skipping article as it does not have publication date
[DEBUG] __init__.pyo:172: Skipping article as it does not have publication date

kovidgoyal
12-02-2007, 04:30 PM
I'm not sure what you mean? You want to include articles that don't have a publication date? In that case, the only way to do it is to redefine the parse_feeds function in your profile.

StDo
12-02-2007, 04:50 PM
Kovid, I tried to get spiegelde.py running.

spiegelde.py:
from libprs500.ebooks.lrf.web.profiles import DefaultProfile

import re

class SpiegelOnline(DefaultProfile):

    title = 'Spiegel Online'
    timefmt = ' [ %Y-%m-%d %a]'
    max_recursions = 1
    max_articles_per_feed = 40
    html_description = True
    no_stylesheets = True


    def get_feeds(self):
        return [ ('Spiegel Online', 'http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml') ]

    def print_version(self,url):
        tokens = url.split(',')
        tokens[-2:-1] = ['-druck']
        return ','.join(tokens)




But the spiegel.de RSS feed shows the time format only as "Heute um 20:00 Uhr" (that means: "Today at 8 p.m.").

See: http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml

kovidgoyal
12-02-2007, 04:56 PM
Then you will have to redefine the function strptime. The function takes a string argument and should return the number of seconds since the epoch (Jan 1 1970) in the GMT time zone.

something like


def strptime(self, src):
    # Some code to convert the string src into a datetime
    # This is a dummy implementation that just returns the current time
    return time.time()
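If you want a real value, something along these lines might work for strings like "Heute um 20:00 Uhr" (an untested sketch; it assumes today's date in GMT and ignores the timezone offset of the feed):

def strptime(self, src):
    import time, calendar, re
    match = re.search(r'(\d{1,2}):(\d{2})', src)
    now = time.gmtime()
    if match:
        # today's date combined with the hour/minute parsed from the feed
        return calendar.timegm((now.tm_year, now.tm_mon, now.tm_mday,
                                int(match.group(1)), int(match.group(2)), 0, 0, 0, 0))
    return time.time()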

StDo
12-02-2007, 05:51 PM
Seems to be hard work; I will try to configure it in a few days...

Can't I just tell web2lrf that it should take all articles shown, since there seem to be only around 40-50 articles at spiegel.de?

kovidgoyal
12-02-2007, 05:55 PM
Just define the dummy strptime function as shown above and that will do this.

StDo
12-02-2007, 06:14 PM
Sorry, getting the same error...


'''
Fetch Spiegel Online.
'''

from libprs500.ebooks.lrf.web.profiles import DefaultProfile

import re

class SpiegelOnline(DefaultProfile):

    title = 'Spiegel Online'
    timefmt = ' [ %Y-%m-%d %a]'
    max_recursions = 2
    max_articles_per_feed = 40
    # html_description = True
    # no_stylesheets = True


    def get_feeds(self):
        return [ ('Spiegel Online', 'http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml') ]

    def strptime(self, src):
        # Some code to convert the string src into a datetime
        # This is a dummy implementation that just returns the current time
        return time.time()

    def print_version(self,url):
        tokens = url.split(',')
        tokens[-2:-1] = ['-druck']
        return ','.join(tokens)

kovidgoyal
12-02-2007, 06:28 PM
Ah, I see that the feed has no publication date. OK. I've added a use_pubdate variable (in svn). Set it to False to prevent web2lrf from trying to figure out the publication date:


use_pubdate = False

JTravers
12-03-2007, 06:33 AM
I have a profile setup for WSJ.com. I'm trying to get it configured to work with subscription content (only for those that have a valid paid subscription, of course).

The problem is that WSJ.com does not allow multiple, concurrent logins. If it detects multiple, concurrent logins, your account is subsequently locked until you call customer service.

So the 1st time I logged in through the web2lrf profile, everything worked and downloaded properly. However, every subsequent time I tried using the profile, the login didn't work (account was locked), so only non-subscription content was downloaded.

In order to prevent this, I believe one needs to log out of the site before exiting web2lrf. Is there a way to log out of a site using web2lrf? Perhaps the same kind of functionality as the login, but processed at the end instead of the beginning.

This dilemma also applies to the Barrons.com site (since they are under the same umbrella as the WSJ.com). My profile for this only worked a couple times before I got locked out of the site.

Thanks for your help with this.
(.txt extension added to facilitate the upload)

kovidgoyal
12-03-2007, 12:55 PM
I've added a cleanup method to the profile that's called after the LRF file has been generated. You can use self.browser to logout in that method.

JTravers
12-03-2007, 05:26 PM
I've added a cleanup method to the profile that's called after the LRF file has been generated. You can use self.browser to logout in that method.

Thank you so much for adding this.

I'm going to need some help on the proper code to use, though, due to my ignorance of Python.

Would adding something like this to my profile work?

def cleanup(self):
    return [
        self.browser.open('http://online.barrons.com/logout')
    ]

Thanks for your help with this.

One other question for you, if you don't mind. How do you add the --ignore-tables option to the profile, so you don't have to specify it on the command-line every time you use the profile?

Thanks again.

kovidgoyal
12-03-2007, 06:12 PM
Yeah that should do it, no need to return anything though.

Use

html2lrf_options = ['--ignore-tables']
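In a profile it's just a class attribute alongside the others, e.g. (the class name is only an example):

class Barrons(DefaultProfile):

    title = "Barron's"
    html2lrf_options = ['--ignore-tables']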

StDo
12-03-2007, 06:42 PM
def print_version(self,url):
    tokens = url.split(',')
    tokens[-2:-1] = ['druck-']
    return ','.join(tokens)



Kovid,
That snippet you gave me replaces the numbers between the last comma and the second-last comma with "druck-". But those numbers should remain, and "druck-" should be added in front of them, right after the second-last comma.

The original link:
http://www.spiegel.de/panorama/justiz/0,1518,521183,00.html
should be
http://www.spiegel.de/panorama/justiz/0,1518,druck-521183,00.html
and not (as it will be done with the snippet above):
http://www.spiegel.de/panorama/justiz/0,1518,druck-,00.html

Thanks for thinking and coding. :)

JTravers
12-03-2007, 08:03 PM
Yeah that should do it, no need to return anything though.

Use

html2lrf_options = ['--ignore-tables']


When trying the cleanup code, web2lrf hangs right after generating the lrf. I used the following code:
def cleanup(self):
    self.browser.open('http://online.barrons.com/logout')

For Barron's, I have to set max recursions to 3 because there are some articles that are divided into two parts (even the print versions). Doing this, however, causes web2lrf to follow a bunch of other links which end up being garbage and taking it off the Barron's website. Is there a way to restrict the links that web2lrf follows? I've tried the following, but it didn't seem to work:

match_regexps = ['<a.*?mod=.*?>']
and I also tried:
match_regexps = ['<a.*?online.barrons.com.*?>']

It doesn't seem like either is having an effect. I know I'm probably misusing these options, so any guidance would be appreciated.

Finally, I tried using html2lrf_options before (and again now), and it doesn't seem to give the same output that is generated when specifying --ignore-tables on the command line. Not sure why.

kovidgoyal
12-03-2007, 09:05 PM
@StDo
Oops sorry. Here you go

def print_version(self,url):
    tokens = url.split(',')
    tokens[-2:-2] = ['druck|']
    return ','.join(tokens).replace('|,','-')
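Worked through by hand on your example URL (just to show why the '|' trick is needed):

# 'http://www.spiegel.de/panorama/justiz/0,1518,521183,00.html'.split(',')
#   -> ['http://www.spiegel.de/panorama/justiz/0', '1518', '521183', '00.html']
# inserting 'druck|' before the second-last token and joining gives
#   'http://www.spiegel.de/panorama/justiz/0,1518,druck|,521183,00.html'
# and replace('|,', '-') turns that into
#   'http://www.spiegel.de/panorama/justiz/0,1518,druck-521183,00.html'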


@JTravers
match_regexp works on the contents of the href attribute, i.e. the URL itself, not on the <a> tag. As for html2lrf_options, looks like a regression, they aren't being applied. Will be fixed in the next release.
Not sure why the cleanup code should hang, I'll look at that later.

kovidgoyal
12-03-2007, 09:27 PM
@JTravers

Just realized I can't look at the cleanup code as I don't have a subscription. Try the following to debug


def cleanup(self):
    res = self.browser.open('whatever the url was')
    print res.read()

JTravers
12-04-2007, 01:37 AM
@JTravers
match_regexp works on the contents of the href attribute, i.e. the URL itself, not on the <a> tag.

Here's the code I'm using for the link regexp:
match_regexps = ['http://online.barrons.com/.*?html\?mod=.*?']

But I can see webpages being fetched from entirely different domains than barrons.com. I've attached my profile for Barrons. You should be able to test it (at your convenience, of course) without supplying a username and password, as there are some articles that are available to non-subscribers.

JTravers
12-04-2007, 02:12 AM
@JTravers

Just realized I can't look at the cleanup code as I don't have a subscription. Try the following to debug


def cleanup(self):
    res = self.browser.open('whatever the url was')
    print res.read()


Still hangs -- both when I login and when I don't. If you have the time to check, you should be able to test even without logging in. You can use my profile from the prior post.

kovidgoyal
12-04-2007, 05:17 PM
Hmm another regression was preventing match_regexps from working. Fixed in svn. Note that in your case match regexps should be

match_regexps = ['http://online.barrons.com/.*?html\?mod=.*?|file://.*']

As for the cleanup hanging, it seems to be following a long redirect chain.

Use the following code to see the HTTP responses being sent by the server


def cleanup(self):
    try:
        self.browser.set_debug_responses(True)
        import sys, logging
        logger = logging.getLogger("mechanize")
        logger.addHandler(logging.StreamHandler(sys.stdout))
        logger.setLevel(logging.INFO)

        res = self.browser.open('http://online.barrons.com/logout')
    except:
        import traceback
        traceback.print_exc()


You may find the documentation at http://wwwsearch.sourceforge.net/mechanize/ useful for understanding how the browser object works.

JTravers
12-04-2007, 05:41 PM
Thanks for all of your help, Kovid.

I'll take a look at the code and link you recommended and see if I can come up with a solution.

Once that's all worked out, the profiles I made for WSJ.com and Barrons.com should be pretty much done.

I'll probably start working on other finance/investment sites after that. (The WSJ.com blogs should be pretty easy to implement -- and they're free, too!).

JTravers
12-05-2007, 04:21 AM
What does the following error mean?

Traceback (most recent call last):
File "convert_from.py", line 187, in <module>
File "convert_from.py", line 181, in main
File "convert_from.py", line 123, in process_profile
File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 92, in __init__
File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 104, in build_index
File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 159, in parse_feeds
ValueError: too many values to unpack

I get it when trying to process the following feed:
http://feeds.portfolio.com/portfolio/businessspin

Thanks.

kovidgoyal
12-05-2007, 04:39 AM
That means the get_feeds function is not returning a correct sequence.
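For reference, get_feeds should return a list of (title, URL) pairs, e.g. (illustrative only):

def get_feeds(self):
    return [
        ('Business Spin', 'http://feeds.portfolio.com/portfolio/businessspin'),
    ]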

JTravers
12-05-2007, 04:44 AM
I'm trying to set up profiles for some full-content feeds, in which I go no further than listing the articles with descriptions (since the descriptions in the feed contain the full content). However, I noticed that linked text in a feed description is removed.

I know html2lrf had a regression which removed linked text completely (which you have already fixed). So I thought maybe this was a regression, too. If not, perhaps you could set it up so that it just strips the links from the descriptions but keeps the text in place.

Thanks.

JTravers
12-05-2007, 04:47 AM
That means the get_feeds function is not returning a correct sequence.

User error on my part. I forgot a comma between the feed title and URL. :oops2:

kovidgoyal
12-05-2007, 12:53 PM
Can you give me an example of such a feed, so I can debug.

JTravers
12-05-2007, 05:17 PM
Can you give me an example of such a feed, so I can debug.

Here's one from the profile I was working on.
http://feeds.portfolio.com/portfolio/businessspin

I've attached the lrf generated from the profile, so you can see the results.

kovidgoyal
12-05-2007, 05:28 PM
Ah OK, should be fixed in svn; let me know if it still gives you trouble.

JTravers
12-05-2007, 11:09 PM
Whenever I set max_recursions to 0 or 1 in a profile, I get the following error after the lrf is generated:
Exception exceptions.WindowsError: WindowsError(32, 'The process cannot access
the file because it is being used by another process') in <bound method Portfolio.__del__ of
<portfolio.Portfolio object at 0x00FCFCF0>> ignored

If I then set max_recursions to 2 or more, the error goes away.

kovidgoyal
12-05-2007, 11:35 PM
That error can be safely ignored, all it means is that some temporary file was not deleted.

JTravers
12-06-2007, 04:19 AM
Just to let everyone know, I posted profiles for the Wall Street Journal, Barron's, and Portfolio.com on Kovid's wiki.
https://libprs500.kovidgoyal.net/wiki/UserProfiles

Subscribers to WSJ and Barron's should be able to get all the content using the --username and --password options in web2lrf. Non-subscribers will get the free articles only.

Be aware that because of the peculiarities of how concurrent logins are handled at the WSJ and Barron's sites, you may get locked out of your account for a short period of time using the WSJ and Barrons profiles. You would probably have to run the profiles (with login credentials) multiple times before this happens, though. So if you're only running it once within a reasonable period of time, you should be safe.

StDo
12-16-2007, 06:08 PM
Just to let everyone know, I posted a profile for "Dilbert" - the daily comic strip - on Kovid's wiki.
https://libprs500.kovidgoyal.net/wiki/UserProfiles

Thanks to Stenis (http://www.mobileread.com/forums/member.php?u=8609) - it is his favourite feed. :)

JTravers
12-17-2007, 04:37 AM
Thanks for the Dilbert profile.
What a great idea!

StDo
12-17-2007, 03:56 PM
Thanks for the Dilbert profile.
What a great idea!

You are welcome. :)

Btw. let the karma grow! :thumbsup:

secretsubscribe
01-09-2008, 10:32 PM
Hello
I'm in the process of developing a profile to log in and download articles from thenation.com.
The Nation doesn't have an RSS feed for their monthly articles. They have feeds for Most Emailed, Top Stories, etc.. But I want to download the current month's "Magazine."
What's helpful is that the month's articles (those included in print AND web-only articles) are located at http://www.thenation.com/issue/YYYYMMDD
The individual articles are located at http://www.thenation.com/doc/YYYYMMDD/author_name.

So I was able to scrape out all the URLs for the articles.
Then, in trying to figure out what to do next, I decided to take those URLs and create an RSS XML file on my local drive (c:\program files\libprs500\nation.xml),
that I then returned at the end of the profile:
return [('feed1','file:///c:/program%20files/libprs500/nation.xml')]

It worked!
Now I need to figure out how to extract the article titles and descriptions and make the proper replacements to get the print versions of the articles instead.

But the main reason I'm posting is to ask if creating and accessing the local RSS file is the way to go. This would be a lot more convenient for anyone interested if the profile script didn't have to worry about generating files and directory structures.
I just started to take a look at this a few days ago and it's the first time I've tried my hand at Python, so thanks for any help in advance.

kovidgoyal
01-09-2008, 11:06 PM
Creating an XML file will work; it is the least Python-intensive solution. However, you can also just override the parse_feeds() function. It should return a list of dictionaries. Each dictionary should be of the form


{
    'title' : article title,
    'url' : URL of print version,
    'date' : The publication date of the article as a string,
    'description' : A summary of the article
}

secretsubscribe
01-10-2008, 02:47 AM
Hello
Instead of overriding get_feeds, I've attempted to override the parse_feeds function.
I create the list of dictionaries and return it.
Now I get this message:
File "convert_from.py", line 198, in <module>
File "convert_from.py", line 192, in main
File "convert_from.py", line 131, in process_profile
File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 93, in __init__
File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 127, in build_index
AttributeError: 'list' object has no attribute 'keys'

thank you

kovidgoyal
01-10-2008, 11:19 AM
Oh I'm sorry, what needs to be returned is a dictionary whose keys are feed titles (like Business, National News, etc.) and whose values are the lists of dictionaries I mentioned before.
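In other words, something shaped like this (the feed name, titles, URL and date below are placeholders):

def parse_feeds(self):
    return {
        'Magazine' : [
            {
                'title'       : 'Some article title',
                'url'         : 'http://www.thenation.com/doc/20080121/some_author',
                'date'        : '10 Jan 2008',
                'description' : 'A short summary of the article',
            },
            # ... more article dictionaries ...
        ],
        # ... more feeds keyed by title ...
    }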

shempe
01-10-2008, 12:15 PM
Hi there

Here is a quick'n'dirty snippet from me for Germany's Heise Newsticker. It's working fine for me.

import re

from libprs500.ebooks.lrf.web.profiles import DefaultProfile

class heise(DefaultProfile):

    title = 'Heise Newsticker'
    max_recursions = 2
    use_pubdate = False
    no_stylesheets = True
    max_articles_per_feed = 30


    preprocess_regexps = [ (re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in [
        (r'<!-- Site Navigation Bar -->.*?<title>', lambda match : '<title>'),
        (r'</title>.*?</head>', lambda match : '</title> </head>'),
        (r'<!-- allgemeine obere Navigation -->.*?</heisetext>', lambda match : ''),
        (r'<table.*?</table>', lambda match : ''),
        (r'<br clear="all".*?</body>', lambda match : '</div> </body>')
    ] ]

    def get_feeds(self):
        return [ ('Heise Newsticker', 'http://www.heise.de/newsticker/heise.rdf') ]

    def print_version(self, url):
        return url.replace('http://www.heise.de/newsticker/meldung/', 'http://www.heise.de/newsticker/meldung/print/')


have fun
Stefan

kovidgoyal
01-10-2008, 12:22 PM
You should add it to https://libprs500.kovidgoyal.net/wiki/UserProfiles so other people can find and use it. You'll need to create an account and let me know the user name so I can give you write permission for the wiki.

secretsubscribe
01-10-2008, 01:43 PM
Oh I'm sorry, what needs to be returned is a dictionary whose keys are feed titles (like Business, National News, etc.) and whose values are the lists of dictionaries I mentioned before.

Fantastic! It works. Just need to polish a few things as much as I currently am able and then I'll post the profile.

Finally being able to read The Nation every month and get the New York Times every morning adds so much value to my Sony Reader (I might be able to convince others to buy one.)

Thanks for all your work and help.

shempe
01-11-2008, 11:42 AM
I posted a new profile for German Golem News and updated my Heise Newsticker.

look at:

https://libprs500.kovidgoyal.net/wiki/UserProfiles


Stefan

cartz
01-11-2008, 04:05 PM
Fantastic! It works. Just need to polish a few things as much as I currently am able and then I'll post the profile.


I look forward to your posting so I can use it as a template for a newspaper I'd like to get working. It has a text-only edition of the paper with an index page, and all articles are a single link from that. http://www.theage.com.au/text/

I know nothing of Python or HTML and have tried experimenting, but I realize I need to see a working example of a non-RSS feed profile. Otherwise I think it should be quite simple, because the layout of the text version of the paper is already very Sony Reader friendly.

I don't have my Sony Reader yet. I ordered it yesterday (shipping to Australia) but figure trying to sort this out is a good way to pass my waiting time :)

StDo
01-14-2008, 04:25 PM
I posted a new profile for German Golem News and updated my Heise Newsticker.

look at:

https://libprs500.kovidgoyal.net/wiki/UserProfiles


Stefan

Super! :thanks: :thumbsup:

Keep it up! :-)

Would you like to have a go at Sueddeutsche.de next... ;)

Or at fscklog.com or mactechnews.de...

slav
01-16-2008, 06:39 AM
Hi All!

I have a problem converting one RSS feed - the problem is with &lt; and &gt; (the feed is full of them).

I tried to write a regex like:
(r'(&lt;)(.*?&gt;)', lambda match : '<code>' + match.group(1) + match.group(2) + '</code>'),

but it doesn't work (I'm not a regex wizard :-)

can anyone help me with that?

kovidgoyal - big thanx for your work on this program !

kovidgoyal
01-16-2008, 12:43 PM
What's the problem with &lt; and &gt;? Are they not being converted correctly?

slav
01-17-2008, 04:50 AM
The problem is that they are being converted, so they produce unknown tags like:

&lt;ThatsMyXMLTag&gt; text inside my tag &lt;/ThatsMyXMLTag&gt;

which produces, in the output temp HTML:

<ThatsMyXMLTag> text inside my tag </ThatsMyXMLTag>

and then web2lrf tries to convert that to lrf and nothing is displayed (at least that's what I think)

I saw in the demo.html file that you put this into <code> tags; that's why I was trying this regex...

thanks!

kovidgoyal
01-17-2008, 11:58 AM
Unknown tags in an HTML file are ignored, i.e. html2lrf treats <unknown>some text</unknown> as some text. So I don't think that is the problem. Are the &lt; entities in the HTML or the RSS feed itself?

slav
01-18-2008, 07:04 AM
In the RSS feed itself. See this feed for example:
http://feeds.feedburner.com/netslave

The "Add or remove the www sub domain" post contains lots of source code. All sections like:

<httpModules>
<add type="WwwSubDomainModule" name="WwwSubDomainModule" />
</httpModules>


are not in the output LRF file.

even the

/// <summary>
/// Handles the BeginRequest event of the context control.
/// </summary>
/// <param name="sender">The source of the event.</param>


in the output LRF appears as:

/// Handles the BeginRequest event of the context control.


Note that it happens even if I don't have preprocess_regexps defined at all.

kovidgoyal
01-18-2008, 12:05 PM
preprocess_regexps only acts on the downloaded HTML files, not on the RSS file itself. If you want to change the handling of the <description> tag in the RSS file, do two things:

set


html_description = True


If you still don't like the handling, override the process_html method in your subclass.

slav
01-18-2008, 12:43 PM
I'm not concerned about the description; my only problem is that some lines are missing from the output LRF file, but as you say I'll try to override process_html and see how it goes.

Thanks!

kovidgoyal
01-18-2008, 12:47 PM
The contents of the LRF file are taken from the <description> tag, so you should be concerned about it :)

Dominik
01-21-2008, 03:19 PM
Hi Kovid,

is it possible to use web2lrf with a full feed? For example, all Feedburner feeds have <content:encoded>-tags containing the whole article. Therefore, it is unnecessary to look for a print version of the article and preprocess the HTML.

How can I get web2lrf to use the <content:encoded> instead of the article's URL?

I tried to set the "html_description" property to true and reimplement the parse_feed function to use the <content:encoded>-tag instead of <description>. This worked, but it's complicated and it's impossible to look over the articles quickly because there is no table of contents with links to the full article.

Dominik

kovidgoyal
01-21-2008, 03:21 PM
Support for content-embedded feeds is on my TODO list. It's now added in svn, will be in the next release.

slav
01-24-2008, 06:18 AM
Support for content-embedded feeds is on my TODO list. It's now added in svn, will be in the next release.

Any idea when you'll be ready with this new release? I'd love to get my hands on the new web2lrf :D

BTW, I've seen a new version of DefaultProfile in svn - is there a way to force the existing version of web2lrf to use it?

kovidgoyal
01-24-2008, 01:23 PM
Not easily. The new release should be out soon.

slav
01-24-2008, 02:08 PM
thanks, I'll probably wait then :thumbsup:

kovidgoyal
01-24-2008, 04:10 PM
Released v0.4.34 with a GUI for adding custom profiles and support for content-embedded profiles via the class FullContentProfile.

slav
01-24-2008, 05:28 PM
:thanks:

AJ@PR
01-24-2008, 05:39 PM
Hello everyone...

New to the forums. :)

Just downloaded the software... gonna give it a go now. :)
Will update soon.
Thank you for this!

-- AJ
/



EDIT UPDATE:::
Ok, installed it.
1- Thanks! This thing is nifty!
2- Wow! I just found out the Reader has 200MB internal. =\
3- FINALLY I can edit the meta data of the files... w00t w00t!! !
4- I went to add a directory on another hdd (non C:\>), and it error'd me:
directories
Detailed traceback:
Traceback (most recent call last):
File "main.py", line 723, in do_config
AttributeError: directories

Let me know if I can help in any way!
(no programming skills)

AJ@PR
01-24-2008, 06:07 PM
^^^ Ok, no worries! :)

Everything works fine... well, "fine" being a relative term.
I love the anti-gravity sand. :D

Again, THANKS!

kovidgoyal
01-24-2008, 06:14 PM
If you can't program you can always help by doing translations/documentation. See https://libprs500.kovidgoyal.net/wiki/Development

AJ@PR
01-24-2008, 08:36 PM
If you can't program you can always help by doing translations/documentation. See https://libprs500.kovidgoyal.net/wiki/Development

Hmm.... seems I can help there.
:thumbsup:

Will contribute to the Spanish one... :)

slav
01-25-2008, 08:01 AM
I've just tried to convert a few feeds and I noticed that none of the images are in the output LRF file (is there something I need to set to get them?). Besides that, the embedded content looks great!

One more thing - how do I use this FullContentProfile from a custom user profile (using the command line)?

Thanks!

kovidgoyal
01-25-2008, 11:36 AM
Just create a custom user profile that inherits from FullContentProfile. As for images, run with the --verbose switch to find out why they aren't being included. You can copy/paste the code from the Advanced tab into a .py file and run it with web2lrf from the command line with --verbose.
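A rough sketch of such a profile (assuming FullContentProfile lives alongside DefaultProfile in libprs500.ebooks.lrf.web.profiles; the feed details are placeholders):

from libprs500.ebooks.lrf.web.profiles import FullContentProfile

class MyFullFeed(FullContentProfile):

    title = 'My Full Content Feed'

    def get_feeds(self):
        # the descriptions in these feeds already contain the full articles
        return [ ('Example feed', 'http://feeds.example.com/fullcontent') ]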

dgallina
01-25-2008, 02:19 PM
What a great update!

Thanks for all the work you put into this!

Diego

ddavtian
01-25-2008, 08:42 PM
Kovid, thanks for all your work.

Yesterday I upgraded to 0.4.34 on my laptop; libprs500 gets better with each release.

I do have a question: was anything changed in the Wall Street Journal profile? I'm a subscriber and was getting a very good LRF in 0.4.33. Now on my laptop it takes a very long time to get it, the size is much bigger, and the reader is not able to finish formatting. It formats for 15-20 minutes, then goes back to the library page. I viewed it on the laptop; it had generated 18,000+ pages, with huge fonts. During viewing, libprs500 was using 600MB of memory (running on WinXP). My desktop still runs 0.4.33, and it creates a 3-4MB book, very well formatted.

Do you have previous builds available for download? The latest one is better, but I'd love to get the WSJ back on the laptop.

Thanks again,
David

kovidgoyal
01-25-2008, 08:48 PM
Well the web2lrf subsystem had some work done on it, so perhaps that affected the WSJ profile. Unfortunately, I'm not the one that wrote that profile, and I don't have a WSJ account with which to debug it. If you're willing to PM me your account info, I could try to debug it.

Unfortunately, I don't keep previous releases around.

ddavtian
01-25-2008, 08:54 PM
If you're willing to PM me your account info, I could try to debug it.

Kovid, I just sent you a PM.

kovidgoyal
01-30-2008, 11:44 PM
Released version 0.4.35 with a fixed WSJ profile and new profiles for:

The Atlantic, The Christian Science Monitor, Reuters, The Jerusalem Post (thanks to Deputy-Dawg for the last three)

JTravers
01-31-2008, 10:45 AM
I do have a question: was anything changed in the profile of Wall Street Journal? I'm a subscriber and was getting very good lrf in 0.4.33. Now on my laptop it takes very long time to get it, size is much bigger, and reader is not able to finish formatting. It's formatting for 15-20 minutes, then gets back to the library page. I viewed it on the laptop, it had generated 18,000 + pages, with huge fonts. During the view libprs was using 600Mb of memory (running in WinXP). My desktop still runs 0.4.33, and it creates 3-4Mb book, very well formatted.

Do you have previous builds available for download? The latest one is better but I'd love to gt the WSJ back on a laptop.

David,
I created the original WSJ profile. Not sure why it stopped working for you in the 0.4.34 build. I've attached a copy of the profile that has been working for me consistently with all the builds old and new (be sure to remove the .txt extension).

Try it out and let me know if it works for you.

ddavtian
01-31-2008, 10:52 AM
Kovid, I got 0.4.35 and all my favorite profiles (WSJ, Newsweek, Economist, NYT) are working fine. I don't know what was wrong on my machine with 0.4.34, but now all is good.

Thanks a lot,
David

ddavtian
01-31-2008, 10:55 AM
David,
Try it out and let me know if it works for you.

JTravers, thanks a lot for the profile. As I mentioned, it could be something wrong on my machine. Sorry for the false alarm. The latest build worked fine again.

I love showing the reader to my colleagues with the WSJ and NYT :-) First, I demo the iLiad with technical books, then the Sony with newspapers :-)

JTravers
01-31-2008, 10:59 AM
Kovid,
I get the following error when trying to implement the FullContentProfile with the portfolio.py profile (both my own and the one bundled with 0.4.35).

Traceback (most recent call last):
File "convert_from.py", line 192, in <module>
File "convert_from.py", line 186, in main
File "convert_from.py", line 125, in process_profile
File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 100, in __init__
File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 422, in build_index
IOError: [Errno 2] No such file or directory: '/tmp/category1.html'

Do you know what's causing this?
(I'm using this under Vista.)

Thanks!

JTravers
01-31-2008, 11:06 AM
JTravers, thanks a lot for the profile. As I mentioned, it could be something wrong on my machine. Sorry if false alarm. The latest build worked fine again.

Great! Good to know everything is working for you again.

If you don't mind, I would be curious to know what the speed is like for you using the built-in WSJ profile vs. the one I attached to my previous message. The built-in profile seems to take a much longer time on my system, and I was wondering if the same applies to you. Maybe it's just a GUI vs. command line thing, though.

Thanks!

kovidgoyal
01-31-2008, 01:48 PM
Kovid,
I get the following error when trying to implement the FullContentProfile with the portfolio.py profile (both my own and the one bundled with 0.4.35).

Traceback (most recent call last):
File "convert_from.py", line 192, in <module>
File "convert_from.py", line 186, in main
File "convert_from.py", line 125, in process_profile
File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 100, in __init__
File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 422, in build_index
IOError: [Errno 2] No such file or directory: '/tmp/category1.html'

Do you know what's causing this?
(I'm using this under Vista.)

Thanks!

Oops, I left in a debugging statement. I've re-uploaded the Windows installer; re-install and you should be fine.

ddavtian
02-01-2008, 12:49 AM
If you don't mind, I would be curious to know what the speed is like for you using the built-in WSJ profile vs. the one I attached to my previous message. The built-in profile seems to take a much longer time on my system and was wondering if the same applies to you. Maybe it's just a GUI vs. command line thing, though.

Thanks!

I just tested: 36 minutes using the GUI, 5 minutes using web2lrf and your attached profile.

kovidgoyal
02-01-2008, 01:05 AM
Probably a difference in the two profiles. I just tested Newsweek; command line and GUI were 113s and 116s.

I should probably update the wsj profile :)

EDIT:
oldest_article is 3 vs. 7, which probably explains it. Also, JTravers, is that the correct print URL mapping?

JTravers
02-01-2008, 03:05 AM
Probably a difference in the two profiles. I just tested newsweek, commandline and GUI were 113s and 116s

I should probably update the wsj profile :)

EDIT:
Oldest_article is 3 vs. 7 which probably explains it. Also JTravers, is that the correct print url mapping?

Yes, that might explain it. Still, 36 minutes seems very long.

That print URL mapping has always worked for me. You could probably clean up the end of the URL too, but I've never found that to be necessary.

JTravers
02-01-2008, 03:11 AM
Oops left in a statement for debugging, I've re-uploaded the windows installer. Re-install and you should be fine.

I reinstalled and still get the same error.
Do I need to uninstall first?
Probably user error on my part. I will try again.

randcoop
02-05-2008, 07:12 PM
I've downloaded thenation.py and run web2lrf with it. It sort of works, but I can't quite get it right. The first problem is that I'm not sure about the dates that need to be inserted (one short and one long). The second (and bigger) problem is that I can't figure out where to put my login and password.

Without them, I receive notices about needing to subscribe to download some content. And most of the articles seem to come from web postings, not the actual issue.

Any help would be appreciated.

Valloric
02-11-2008, 12:31 PM
I posted user profiles for Jutarni.hr (the online version of Croatia's most popular newspaper) and USATODAY to the ticket system. I apologize if the ticket system was not the correct way of informing you about them, but it just seemed like it was the right way to do it.

I saw that ticket with all those different requests for news feeds, and if I have the time, I'll try to work through the list. I'm currently working on The New Yorker. Will add it when it's done.

If I mess up a profile, please tell me about it and I'll try to fix it.

kovidgoyal
02-11-2008, 01:48 PM
Cool, I'll add them in the next release.

Valloric
02-11-2008, 03:45 PM
Kovid, you have a terrible little bug in web2lrf... maybe not so much a bug as a design oversight...

For the last 5 hours I have been attempting to create a The New Yorker user profile, and no matter what I did, the code only retrieved TWO articles from the site... I tried everything... and then I realized what the problem was.

Your code that checks the oldest_article variable starts at the top of the feed and continues down, checking each article's date. When it finds an article older than the number in oldest_article, it stops checking subsequent articles. WELL! The RSS feeds on the TNY website are not sorted by date, but by some quasi-alphabetical sort, so when this code finds an old article at the very top of the feed (very likely), it doesn't grab the newer ones which are lower in the listing.

Please fix this so it checks each and every article in the list.

I have uploaded the The New Yorker profile with its oldest_article variable set to 90; it was the only way I could get the newer articles. When you fix the bug, fix the profile accordingly. Everything else about it works fine.
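Here is a minimal sketch of the difference (illustrative only, not the actual web2lrf code), just to show why breaking on the first stale entry drops newer articles in an unsorted feed:

import time

def keep_recent(entries, oldest_article_days):
    # entries: (title, published_timestamp) pairs, in whatever order the feed lists them
    cutoff = time.time() - oldest_article_days * 24 * 3600
    kept = []
    for title, published in entries:
        if published < cutoff:
            continue  # a `break` here is the bug: it also skips newer items further down
        kept.append(title)
    return kept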

Platapie
02-17-2008, 01:16 PM
Kovid, I've said this before, but with the Economist profile I feel the need to say it again. This program is phenomenal, particularly given its OS independence and the .deb packages and ebuilds. I'm a subscriber to the Economist and I imagine I will often use your tool rather than reading the paper edition.

Thanks again.

kovidgoyal
02-17-2008, 02:36 PM
Interesting, I know about the ebuilds, but are the deb packages being maintained as well?

JSWolf
02-17-2008, 07:22 PM
Kovid, you have a terrible little bug in web2lrf... maybe not so a bug as a design oversight...

For the last 5 hours I have been attempting to create a The New Yorker user profile, and no matter what I did, the code only retrieved TWO articles from the site... I tried everything... and then I realized what was the problem.

Your code that checks the oldest_article variable... It starts at the top of the feed and continues down, checking each article's date. When it finds an article older than the number in oldest_article, it stops checking subsequent articles. WELL! The RSS feeds on TNY website are not sorted by date, but by some quasi-alphabetical sort, so when this code finds an old article at the very top of the feed (very very likely), it doesn't grab the newer ones which are lower in the listing.

Please fix this so it checks each and every article in the list.

I have uploaded the The New Yorker profile with its oldest_article variable set to 90, it was the only way I could get the newer articles. When you fix the bug, fix the profile accordingly. Everything else about it works fine.
Please create a ticket so it can be fixed.

ddavtian
02-21-2008, 07:09 PM
Hi guys.

I'm using the WSJ profile and it works very well (thanks to JTravers for the profile).

I have a quick question: is it possible to get all the articles from a page, not from a feed? The RSS feed for "Today's Newspaper" has only 5 articles from the front page plus a few more from other sections. I'd like to get as many articles from the printed edition ("http://online.wsj.com/page/2_0133.html") as possible.

I replaced an existing link with this one, but got a blank page:
def get_feeds(self):
    return [
        (' Today\'s Newspaper - All', 'http://online.wsj.com/page/2_0133.html'),
        ## (' Today\'s Newspaper - Page One', 'http://online.wsj.com/xml/rss/3_7205.xml'),
    ]

Any advice? I want all the links from the "http://online.wsj.com/page/2_0133.html" page that have "article" in their address. I don't think I need to change the clean-up part; the current profile does all that work.

This must be a simple question for Kovid, JTravers and the others who have created their own profiles.

Thanks in advance,
David

kovidgoyal
02-21-2008, 07:13 PM
It's certainly doable, but in order to do it you have to parse the HTML from that page; see, for example, the profile for The Atlantic.

ddavtian
02-21-2008, 07:17 PM
Do you live here? :-)

I didn't see Atlantic under UserProfiles. Where can I find it?

Thanks, David

ddavtian
02-21-2008, 07:31 PM
Kovid, ignore my previous message. A quick search and I found the thread about Atlantic.

Have to search first.

kovidgoyal
02-21-2008, 07:35 PM
Email notifications :)
https://libprs500.kovidgoyal.net/browser/trunk/src/libprs500/ebooks/lrf/web/profiles/atlantic.py

ddavtian
02-22-2008, 01:48 PM
Hi Kovid and all.

I looked at the Atlantic and other profiles; it seemed straightforward to parse the WSJ page. But knowing nothing about Python doesn't help.

Now I get to the point where it finds the links and downloads (I think it downloads), then I get this error:

Traceback (most recent call last):
File "convert_from.py", line 192, in <module>
File "convert_from.py", line 186, in main
File "convert_from.py", line 125, in process_profile
File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 100, in __init__
File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 136, in build_inde
x
File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 115, in build_sub_
index
KeyError: u'date'


Here is the part that I changed:
def parse_feeds(self):
    src = self.browser.open('http://online.wsj.com/page/2_0133.html').read()
    soup = BeautifulSoup(src)

    articles = []
    for item in soup.findAll('a', attrs={'class':'bold80'}):
        url = item['href']
        url = 'http://online.wsj.com'+url.replace('/article', '/article_print')
        title = self.tag_to_string(item)
        articles.append({
            'title':title, 'url':url, 'description':''
        })

    return {'Todays Paper' : articles }


I didn't change get_browser and preprocess_regexps; they work fine in the existing profile.

Do you see anything obvious in my lines? I know there's not much info here to troubleshoot with.

I usually get one shot at running it every 2-3 hours. Because web2lrf doesn't log off from their site, the next run cannot log in for some time. How do you guys develop your profiles? Not much fun :-(

Kovid, if you have nothing better to do and have the time/desire to help me here, you have my login/password in your PM box from 2-3 weeks ago. Just add "5" at the end of the password; I had to change it at some point.

Thanks in advance,
David

kovidgoyal
02-22-2008, 02:54 PM
In the articles.append line you should have a 'date':time.time() entry.

This will give all articles the default date. If you want the correct publication date you should parse the HTML for it.

Note that you can define a cleanup function to log out. Something like:


def cleanup(self):
    self.browser.open('http://wsj.com/logout')


EDIT: Oops, it should be time.ctime(), not time.time().
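For the record, the difference between the two (plain Python, nothing profile-specific):

import time

print(time.time())   # float seconds since the epoch, e.g. 1203723000.0
print(time.ctime())  # human-readable string, e.g. 'Fri Feb 22 14:50:00 2008'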

ddavtian
02-22-2008, 03:10 PM
Thank you for quick reply!

I added the date, but I cannot test it now because of their "security" policy :-( Not many logins allowed.

I had added the same lines (from the existing profile) for logout, but I cannot get it working. After creating the output, web2lrf does not exit (doesn't return to the command prompt); it just sits there:


[INFO] convert_from.pyo:360: Converting to BBeB...
[INFO] convert_from.pyo:283: Rationalizing font sizes...
[INFO] convert_from.pyo:1754: Output written to C:\Misc\News\Wall Street Print Edition [Fri, Feb 22, 2008].lrf

At this point I have to kill it. And WSJ doesn't like the next run. Without logging in, it simply creates an empty file because no articles are found.

Thanks again for your help. I'll try again later.

David

kovidgoyal
02-22-2008, 03:17 PM
I remember the WSJ website had some redirect nastiness that prevented web2lrf from logging out; there are some posts about it earlier in this thread.

JTravers
02-22-2008, 06:47 PM
I remember the WSJ website had some redirect nastiness that prevent web2lrf from logging out, there are some post about it earlier in this thread.

Yeah, I pretty much just gave up trying to get the logout function to work. It'd be great if David was able to stumble upon a way to make it work. I wish you luck.

With the current profile, I'm very careful about logging out of the site in my web browser before running it. And then I only run it once. Kind of a pain when you're testing changes in the profile, but I usually do most testing first without my login info.

ddavtian
02-22-2008, 06:57 PM
You smarter guys couldn't fix it, and I didn't have any luck either, so I have the same logout problem.

But with Kovid's help (and using the profile from JTravers) I got my paper working. Now I'm getting all the articles from the print edition. It's not as nice as the other profiles; it simply lists all the articles in page order (A1, A2, ..., B1, ..., etc.). Their feeds do not cover all the articles from the paper. Sometimes I start reading the paper in the morning, then leave for the subway. Now I can continue reading the same article on the reader.

JTravers
02-23-2008, 01:15 AM
But with Kovid's help (and using profile from JTravers) I got my paper working. Now I'm getting all the articles from the print edition. It's not as nice as other profiles, simply lists all the articles by the page order (A1, A2..., B1, ..., etc.). Their feeds do not cover all articles from paper. Sometimes I start reading the paper in the morning, then leave for subway. Now I can continue reading the same article on the reader.

I'd love for you to post the profile, if you don't mind. I wanted to set the same kind of thing up on my own Reader but just didn't bother trying to do it since setting up feeds in web2lrf is so easy.

Thanks in advance!

ddavtian
02-23-2008, 02:47 AM
I've used your profile and only changed the parsing part to this new parse_feeds.
I'm using the run time as the date; I didn't get the correct date from the page. The size is half of the main WSJ profile (around 2MB). Feel free to improve it and post it to libprs500.

Here is the method:

def parse_feeds(self):
    src = self.browser.open('http://online.wsj.com/page/2_0133.html').read()
    soup = BeautifulSoup(src)
    issue_date = time.ctime()

    articles = []
    for item in soup.findAll('a', attrs={'class':'bold80'}):
        url = item['href']
        url = 'http://online.wsj.com'+url.replace('/article', '/article_print')
        title = self.tag_to_string(item)
        articles.append({
            'title':title, 'url':url, 'description':'', 'date':issue_date
        })

    return {'Todays Paper' : articles }

JTravers
02-25-2008, 07:20 PM
I've used your profile, only changed the parsing part to this new parse_feeds.
I'm using the run time, didn't get the correct date from the page. Size is half of main WSJ profile (around 2Mb). Feel free to improve and post to libprc.


Thanks!
I'll take a look at it when I get some free time.

kovidgoyal
03-12-2008, 05:22 PM
I'm in the process of refactoring web2lrf to make it much more powerful and easier to use (impossible, I know). Here's an example of how it is more powerful: see the attached Newsweek ebook (downloaded using multi-threading in 10 minutes).

Feedback on the formatting and anything else is appreciated.

llasram
03-12-2008, 06:34 PM
Feedback on the formatting and anything else is appreciated.

I really like the addition of the navigation panel at the beginning of each article, but do you think it would be possible for the 'next' link to come first, at least in the link-selection sequence? This might break the "traditional" order of navigation elements, but being able to skip through the articles with one button-press per article would greatly facilitate skimming (which at least for me would be the most common nav. panel use case).

kovidgoyal
03-12-2008, 07:00 PM
Yeah that's a good idea.

ddavtian
03-12-2008, 09:55 PM
This looks great.

And I like the idea for the "next" link.

Now the question: when will it be ready? :-)

kovidgoyal
03-12-2008, 10:19 PM
Well, the code for it has already been committed to svn. It just needs testing and integration. It probably won't reach the GUI for at least a couple more releases.

ddavtian
03-12-2008, 10:56 PM
Hi guys.

I'd like to get my local newspaper onto the reader but couldn't do it. I used get_feeds to get articles from the RSS page but without much luck. I'm only getting the first article, with tons of unnecessary pages (I tried different patterns but couldn't clean up the text).

If anybody has some time, please take a look at this one feed (http://feeds.contracostatimes.com/mngi/rss/CustomRssServlet/571/200819.xml).

Thanks a lot in advance,
David

bobbyco57
03-12-2008, 11:56 PM
This is really great. I did have a problem, however.

I first put the file on an SD card and was reading in "S" size mode. I pressed zoom, as I have limited vision and always need to zoom. After several seconds of the processing arrow on the screen, instead of coming back to Newsweek the reader did a reset.

I then 1) removed the file from the SD card and 2) moved the file from my library to the 505's main memory. The first time I opened the book, the 505 reset before getting to the menu. I tried a second time and it acted as it did on the SD card: I could navigate through the book in S, but the device reset when trying to redisplay after pressing zoom.

kovidgoyal
03-13-2008, 12:30 AM
Yeah, not much I can do about that; it's a bug in Sony's reader software. What you can do is use the Sony Connect software to transfer the file to your reader; all three sizes will have been pre-calculated. And when I actually release the software you can specify the base-font-size to whatever you like, so that you don't have to resize.

Deputy-Dawg
03-14-2008, 02:01 AM
Hi guys.

I'd like to get my local newspaper into the reader but couldn't do it. I used get_feeds to get articles from rss page but not much luck. I'm only getting the first article with tons of unnecessary pages (tried different patterns, couldn't clean the text).

If anybody has some time, please take a look at this one feed (http://feeds.contracostatimes.com/mngi/rss/CustomRssServlet/571/200819.xml).

Thanks a lot in advance,
David

The attached script will download the "Most Viewed" feed. I have thus far been unable to capture more than the lead article from the other feeds. There is some subtle difference in them that is eluding me.

But in any event it shows you how to clean up the file so that you get rid of the extra garbage, including the embedded "Advertisement" block.
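The general shape of that clean-up is a preprocess_regexps list like the ones used elsewhere in this thread; the pattern below is only a hypothetical illustration, the real ones are in the attached script:

import re

# Each entry is a (compiled regex, replacement function) pair; this hypothetical
# one strips an embedded "Advertisement" div before conversion.
preprocess_regexps = [
    (re.compile(r'<div[^>]*class="advertisement".*?</div>', re.DOTALL | re.IGNORECASE),
     lambda match: ''),
]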

ddavtian
03-14-2008, 10:56 AM
Deputy-Dawg, thank you!

That's a lot of cleaning; I couldn't have managed even a small part of it. I have no idea what to do about getting only one article per section, but this is already very good.

Thanks again for your help.
David

ddavtian
03-14-2008, 08:48 PM
I have no idea what to do for only one article per section but this is already very good.


The only working section for the Contra Costa Times is coming from "http://extras.mnginteractive.com/live/xsl/memv/xml/571_most_viewed_rss.xml". When I try to get feeds from the newspaper's site (http://feeds.contracostatimes.com/mngi/rss/CustomRssServlet/571/200819.xml for example), it brings back the first article only.

I tried another site with the same feeds (http://rss.mnginteractive.com/live/ContraCosta/CCN_1916854.xml), without much luck there either. Now it gets all the articles, but only the summaries.

All three sites forward to the main newspaper server for articles, but only the first one works correctly. :chinscratch:
This is out of my league anyway.

Dawg, thanks again for your help.

Kovid, I moved from 0.4.38 to 0.4.42; fetching news has become much faster. 30 minutes for the NYTimes is down to a few minutes. Same thing for the other sources.
:thanks:

Deputy-Dawg
03-14-2008, 08:54 PM
Deputy-Dawg, thank you!

It's lots of cleaning, I couldn't get even small part of it. I have no idea what to do for only one article per section but this is already very good.

Thanks again for your help.
David


David,
I think I have resolved the problem with capturing more than one article in a feed. The problem is that web2lrf sees pubdate as having a different format in the first article in the feed than in all of the other articles. What it sees as the pubdate in the first article is:

Fri, 14 Mar 2008 23:22:24 MDT or Fri, 14 2008 23:22:24 -000

While in all of the articles it sees:

3/14/2008 01:37:26 AM GMT

There are a couple of solutions (workarounds), each of which has advantages and gotchas.

The first, and easiest to implement, is to simply set use_pubdate = False, which tells the program to ignore the embedded pubdate and use the current machine time as the pubdate. This will permit capturing all of the articles in a feed, but you will have no record of when each was published.

The second is to create a pubdate_fmt which matches the format of articles two and up. Now all of the articles captured will have their appropriate pubdates, with the penalty of not capturing the first article in the feed.

I have written a script, attached to this message, with which you can test and see the results of this rather odd situation. In C_Cost_2.py there are two lines of code you are interested in:

##pubdate_fmt = '%m/%d/%Y %I:%M:%S %p %Z'
use_pubdate = False

Configured as above, it will ignore the embedded pubdate and capture all of the articles in the feed(s).

##pubdate_fmt = '%m/%d/%Y %I:%M:%S %p %Z'
##use_pubdate = False

Configured this way it will only capture the first article in a feed.

pubdate_fmt = '%m/%d/%Y %I:%M:%S %p %Z'
##use_pubdate = False

And configured this way, it will capture all of the articles except the first article in a feed.

I am not really convinced that there are two different pubdate formats in the feeds; rather, we are looking at some other artifact that is confusing the matter for web2lrf. Hopefully Kovid will chime in, tell me what is wrong with my analysis, and suggest a much more elegant fix. In the meantime, here is a solution to your problem.
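If you want to sanity-check a pubdate_fmt against a date string copied from the feed before running the whole profile, plain time.strptime will do it (illustrative only; note that %Z matching is somewhat platform-dependent):

import time

fmt = '%m/%d/%Y %I:%M:%S %p %Z'
sample = '3/14/2008 01:37:26 AM GMT'
print(time.strptime(sample, fmt))  # raises ValueError if the format does not match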

kovidgoyal
03-14-2008, 09:35 PM
I would normally, but I'm hip deep in refactoring web2lrf. Hopefully, the new improved version will just automatically parse the date correctly.

ddavtian
03-14-2008, 11:30 PM
Deputy-Dawg, thanks a lot.
It's perfectly fine to download all the articles from the feed. I added my favorite sections and got the newspaper on the reader in a few minutes.

This is a great community.

David

Rick C
03-14-2008, 11:40 PM
Has anybody written a tutorial for this web2lrf program? One geared towards those of us who are clumsy around console commands would be especially nice.
I haven't bothered much with RSS up until now, but what I would like to do is save text-based websites, links and all, onto my PRS-505 so I can read them at my leisure. Is that something I can do with this program as well?

Deputy-Dawg
03-15-2008, 01:16 PM
I would normally, but I'm hip deep in refactoring web2lrf, Hopefully, the new improved version will just automatically parse the date correctly.

And it is difficult, at times, to remember that the task is to drain the swamp when you are up to your A** in alligators.

I genuinely appreciate the fact that you are re-doing web2lrf. But I am darn curious as to how it can see two different formats for pubdate in articles in the same feed, especially since I spent most of last evening studying the News feed and for the life of me I can see no difference between the first article and the second. On the other hand, I am just a neophyte at parsing RSS feeds. Is there a document that would give me a "road map"?

kovidgoyal
03-15-2008, 01:48 PM
Looking at one of those feeds, it doesn't seem like the date formats are different. But the new infrastructure has code for auto-detecting date formats based on the RSS/Atom specifications, so hopefully you won't need to specify a pubdate format anymore.
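Roughly speaking (this is just an illustration, not the actual new code), the two date styles the specs allow can both be handled with the standard library:

import time, calendar
from email.utils import parsedate_tz, mktime_tz

rss_date = 'Fri, 14 Mar 2008 23:22:24 -0000'   # RFC 822 style, used by RSS
atom_date = '2008-03-14T23:22:24Z'             # ISO 8601 / RFC 3339 style, used by Atom

print(mktime_tz(parsedate_tz(rss_date)))                                # epoch seconds
print(calendar.timegm(time.strptime(atom_date, '%Y-%m-%dT%H:%M:%SZ')))  # epoch seconds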

kovidgoyal
03-15-2008, 08:51 PM
Here's The Atlantic with some more refinements to the news fetching code. Again, comments are welcome.

ddavtian
03-15-2008, 11:19 PM
Kovid, let me know if you need a tester for the new web2lrf.

On a related note, the Economist was working in 0.4.38, but not in 0.4.42.

kovidgoyal
03-16-2008, 02:01 AM
The Economist has been re-written for the new code. It works now. I'll post links to beta builds here once the new code is ready.

banjopicker
03-16-2008, 04:07 AM
Here's The Atlantic with some more refinements to the news fetching code. Again, comments are welcome.
The Atlantic looks good, and the navigation at the beginning of each article will be useful.

It might be more useful if the "Up one level" link were at the end of the article and not the beginning, so that when you finish reading an article you can quickly return to the index.

Also, if possible, it would be nice if the "Up one level" took you to the page of the index that links to the article containing it. Right now it seems to return you to the beginning of the index every time (for very large indexes like the Economist or NYT, the smarter link would be especially useful).

I haven't chimed in with everyone else yet to thank you for developing this tool, so thank you. I am living overseas and have relied on my 500 for the last year. Unfortunately the screen just died two weeks ago. I am anxiously awaiting my 505 to arrive next week, even more so because I have just discovered how far you have progressed with web2lrf, which should make my new reader significantly more useful.

kovidgoyal
03-16-2008, 04:20 AM
Yeah, I'll add the navbar at the bottom as well. The "Up one level" links actually do point to the individual article, but because of limitations in link resolution in the LRF format, the reader software takes you to the top of the section.

Deputy-Dawg
03-16-2008, 08:16 AM
Economist has been re-written for the new code.. It works now. I'll post links to beta builds here once the new code is ready.

Can hardly wait. Like a kid on the 23rd of December!!!!

kovidgoyal
03-18-2008, 06:42 PM
OK, the beta code is available at:

http://theory.caltech.edu/~kovid/libprs500-0.4.42.exe
http://theory.caltech.edu/~kovid/libprs500-0.4.42.dmg
http://theory.caltech.edu/~kovid/libprs500-0.4.42-py2.5.egg

You can use the new code with

feeds2lrf Newsweek
feeds2lrf Dilbert
feeds2lrf "The Atlantic"
feeds2lrf Portfolio


All the old profiles can also be fetched with feeds2lrf (though this needs testing). Run feeds2lrf --help to see the names you have to use.

You can try your own profiles with

feeds2lrf filename.py


where filename.py is the name of a file with your custom recipe in it (profiles have been renamed to recipes).

For debugging, feeds2lrf has a couple of useful options: --test and --debug

Also, you can do just the download, without conversion to LRF, with feeds2disk.

The new builtin recipes are available at http://libprs500.kovidgoyal.net/browser/trunk/src/libprs500/web/feeds/recipes

The cool new features like remove_tags, remove_tags_after, remove_tags_before, etc. are documented in the code of BasicNewsRecipe
at http://libprs500.kovidgoyal.net/browser/trunk/src/libprs500/web/feeds/news.py

The one thing that I haven't tested is custom profiles using the GUI, but all the old builtin profiles should work.

Note that this is very new code, so there are going to be bugs, especially on non-linux platforms.
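To give a flavour of a custom recipe, here's a minimal sketch (the feed URL and tag attributes are just placeholders; the full set of options is documented in the BasicNewsRecipe source linked above):

from libprs500.web.feeds.news import BasicNewsRecipe

class ExampleRecipe(BasicNewsRecipe):
    title = 'Example Paper'
    oldest_article = 7            # days
    max_articles_per_feed = 20

    # strip unwanted markup before conversion (placeholder attributes)
    remove_tags = [dict(name='div', attrs={'class': 'advertisement'})]

    feeds = [('Front Page', 'http://example.com/rss.xml')]

Save it as filename.py and run feeds2lrf filename.py as above.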

Deputy-Dawg
03-18-2008, 09:12 PM
Using the recipe for my local paper, nwa2.py, from the command line I get:

Macintosh-3:books billc$ feeds2lrf --test nwa2.py
Traceback (most recent call last):
File "libprs500/web/feeds/main.pyo", line 108, in run_recipe
File "libprs500/web/feeds/recipes/__init__.pyo", line 59, in compile_recipe
File "<string>", line 33

^
SyntaxError: invalid syntax
Traceback (most recent call last):
File "/Users/billc/Downloads/libprs500-2.app/Contents/Resources/feeds2lrf.py", line 9, in <module>
main()
File "libprs500/ebooks/lrf/feeds/convert_from.pyo", line 56, in main
RuntimeError: Fetching of recipe failed: nwa2.py
Macintosh-3:books billc$

And when run from the GUI I get:


'unicode' object has no attribute 'needs_subscription'
Detailed traceback:
Traceback (most recent call last):
File "libprs500/gui2/news.pyo", line 62, in fetch_news
AttributeError: 'unicode' object has no attribute 'needs_subscription'

kovidgoyal
03-18-2008, 09:25 PM
Oops, typo, I'll upload a fixed build in a bit.

kovidgoyal
03-18-2008, 10:01 PM
Fixed version re-uploaded. You should delete the bottom two blank lines from that profile as well.

ddavtian
03-18-2008, 10:22 PM
Kovid, I got an error when trying to install the reloaded version: The installer is corrupted or incomplete.

It's the exact same size as the previous one (from 2 hours ago), but doesn't install. This is the "exe" file for Windows. I installed the first version and tried the built-in profiles. Newsweek worked fine; I also got Dilbert and USA Today (not a good profile; articles have very small and very large fonts on the same page, and it was like that all along). Most of the profiles errored out (maybe they are already corrected in your latest build).

David

ddavtian
03-18-2008, 11:17 PM
I kept downloading the file and the third time it went fine; sorry for the false alarm.

It installed correctly. I tried the NY Times (command line) and Washington Post (GUI); both errored out.

NY Times:
Converting to BBeB...
Processing index.html
Parsing HTML...
Converting to BBeB...
Traceback (most recent call last):
File "convert_from.py", line 71, in <module>
File "convert_from.py", line 67, in main
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 1799, in process_file
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 275, in __init__
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 383, in add_file
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 495, in parse_file
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 698, in process_children
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 1665, in parse_tag
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 698, in process_children
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 1665, in parse_tag
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 698, in process_children
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 1622, in parse_tag
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 698, in process_children
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 1622, in parse_tag
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 698, in process_children
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 1622, in parse_tag
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 698, in process_children
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 1585, in parse_tag
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 1309, in process_block
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 1048, in block_properties
File "libprs500\ebooks\lrf\html\convert_from.pyo", line 1042, in get
IndexError: list assignment index out of range


Washington Post:
TypeError: coercing to Unicode: need string or buffer, builtin_function_or_method found
Failed to perform job: Fetch news from Washington Post
Detailed traceback:
Traceback (most recent call last):
File "parallel.py", line 154, in run_job
File "libprs500\ebooks\lrf\feeds\convert_from.pyo", line 52, in main
File "libprs500\web\feeds\main.pyo", line 140, in run_recipe
File "libprs500\web\feeds\news.pyo", line 386, in download
TypeError: coercing to Unicode: need string or buffer, builtin_function_or_method found

banjopicker
03-18-2008, 11:38 PM
Downloaded the Economist successfully. Nav bar at the top of each article looks and works great.

Two quick observations:
1. On the last article of a feed (in a document that has many feeds), no "next" option is available in the nav bar, requiring the user to scroll through the article to get to the first article of the next feed.

2. To be consistent with the overall navigation, the index for each feed should have an "Up one level" link so the user can return to the main index of feeds from a sub-index.

I ran this using the Windows GUI. It seemed to create the .lrf more quickly than it has in the past (or it might just be my connection). Unfortunately, the file could not be opened in your lrf viewer, but opened up fine in Sony's. Still waiting for my 505 to test it on.

I think your refactoring is going to pay off--the additional methods and debug tools should cut down significantly on the requests for you to look at individual recipes. Thanks.

Deputy-Dawg
03-19-2008, 12:17 AM
Kovid,
I removed the two blank lines from nwa2.py (they had already been removed from the code I placed in the Custom News Sources) and still no joy. When I run from the GUI I get the following:

'unicode' object has no attribute 'needs_subscription'
Detailed traceback:
Traceback (most recent call last):
File "libprs500/gui2/news.pyo", line 62, in fetch_news
AttributeError: 'unicode' object has no attribute 'needs_subscription'

When I run from the command line with the --test switch the program simply hangs. If I use the --verbose switch I get:

Macintosh-3:books billc$ feeds2lrf --verbose nwa2.py
Fetching feeds...
0% [---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Starting download [1 thread(s)]... WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

ERROR: Failed to download article: Ambulance Agreement Passes Fayetteville from .prt

2% [=====----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Ambulance Agreement Passes Fayetteville ERROR: Failed to download article: Siloam Springs Passes Comprehensive Plan from .prt

4% [==========-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Siloam Springs Passes Comprehensive Plan ERROR: Failed to download article: Commission Approves Concept Plan from .prt

6% [===============------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Commission Approves Concept Plan ERROR: Failed to download article: Lesbian Doctor Married In Boston Seeks Annulment in Missouri from .prt

8% [====================-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Lesbian Doctor Married In Boston Seeks Annulment in Missouri ERROR: Failed to download article: Lawmakers Endorse Severance Tax Hike from .prt

11% [=========================--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Lawmakers Endorse Severance Tax Hike ERROR: Failed to download article: Opposition Turns Into Support from .prt

13% [==============================---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Opposition Turns Into Support ERROR: Failed to download article: Sheriff Candidates Hit Common Themes from .prt

15% [===================================----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Sheriff Candidates Hit Common Themes ERROR: Failed to download article: Alliance Endorses Treatment Over Incarceration In Methamphetamine Cases from .prt

17% [=========================================----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Alliance Endorses Treatment Over Incarceration In Methamphetamine Cases ERROR: Failed to download article: Inmates To Appear Before State Claims Commission from .prt

20% [==============================================-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Inmates To Appear Before State Claims Commission ERROR: Failed to download article: National Group Praises Arkansas Mental Health Care Data System from .prt

22% [================================================== =------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: National Group Praises Arkansas Mental Health Care Data System ERROR: Failed to download article: Groups Plan Pre-Election Debates, Forums from .prt

24% [================================================== ======-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Groups Plan Pre-Election Debates, Forums WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

ERROR: Failed to download article: Nucor Companies Give More Than $1 Million To Scholarship Fund from .prt

26% [================================================== ===========--------------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Nucor Companies Give More Than $1 Million To Scholarship Fund ERROR: Failed to download article: White County Brides, Grooms Exchange Vows At Historic Residence Converted To Wedding Chapel from .prt

28% [================================================== ================---------------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: White County Brides, Grooms Exchange Vows At Historic Residence Converted To Wedding Chapel ERROR: Failed to download article: Two Die, One Lives In Possible Murder-Suicide from .prt

31% [================================================== =====================----------------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Two Die, One Lives In Possible Murder-Suicide ERROR: Failed to download article: West Fork Man Missing After Rushing Creek Sweeps Away Truck from .prt

33% [================================================== ===========================----------------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: West Fork Man Missing After Rushing Creek Sweeps Away Truck ERROR: Failed to download article: Rain Brings Flooding In Parts Of Arkansas from .prt

35% [================================================== ================================-----------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Rain Brings Flooding In Parts Of Arkansas ERROR: Failed to download article: Farmers Watch Weather As Planting Season Approaches from .prt

37% [================================================== =====================================------------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Farmers Watch Weather As Planting Season Approaches ERROR: Failed to download article: Public Meetings from .prt

40% [================================================== ==========================================-------------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Public Meetings ERROR: Failed to download article: Springdale Plans To Raze Canning Company Buildings from .prt

42% [================================================== ===============================================--------------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Springdale Plans To Raze Canning Company Buildings ERROR: Failed to download article: Payday Lenders Must Shut Down Or Face Lawsuits, Attorney General Says from .prt

44% [================================================== ================================================== ==---------------------------------------------------------------------------------------------------------------------------------]
Article download failed: Payday Lenders Must Shut Down Or Face Lawsuits, Attorney General Says ERROR: Failed to download article: Registration Under Way For Fincher Run from .prt

46% [================================================== ================================================== =======----------------------------------------------------------------------------------------------------------------------------]
Article download failed: Registration Under Way For Fincher Run ERROR: Failed to download article: Dance Company Receives Honors from .prt

48% [================================================== ================================================== ============-----------------------------------------------------------------------------------------------------------------------]
Article download failed: Dance Company Receives Honors ERROR: Failed to download article: Card Results from .prt

51% [================================================== ================================================== ==================-----------------------------------------------------------------------------------------------------------------]
Article download failed: Card Results WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

ERROR: Failed to download article: Military Briefs from .prt

53% [================================================== ================================================== =======================------------------------------------------------------------------------------------------------------------]
Article download failed: Military Briefs ERROR: Failed to download article: Student Chosen For State Geographic Bee from .prt

55% [================================================== ================================================== ============================-------------------------------------------------------------------------------------------------------]
Article download failed: Student Chosen For State Geographic Bee WARNING: Could not fetch link .prt

ERROR: Failed to download article: Food Bank Participating In Challenge from .prt

57% [================================================== ================================================== =================================--------------------------------------------------------------------------------------------------]
Article download failed: Food Bank Participating In Challenge ERROR: Failed to download article: Walton Arts Centers Celebrates Its 100% Schools Program with Crayola from .prt

60% [================================================== ================================================== ======================================---------------------------------------------------------------------------------------------]
Article download failed: Walton Arts Centers Celebrates Its 100% Schools Program with Crayola ERROR: Failed to download article: Local Notes from .prt

62% [================================================== ================================================== ===========================================----------------------------------------------------------------------------------------]
Article download failed: Local Notes WARNING: Could not fetch link .prt

ERROR: Failed to download article: Painter Wins Miss Lakes Of The Northwest Crown from .prt

64% [================================================== ================================================== ================================================-----------------------------------------------------------------------------------]
Article download failed: Painter Wins Miss Lakes Of The Northwest Crown ERROR: Failed to download article: Students Receive Thousands from Northwest Medical Center - Bentonville Auxiliary from .prt

66% [================================================== ================================================== ================================================== ====-----------------------------------------------------------------------------]
Article download failed: Students Receive Thousands from Northwest Medical Center - Bentonville Auxiliary ERROR: Failed to download article: Education Briefs from .prt

68% [================================================== ================================================== ================================================== =========------------------------------------------------------------------------]
Article download failed: Education Briefs ERROR: Failed to download article: Briefly from .prt

71% [================================================== ================================================== ================================================== ==============-------------------------------------------------------------------]
Article download failed: Briefly ERROR: Failed to download article: Williams Will Celebrate 100th Birthday from .prt

73% [================================================== ================================================== ================================================== ===================--------------------------------------------------------------]
Article download failed: Williams Will Celebrate 100th Birthday ERROR: Failed to download article: Mentoring Child Encourages Desired Behaviors from .prt

75% [================================================== ================================================== ================================================== ========================---------------------------------------------------------]
Article download failed: Mentoring Child Encourages Desired Behaviors ERROR: Failed to download article: Upcoming Events from .prt

77% [================================================== ================================================== ================================================== =============================----------------------------------------------------]
Article download failed: Upcoming Events ERROR: Failed to download article: History Day from .prt

80% [================================================== ================================================== ================================================== ==================================-----------------------------------------------]
Article download failed: History Day WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

WARNING: Could not fetch link .prt

ERROR: Failed to download article: Wades Donate $10,000 For Healing Path Project from .prt

82% [================================================== ================================================== ================================================== =======================================------------------------------------------]
Article download failed: Wades Donate $10,000 For Healing Path Project ERROR: Failed to download article: 'Mentoring Champion' from .prt

84% [================================================== ================================================== ================================================== =============================================------------------------------------]
Article download failed: 'Mentoring Champion' ERROR: Failed to download article: Owl Team Takes Gold Medal from .prt

86% [================================================== ================================================== ================================================== ==================================================-------------------------------]
Article download failed: Owl Team Takes Gold Medal ERROR: Failed to download article: Mini Grand Prix Slated For April 5 from .prt

88% [================================================== ================================================== ================================================== ================================================== =====--------------------------]
Article download failed: Mini Grand Prix Slated For April 5 ERROR: Failed to download article: Chapman Gets 'The Word Out' In Local Concert from .prt

91% [================================================== ================================================== ================================================== ================================================== ==========---------------------]
Article download failed: Chapman Gets 'The Word Out' In Local Concert ERROR: Failed to download article: Religion Notes from .prt

93% [================================================== ================================================== ================================================== ================================================== ===============----------------]
Article download failed: Religion Notes ERROR: Failed to download article: MUSINGS from .prt

95% [================================================== ================================================== ================================================== ================================================== ====================-----------]
Article download failed: MUSINGS ERROR: Failed to download article: The Road To Calvary from .prt

97% [================================================== ================================================== ================================================== ================================================== =========================------]
Article download failed: The Road To Calvary ERROR: Failed to download article: Laity, Clergy Drawn To Conference On Caring For Earth from .prt

100% [================================================== ================================================== ================================================== ================================================== ===============================]
Download finished WARNING: Failed to download the following articles:

Traceback (most recent call last):
File "/Users/billc/Downloads/libprs500-1.app/Contents/Resources/feeds2lrf.py", line 9, in <module>
main()
File "libprs500/ebooks/lrf/feeds/convert_from.pyo", line 52, in main
File "libprs500/web/feeds/main.pyo", line 140, in run_recipe
File "libprs500/web/feeds/news.pyo", line 386, in download
TypeError: coercing to Unicode: need string or buffer, builtin_function_or_method found
Macintosh-3:books billc$

I ran the recipe in the older version of libprs500 and I had no problems. It would appear that, among other things, it is not getting the URL of the print file correctly. If I need to change the code, that's not a problem, but I haven't a clue as to where to begin.

kovidgoyal
03-19-2008, 12:36 AM
And it works with

web2lrf --user-profile nwn2.py

?

Deputy-Dawg
03-19-2008, 01:17 AM
And it works with

web2lrf --user-profile nwn2.py

?

My bad! I had made a change to the code in the Custom News Source in the GUI a couple of weeks ago and had not changed the command-line version to match. But, yes, that code will run with web2lrf; it just won't produce any data in the linked files. I have attached the corrected code to this message. It now produces useful output when run from the command line. But I still get the following from the GUI.

'unicode' object has no attribute 'needs_subscription'
Detailed traceback:
Traceback (most recent call last):
File "libprs500/gui2/news.pyo", line 62, in fetch_news
AttributeError: 'unicode' object has no attribute 'needs_subscription'
'unicode' object has no attribute 'needs_subscription'
Detailed traceback:
Traceback (most recent call last):
File "libprs500/gui2/news.pyo", line 62, in fetch_news
AttributeError: 'unicode' object has no attribute 'needs_subscription'

Which is sort of odd, since there is no sign-in/password requirement on this RSS feed.

kovidgoyal
03-19-2008, 01:24 AM
Yeah, custom profiles in the GUI are broken. I have to redesign the custom news source dialog anyway to take into account the capabilities of the new code.

Deputy-Dawg
03-19-2008, 06:48 AM
Kovid,
I should never try to work at 4 in the morning, particularly on days when I am scheduled for hemodialysis. That being said:

The New York Times feed is broken when using feeds2lrf, both in the GUI and from the command line. When run from the command line I get the following error messages:

'The New York Times'
Fetching feeds...
1% [----------------------------------------------------------------------]
...
100% [======================================================================]
Download finished
Generating LRF...
Processing index.html
Parsing HTML...
Converting to BBeB...
[the three lines above repeat once for each downloaded article]
An error occurred while processing a table: list index out of range. Ignoring table markup.
An error occurred while processing a table: list index out of range. Ignoring table markup.
Traceback (most recent call last):
File "/Users/billc/Downloads/libprs500-1.app/Contents/Resources/feeds2lrf.py", line 9, in <module>
main()
File "libprs500/ebooks/lrf/feeds/convert_from.pyo", line 67, in main
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 1799, in process_file
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 275, in __init__
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 383, in add_file
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 495, in parse_file
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 698, in process_children
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 1665, in parse_tag
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 698, in process_children
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 1665, in parse_tag
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 698, in process_children
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 1622, in parse_tag
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 698, in process_children
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 1622, in parse_tag
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 698, in process_children
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 1622, in parse_tag
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 698, in process_children
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 1585, in parse_tag
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 1309, in process_block
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 1048, in block_properties
File "libprs500/ebooks/lrf/html/convert_from.pyo", line 1042, in get
IndexError: list assignment index out of range
Macintosh-3:books billc$


I have no idea what is going on here.

Deputy-Dawg
03-19-2008, 04:11 PM
Kovid,
Just a tickler - I don't know how you monitor the fora.

kovidgoyal
03-19-2008, 04:15 PM
Not to worry, a bug-fixed re-release of the beta will be heading out either later today or tomorrow.

kovidgoyal
03-19-2008, 07:04 PM
New builds are up with a bunch of bug fixes and a new NYTimes recipe that downloads only the current day's paper in under 3 minutes (on a fast connection/computer).

Note that using custom profiles from the GUI is still broken.

Deputy-Dawg
03-19-2008, 09:10 PM
Kovid,
Downloaded the latest beta, and The New York Times does indeed download. But I am unable to open the file in the libprs500 viewer. The progress bar goes to about 90% completion and then the program simply quits loading, or at least there is no further advancement of the progress bar, though it does still respond to input from the mouse and keyboard. The file does load on the PRS-505. Probably not a fatal error, but one which might tend to produce a support headache.

BTW I get the same result whether the output file is created in the GUI or from the command line.

kovidgoyal
03-19-2008, 09:26 PM
Hmm works for me. Did you download it with a correct username and password?

Deputy-Dawg
03-19-2008, 10:01 PM
Yes, and the file does work correctly on the Sony Reader. It is only in the reader built into libprs500 where it hangs. Weird!

Even stranger: I went back and loaded the last released version of libprs500 (0.4.42) and its reader loaded the file just fine. Then, to be sure that the problem was 'real', I tried it in the beta, and sure enough it hung. I tried to visually capture the versions of the reader, but they simply load too quickly for me to get the version numbers.

kovidgoyal
03-20-2008, 06:56 PM
Version 0.4.43 should hit the servers in a bit, with the updated code used throughout. There are probably still bugs, though I've tested most of the commonly used profiles.

touser
03-20-2008, 07:38 PM
Call me stupid, but I can't find the beta download directory on your website. Is there one, or an SVN repository? Also, is there a changelog posted anywhere? Thanks for creating such great software; without libprs500 I would have returned my Sony Reader to the store, and now I use it every day.

Deputy-Dawg
03-20-2008, 07:42 PM
Kovid,
I just downloaded 0.4.43, and a quick run shows that all of the problems I had noted have been resolved.

There is an issue, though it would not surprise me if there is no good fix: the 'headlines' in USA Today are huge. The body type in the articles is fine. In fact, setting the base font size wouldn't work, because if you set it small enough to fix the headlines the body text would be much too small. Seems to me it would be a case for a regexp that sets the font size smaller if it is larger than X, but I haven't a clue as to how to write such a conditional.

And while I am about it, how difficult would it be to look at the system clock and gather a group of feeds at a set time? To be able to do that would be a godsend on the days that I have hemodialysis. Imagine: when I get up at 4:30, the morning paper would be there on my laptop, right alongside my coffee from the automatic coffee pot.

I can dream can't I?

kovidgoyal
03-20-2008, 07:42 PM
Use version 0.4.43, it's newer than the betas. And there's a link to the changelog next to the download link on the libprs500 website.

kovidgoyal
03-20-2008, 07:46 PM
@DD

Actually, using the new "recipes" framework you don't even need regexps to fix USA Today; you can use the new extra_css property to override the font size for headings.

As for a download scheduler, it's on my TODO list, but my TODO list is so long that I really can't make any sort of commitment as to when it will get done.

In the meantime, you can always use cron on an OS X system with the command-line tools to schedule automatic downloads.

kovidgoyal
03-20-2008, 08:02 PM
Here's the USAToday recipe:


from libprs500.web.feeds.news import BasicNewsRecipe
import re

class USAToday(BasicNewsRecipe):

    title = 'USA Today'
    timefmt = ' [%d %b %Y]'
    max_articles_per_feed = 20
    no_stylesheets = True
    extra_css = '''
        .inside-head { font: x-large bold }
        .inside-head2 { font: x-large bold }
        .inside-head3 { font: x-large bold }
        .byLine { font: large }
        '''
    html2lrf_options = ['--ignore-tables']

    preprocess_regexps = [
        (re.compile(r'<BODY.*?<!--Article Goes Here-->', re.IGNORECASE | re.DOTALL), lambda match : '<BODY>'),
        (re.compile(r'<!--Article End-->.*?</BODY>', re.IGNORECASE | re.DOTALL), lambda match : '</BODY>'),
        ]

    feeds = [
        ('Top Headlines', 'http://rssfeeds.usatoday.com/usatoday-NewsTopStories'),
        ('Sport Headlines', 'http://rssfeeds.usatoday.com/UsatodaycomSports-TopStories'),
        ('Tech Headlines', 'http://rssfeeds.usatoday.com/usatoday-TechTopStories'),
        ('Travel Headlines', 'http://rssfeeds.usatoday.com/UsatodaycomTravel-TopStories'),
        ('Money Headlines', 'http://rssfeeds.usatoday.com/UsatodaycomMoney-TopStories'),
        ('Entertainment Headlines', 'http://rssfeeds.usatoday.com/usatoday-LifeTopStories'),
        ('Weather Headlines', 'http://rssfeeds.usatoday.com/usatoday-WeatherTopStories'),
        ('Most Popular', 'http://rssfeeds.usatoday.com/Usatoday-MostViewedArticles'),
        ]

    ## Getting the print version

    def print_version(self, url):
        return 'http://www.printthis.clickability.com/pt/printThis?clickMap=printThis&fb=Y&url=' + url
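
To use it, save it to a .py file and pass that file to feeds2lrf; the file name and paths below are only an example:

feeds2lrf --output=/Users/billc/Desktop/usatoday.lrf /Users/billc/Desktop/books/usatoday.py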

ddavtian
03-21-2008, 09:59 PM
Kovid, it's me again, with the profile for WSJ.

With your help I got the profile working with web2lrf. When trying feeds2lrf, I get an error message about "time":

File "<string>", line 29, in WallStreetJournalPaper
NameError: name 'time' is not defined
Traceback (most recent call last):
File "convert_from.py", line 71, in <module>
File "convert_from.py", line 56, in main

This is line 29 from the profile:
issue_date = time.ctime()

Later "issue_date" is used here:
articles.append({
    'title':title, 'url':url, 'description':'', 'date':issue_date
    })


Can you tell me what to put here for it to work again?

kovidgoyal
03-21-2008, 10:18 PM
Add

import time

at the top

ddavtian
03-21-2008, 10:24 PM
"import time" is already there and it works with web2lrf.

Deputy-Dawg
03-22-2008, 12:28 AM
Kovid,
Thanks for the fixed recipe for USAToday. It looks much better to these tired eyes. Also thanks for the tip about cron; I did not realize such a utility was available on the Mac. Maybe it's time to take a look under the hood.

Searching the web I found a GUI for cron called CronniX 3.0.2. When you run it, it gives you the ability to create a custom crontab file.

When I run the following command from the bash terminal:

feeds2lrf --output=/users/billc/desktop/news.lrf desktop/books/nwa2.py

I get an output file called news.lrf on my desktop. I then deleted the file, put the same command into CronniX, and used the 'Run Now' command (under the 'Task' drop-down menu), but all I got was:

Running command
feeds2lrf --output=/users/billc/desktop/news.lrf desktop/books/nwa2.py
The output will appear below when the command has finished executing
Fetching feeds...

Then the program goes off into la-la land and produces no output. Clearly there is something wrong! Is there one of those cryptic commands like sh that should precede the main command? Or what?

kovidgoyal
03-22-2008, 02:07 AM
"import time" is already there and it works with web2lrf.

Attach it here.

kovidgoyal
03-22-2008, 02:08 AM
Kovid,
Thanks for the fixed recipe for USAToday. It looks much better to these tired eyes. Also thanks for the tip about cron; I did not realize such a utility was available on the Mac. Maybe it's time to take a look under the hood.

Searching the web I found a GUI for cron called CronniX 3.0.2. When you run it, it gives you the ability to create a custom crontab file.

When I run the following command from the bash terminal:

feeds2lrf --output=/users/billc/desktop/news.lrf desktop/books/nwa2.py

I get an output file called news.lrf on my desktop. I then deleted the file, put the same command into CronniX, and used the 'Run Now' command (under the 'Task' drop-down menu), but all I got was:

Running command
feeds2lrf --output=/users/billc/desktop/news.lrf desktop/books/nwa2.py
The output will appear below when the command has finished executing
Fetching feeds...

Then the program goes off into la-la land and produces no output. Clearly there is something wrong! Is there one of those cryptic commands like sh that should precede the main command? Or what?

Use an absolute path to nwa2.py
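
For the crontab entry itself, something like this should work (a sketch only: adjust the time, use absolute paths throughout, and make sure cron's PATH can find feeds2lrf, or give its full path):

30 4 * * * feeds2lrf --output=/Users/billc/Desktop/news.lrf /Users/billc/Desktop/books/nwa2.py

That would run every day at 4:30 AM.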

ddavtian
03-22-2008, 02:55 AM
Attach it here.
Attached, as a txt file.

Thanks in advance.

kovidgoyal
03-22-2008, 06:19 AM
Move the import statements to just above where the imported modules are used. A proper fix will be in the next release. Why aren't you using the built-in Wall Street Journal?
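
For example, instead of relying on the module-level import, do the import inside the method that uses it. A minimal plain-Python sketch (the function name is only illustrative, not part of your profile):

def get_issue_date():
    import time # local import: looked up only when the function actually runs
    return time.ctime()

print(get_issue_date())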

ddavtian
03-22-2008, 01:03 PM
Thanks Kovid.

It helped; now it runs. But it didn't get any articles (it jumped from "0% Starting download" to "100% Feeds downloaded"). I'll try to fix it myself.

The built-in WSJ recipe is good, but it doesn't have many articles from the paper edition. This one was getting all the articles from the paper.

David

kovidgoyal
03-22-2008, 01:11 PM
You can still run it using web2lrf instead of feeds2lrf.

Deputy-Dawg
03-22-2008, 03:26 PM
Boy you talk about being invincibly ignorant. I knew enough to use the absolute path to the saved file but it never occurred to me that you should use the absolute path to the recipe file. All of which is to say it works! Thanks.

Do you have any idea what the publication date is for the current edition of the Atlantic Monthly? I would like to set up a command in Crontab to capture it each month.

Deputy-Dawg
03-22-2008, 05:01 PM
Kovid,
I downloaded the Atlantic Monthly recipe from your website with the intention of modifying it to capture the daily feed from them. I modified the recipe as follows:


#!/usr/bin/env python

## Copyright (C) 2008 Kovid Goyal kovid@kovidgoyal.net
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2 of the License, or
## (at your option) any later version.
##
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
##
## You should have received a copy of the GNU General Public License along
## with this program; if not, write to the Free Software Foundation, Inc.,
## 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
'''
thecurrent.theatlantic.com
'''

from libprs500.web.feeds.news import BasicNewsRecipe
from libprs500.ebooks.BeautifulSoup import BeautifulSoup

class TheAtlantic(BasicNewsRecipe):

    title = 'TheCurrent.The Atlantic'
    INDEX = 'http://thecurrent.theatlantic.com/'

    remove_tags_before = dict(name='div', id='storytop')
    remove_tags = [dict(name='div', id='seealso')]
    extra_css = '#bodytext {line-height: 1}'

    def parse_index(self):
        articles = []

        src = self.browser.open(self.INDEX).read()
        soup = BeautifulSoup(src, convertEntities=BeautifulSoup.HTML_ENTITIES)

        issue = soup.find('span', attrs={'class':'issue'})
        if issue:
            self.timefmt = ' [%s]'%self.tag_to_string(issue).rpartition('|')[-1].strip().replace('/', '-')

        for item in soup.findAll('div', attrs={'class':'item'}):
            a = item.find('a')
            if a and a.has_key('href'):
                url = a['href']
                url = 'http://www.theatlantic.com/'+url.replace('/doc', 'doc/print')
                title = self.tag_to_string(a)
                byline = item.find(attrs={'class':'byline'})
                date = self.tag_to_string(byline) if byline else ''
                description = ''
                articles.append({
                    'title':title,
                    'date':date,
                    'url':url,
                    'description':description
                    })


        return {'Daily Issue' : articles }



When I run it I get:

Macintosh-3:books billc$ feeds2lrf atlantic-1.py
Fetching feeds...
0% [----------------------------------------------------------------------]
Fetching feeds... Traceback (most recent call last):
File "/Users/billc/Downloads/libprs500-1.app/Contents/Resources/feeds2lrf.py", line 9, in <module>
main()
File "libprs500/ebooks/lrf/feeds/convert_from.pyo", line 52, in main
File "libprs500/web/feeds/main.pyo", line 141, in run_recipe
File "libprs500/web/feeds/news.pyo", line 411, in download
File "libprs500/web/feeds/news.pyo", line 514, in build_index
File "<string>", line 37, in parse_index
NameError: global name 'BeautifulSoup' is not defined
Macintosh-3:books billc$


But it seems to me that 'BeautifulSoup' is defined on line 22, i.e.

from libprs500.ebooks.BeautifulSoup import BeautifulSoup

What have I done wrong?

I went back and ran the unmodified recipe in terminal mode and got the same result.

kovidgoyal
03-22-2008, 06:42 PM
Boy you talk about being invincibly ignorant. I knew enough to use the absolute path to the saved file but it never occurred to me that you should use the absolute path to the recipe file. All of which is to say it works! Thanks.

Do you have any idea what the publication date is for the current edition of the Atlantic Monthly? I would like to set up a command in Crontab to capture it each month.

Use the pseudo-target @monthly in cron and the download will run once a month.
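
For example, a crontab entry along these lines (the paths are placeholders; point them at wherever you keep the recipe and want the output):

@monthly feeds2lrf --output=/Users/billc/Desktop/atlantic.lrf /Users/billc/Desktop/books/atlantic-1.py

@monthly is shorthand for "0 0 1 * *", i.e. midnight on the first of every month.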