Thread: web2lrf
03-14-2008, 07:54 PM   #202
Deputy-Dawg
Quote:
Originally Posted by ddavtian
Deputy-Dawg, thank you!

It's lots of cleaning, I couldn't get even small part of it. I have no idea what to do for only one article per section but this is already very good.

Thanks again for your help.
David

David,
I think I have resolved the problem with capturing more than one article in a feed. The problem is that web2lrf sees pubdate as having a different format in the first article in the feed than the format of pubdate in all of the other articles. What it sees as the pubdate in the first article is:

Fri, 14 Mar 2008 23:22:24 MDT or Fri, 14 2008 23:22:24 -000

While in all of the other articles it sees:

3/14/2008 01:37:26 AM GMT
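
To see the mismatch outside web2lrf, here is a small sketch using Python's standard datetime.strptime. The helper try_parse and the constant names are made up for illustration; only the two date strings and the format string come from this post. It shows that a format written for the later articles cannot also parse the first article's date.

```python
from datetime import datetime

# Date strings as reported above.
FIRST_ARTICLE = "Fri, 14 Mar 2008 23:22:24 MDT"
OTHER_ARTICLES = "3/14/2008 01:37:26 AM GMT"

# Format written to match the later articles.
FMT = "%m/%d/%Y %I:%M:%S %p %Z"


def try_parse(date_string, fmt):
    """Return a datetime if date_string matches fmt, otherwise None.

    Hypothetical helper for illustration; not part of web2lrf.
    """
    try:
        return datetime.strptime(date_string, fmt)
    except ValueError:
        return None


if __name__ == "__main__":
    for label, value in (("first", FIRST_ARTICLE), ("others", OTHER_ARTICLES)):
        print(label, "->", try_parse(value, FMT))
```

Whichever single format you pick, one of the two date styles fails to parse, which is why each workaround loses something.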

There are a couple of solutions (workarounds), each of which has advantages and gotchas.

The first, and easiest to implement, is to set use_pubdate = False, which tells the program to ignore the embedded pubdate and use the current machine time as the pubdate. This will permit capturing all of the articles in a feed, but you will have no record of when each one was published.

The second is to create a pubdate_fmt that matches the format of articles two and up. All of the captured articles will then have their proper pubdates, at the cost of not capturing the first article in the feed.

I have written a script and attached it to this message so that you can test and see the results of this rather odd situation. In C_Costa_2.py there are two lines of code you are interested in:

Code:
    ##pubdate_fmt = '%m/%d/%Y %I:%M:%S %p %Z'
    use_pubdate = False
Configured as above, it will ignore the embedded pubdate and capture all of the articles in the feed(s).

Code:
    ##pubdate_fmt = '%m/%d/%Y %I:%M:%S %p %Z'
    ##use_pubdate = False
Configured this way, it will capture only the first article in a feed.

Code:
    pubdate_fmt = '%m/%d/%Y %I:%M:%S %p %Z'
    ##use_pubdate = False
And configured this way, it will capture all of the articles except the first one in a feed.
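
The three configurations can be summed up in a toy model. This is only a sketch of the behavior described in this post, not web2lrf's actual code: it assumes an article is skipped whenever its pubdate cannot be parsed, and it uses email.utils.parsedate_tz as a stand-in for whatever default date parsing web2lrf does. The function names and the sample feed list are hypothetical.

```python
from datetime import datetime
from email.utils import parsedate_tz

# Sample feed: the first article uses one date style, the rest another.
# The second string is simply repeated to stand in for "all other articles".
FEED_PUBDATES = [
    "Fri, 14 Mar 2008 23:22:24 MDT",
    "3/14/2008 01:37:26 AM GMT",
    "3/14/2008 01:37:26 AM GMT",
]


def parse_pubdate(date_string, pubdate_fmt=None):
    """Parse a pubdate, returning None when it does not match.

    With no pubdate_fmt, fall back to RFC 822 style parsing
    (an assumption about the default, not confirmed for web2lrf).
    """
    if pubdate_fmt is None:
        parsed = parsedate_tz(date_string)
        return None if parsed is None else datetime(*parsed[:6])
    try:
        return datetime.strptime(date_string, pubdate_fmt)
    except ValueError:
        return None


def captured_articles(pubdates, use_pubdate=True, pubdate_fmt=None):
    """Return the indexes of the articles that would be captured.

    Assumption: an article whose pubdate cannot be parsed is skipped.
    """
    if not use_pubdate:
        # Embedded dates are ignored, so nothing can fail to parse.
        return list(range(len(pubdates)))
    return [
        index
        for index, value in enumerate(pubdates)
        if parse_pubdate(value, pubdate_fmt) is not None
    ]


if __name__ == "__main__":
    fmt = "%m/%d/%Y %I:%M:%S %p %Z"
    print(captured_articles(FEED_PUBDATES, use_pubdate=False))
    print(captured_articles(FEED_PUBDATES))
    print(captured_articles(FEED_PUBDATES, pubdate_fmt=fmt))
```

Under those assumptions the model reproduces what I see: everything with use_pubdate = False, only the first article with the defaults, and everything but the first article with pubdate_fmt set.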

I am not really convinced that there are two different pubdate formats in the feeds; I suspect we are looking at some other artifact that is confusing the matter for web2lrf. Hopefully Kovid will chime in, tell me what is wrong with my analysis, and suggest a much more elegant fix. At least I hope so. In the meantime, here is a solution to your problem.
Attached Files
File Type: zip C_Costa_2.py.zip (1.2 KB, 271 views)

Last edited by Deputy-Dawg; 03-14-2008 at 07:56 PM. Reason: To mark code statements