MobileRead Forums > E-Book Readers > Sony Reader
Old 01-23-2008, 09:33 AM   #1
alexxxm
Addict
 
Posts: 223
Karma: 356
Join Date: Aug 2007
Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ...
python coding...

I am trying to write a simple applet for web2lrf/libprs500 to download the magazine The Atlantic (http://www.theatlantic.com/) - it has been free since today...

damn, I don't know Python, so I have a couple of problems...

1) under http://www.theatlantic.com/doc/current, all the links are relative (e.g. <a href="/doc/200801/millbank">), so I began with:


preprocess_regexps = [(re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in
    [
        # capture the tag prefix so that group(1) exists for the lambda to reuse
        (r'(<a href=")/', lambda match: match.group(1) + 'http://www.theatlantic.com/'),
    ]
]


... is it right?

2) At the end of every run I get this error (freely translated by me from the Italian Windows version!)

Exception exceptions.WindowsError: WindowsError(32, 'Impossible to access the file. File is used by another process') in <bound method atlantic.__del__ of <atlantic.atlantic object at 0x0111A690>> ignored

I should add that I get this error with other scripts I tried to write for other newspapers too, but there it didn't prevent an LRF output from being written.

In this case, instead, the LRF contains just the header and nothing else - probably it has something to do with question 1)...

any idea?

Alessandro
Old 01-23-2008, 03:04 PM   #2
kovidgoyal
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
1) No, you need to re-implement the parse_feeds function so that it scans the page http://www.theatlantic.com/doc/current and returns a list of the form

Code:
[('Title', 'URL'), ('Title2', 'URL2'), ...]
Each URL will be of the form "http://www.theatlantic.com/" + the contents of the href attribute.

You can use the BeautifulSoup class to easily parse the HTML.
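The kind of (title, URL) list described above can also be sketched without BeautifulSoup; here is a stdlib-only illustration of the same idea (the markup and the LinkCollector class are invented for the example - they are not libprs500's API):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (title, absolute URL) pairs from <a href="/..."> anchors."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if href.startswith('/'):
                # prepend the site root to make the relative link absolute
                self._href = 'http://www.theatlantic.com' + href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None

p = LinkCollector()
p.feed('<div class="item"><a href="/doc/200801/millbank">Millbank</a></div>')
print(p.links)  # [('Millbank', 'http://www.theatlantic.com/doc/200801/millbank')]
```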
Old 01-25-2008, 04:04 AM   #3
alexxxm
Addict
Quote:
Originally Posted by kovidgoyal View Post
1) No you need to re-implement the parse_feeds function <SNIP>
You can use the BeautifulSoup class to easily parse the HTML
except that (I'll mention it again) I'm anything but a Python coder, and what I did so far was scratched together from pieces of the various other scripts for other feeds...
I'm afraid I didn't see any example of a parse_feeds reimplementation in those, damn.

Alessandro
Old 01-28-2008, 04:30 PM   #4
secretsubscribe
Enthusiast
 
Posts: 26
Karma: 11777
Join Date: Jun 2007
Location: Brooklyn
Device: PRS-500,Treo 750, Archos 605 Wifi
You might need to do something similar to what I did to download The Nation.
Check out the profile at
https://libprs500.kovidgoyal.net/att...s/thenation.py
Old 01-30-2008, 05:50 AM   #5
alexxxm
Addict
Thanks, secretsubscribe,
I'm beginning to see the light...
Now I can download a couple of MB of The Atlantic, but I still have one problem:
The text of each article is split into several parts, and at the end of each one there is the usual line reading: "Pages: 1 2 3 next>".
The URLs those numbers point to are relative, e.g.:

<span class="hankpym">
<span class="safaritime">1</span>
<a href="/doc/200801/miller-education/2">2</a>
<a href="/doc/200801/miller-education/3">3</a>
</span>

<a href="/doc/200801/miller-education/2">next&gt;</a>

so I'd like to replace those, but if I add this:
preprocess_regexps = \
    [(re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in
        [
            # capture the tag prefix so that group(1) exists for the lambda to reuse
            (r'(<a href=")/', lambda match: match.group(1) + 'http://www.theatlantic.com/'),
            # ....
        ]
    ]

in addition to your (modified) def parse_feeds, it isn't able to find any link anymore.
So, how can I replace the relative links with absolute ones in the individual articles?

any hint appreciated...
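For what it's worth, a substitution along these lines works once the pattern actually captures the tag prefix (a minimal, self-contained sketch; the sample markup is taken from the article pages quoted above):

```python
import re

# Capture the leading '<a href="' so the lambda can reuse it; a lambda that
# references match.group(1) fails if the pattern has no capturing group.
pattern = re.compile(r'(<a href=")/', re.IGNORECASE)

html = '<a href="/doc/200801/miller-education/2">2</a>'
absolute = pattern.sub(lambda m: m.group(1) + 'http://www.theatlantic.com/', html)
print(absolute)  # <a href="http://www.theatlantic.com/doc/200801/miller-education/2">2</a>
```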


Alessandro
Old 01-30-2008, 11:25 AM   #6
kovidgoyal
creator of calibre
You'll have to increase max_recursions and use --match-regexp
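Spelled out as a sketch, that advice amounts to two profile settings (max_recursions appears in the profiles in this thread; the match_regexps attribute name is an assumption modeled on the --match-regexp command-line flag, so check it against your libprs500 version):

```python
import re

# Illustrative profile fragment, not a runnable libprs500 profile.
class AtlanticProfile:
    max_recursions = 2  # follow the "Pages: 1 2 3 next>" links one level deep
    # only recurse into URLs that look like article pages (assumed attribute name)
    match_regexps = [r'theatlantic\.com/doc/\d+/']

url = 'http://www.theatlantic.com/doc/200801/miller-education/2'
print(bool(re.search(AtlanticProfile.match_regexps[0], url)))  # True
```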
Old 01-30-2008, 08:47 PM   #7
kovidgoyal
creator of calibre
Here's The Atlantic

Code:
##    Copyright (C) 2008 Kovid Goyal kovid@kovidgoyal.net
##    This program is free software; you can redistribute it and/or modify
##    it under the terms of the GNU General Public License as published by
##    the Free Software Foundation; either version 2 of the License, or
##    (at your option) any later version.
##
##    This program is distributed in the hope that it will be useful,
##    but WITHOUT ANY WARRANTY; without even the implied warranty of
##    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
##    GNU General Public License for more details.
##
##    You should have received a copy of the GNU General Public License along
##    with this program; if not, write to the Free Software Foundation, Inc.,
##    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
import re
from libprs500.ebooks.lrf.web.profiles import DefaultProfile
from libprs500.ebooks.BeautifulSoup import BeautifulSoup

class Atlantic(DefaultProfile):
    
    title = 'The Atlantic'
    max_recursions = 2
    INDEX = 'http://www.theatlantic.com/doc/current'
    
    preprocess_regexps = [
                          (re.compile(r'<body.*?<div id="storytop"', re.DOTALL|re.IGNORECASE), 
                           lambda m: '<body><div id="storytop"')
                          ]
    
    def parse_feeds(self):
        articles = []
        
        src = self.browser.open(self.INDEX).read()
        soup = BeautifulSoup(src)
        
        issue = soup.find('span', attrs={'class':'issue'})
        if issue:
            self.timefmt = ' [%s]'%self.tag_to_string(issue).rpartition('|')[-1].strip().replace('/', '-')
        
        for item in soup.findAll('div', attrs={'class':'item'}):
            a = item.find('a')
            if a and a.has_key('href'):
                url = a['href']
                url = 'http://www.theatlantic.com/'+url.replace('/doc', 'doc/print')
                title = self.tag_to_string(a)
                byline = item.find(attrs={'class':'byline'})
                date = self.tag_to_string(byline) if byline else ''
                description = ''
                articles.append({
                                 'title':title,
                                 'date':date,
                                 'url':url,
                                 'description':description
                                })
                
        
        return {'Current Issue' : articles }
Old 01-31-2008, 04:35 AM   #8
alexxxm
Addict
Thank you for the help!
Unfortunately, it dies at once, with this error:

File "C:\Programmi\libprs500\atlantic.py", line 42, in parse_feeds
self.timefmt = ' [%s]'%self.tag_to_string(issue).rpartition('|')[-1].strip().replace('/', '-')
AttributeError: 'Atlantic' object has no attribute 'tag_to_string'

what do you think?

Alessandro
Old 01-31-2008, 12:25 PM   #9
kovidgoyal
creator of calibre
Upgrade to the latest version of libprs500 (it's a builtin feed there)