MobileRead Forums > E-Book Readers > Sony Reader
Old 01-23-2008, 09:33 AM   #1
alexxxm
Addict
 
Posts: 223
Karma: 356
Join Date: Aug 2007
Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ...
python coding...

I am trying to write a simple applet for web2lrf/libprs500 to download the magazine The Atlantic (http://www.theatlantic.com/) - it has been free since today...

damn, I don't know Python, so I have a couple of problems...

1) under http://www.theatlantic.com/doc/current, all the links are relative (e.g. <a href="/doc/200801/millbank">), so I began with:


preprocess_regexps = [(re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in
    [
        # capture the tag prefix so that group(1) exists for the lambda to reuse
        (r'(<a href=")/', lambda match: match.group(1) + 'http://www.theatlantic.com/'),
    ]
]


... is it right?

2) At the end of every run I get this error (freely translated by me from the Italian Windows version!)

Exception exceptions.WindowsError: WindowsError(32, 'Impossible to access the file. File is used by another process') in <bound method atlantic.__del__ of <atlantic.atlantic object at 0x0111A690>> ignored

I should add that I get this error with other scripts I tried to write for other newspapers too, but there it didn't prevent an LRF output from being written.

In this case, instead, the LRF contains just the header and nothing else - probably it has something to do with question 1)...

any idea?

Alessandro
Old 01-23-2008, 03:04 PM   #2
kovidgoyal
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
1) No, you need to re-implement the parse_feeds function so that it scans the page http://www.theatlantic.com/doc/current and returns a list of the form

Code:
[('Title', 'URL'), ('Title2', 'URL2'), ...]
Each URL will be of the form "http://www.theatlantic.com/" + the contents of the href attribute.

You can use the BeautifulSoup class to easily parse the HTML.
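The kind of (title, URL) list described above can also be sketched without BeautifulSoup; here is a stdlib-only illustration of the same idea (the markup and the LinkCollector class are invented for the example - they are not libprs500's API):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (title, absolute URL) pairs from <a href="/..."> anchors."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if href.startswith('/'):
                # prepend the site root to make the relative link absolute
                self._href = 'http://www.theatlantic.com' + href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None

p = LinkCollector()
p.feed('<div class="item"><a href="/doc/200801/millbank">Millbank</a></div>')
print(p.links)  # [('Millbank', 'http://www.theatlantic.com/doc/200801/millbank')]
```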
Old 01-25-2008, 04:04 AM   #3
alexxxm
Addict
Quote:
Originally Posted by kovidgoyal View Post
1) No you need to re-implement the parse_feeds function <SNIP>
You can use the BeautifulSoup class to easily parse the HTML
except that (I'll mention it again) I'm anything but a Python coder, and what I did so far was scratched together from pieces of the various other scripts for other feeds...
I'm afraid I didn't see any example of a parse_feeds reimplementation in those, damn.

Alessandro
Old 01-28-2008, 04:30 PM   #4
secretsubscribe
Enthusiast
 
Posts: 26
Karma: 11777
Join Date: Jun 2007
Location: Brooklyn
Device: PRS-500,Treo 750, Archos 605 Wifi
You might need to do something similar to what I did to download The Nation.
Check out the profile at
https://libprs500.kovidgoyal.net/att...s/thenation.py
Old 01-30-2008, 05:50 AM   #5
alexxxm
Addict
Thanks, secretsubscribe,
I'm beginning to see the light...
Now I can download a couple of MB of The Atlantic, but I still have one problem:
The text of each article is split into several parts, and at the end of each one there is the usual line reading: "Pages: 1 2 3 next>".
The URLs those numbers point to are relative, e.g.:

<span class="hankpym">
<span class="safaritime">1</span>
<a href="/doc/200801/miller-education/2">2</a>
<a href="/doc/200801/miller-education/3">3</a>
</span>

<a href="/doc/200801/miller-education/2">next&gt;</a>

so I'd like to replace those, but if I add this:
preprocess_regexps = \
    [(re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in
        [
            # capture the tag prefix so that group(1) exists for the lambda to reuse
            (r'(<a href=")/', lambda match: match.group(1) + 'http://www.theatlantic.com/'),
            # ....
        ]
    ]

in addition to your (modified) def parse_feeds, it isn't able to find any link anymore.
So, how can I replace the relative links with absolute ones in the individual articles?

any hint appreciated...
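For what it's worth, a substitution along these lines works once the pattern actually captures the tag prefix (a minimal, self-contained sketch; the sample markup is taken from the article pages quoted above):

```python
import re

# Capture the leading '<a href="' so the lambda can reuse it; a lambda that
# references match.group(1) fails if the pattern has no capturing group.
pattern = re.compile(r'(<a href=")/', re.IGNORECASE)

html = '<a href="/doc/200801/miller-education/2">2</a>'
absolute = pattern.sub(lambda m: m.group(1) + 'http://www.theatlantic.com/', html)
print(absolute)  # <a href="http://www.theatlantic.com/doc/200801/miller-education/2">2</a>
```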


Alessandro
Old 01-30-2008, 11:25 AM   #6
kovidgoyal
creator of calibre
You'll have to increase max_recursions and use --match-regexp
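Spelled out as a sketch, that advice amounts to two profile settings (max_recursions appears in the profiles in this thread; the match_regexps attribute name is an assumption modeled on the --match-regexp command-line flag, so check it against your libprs500 version):

```python
import re

# Illustrative profile fragment, not a runnable libprs500 profile.
class AtlanticProfile:
    max_recursions = 2  # follow the "Pages: 1 2 3 next>" links one level deep
    # only recurse into URLs that look like article pages (assumed attribute name)
    match_regexps = [r'theatlantic\.com/doc/\d+/']

url = 'http://www.theatlantic.com/doc/200801/miller-education/2'
print(bool(re.search(AtlanticProfile.match_regexps[0], url)))  # True
```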
Old 01-30-2008, 08:47 PM   #7
kovidgoyal
creator of calibre
Here's The Atlantic

Code:
##    Copyright (C) 2008 Kovid Goyal kovid@kovidgoyal.net
##    This program is free software; you can redistribute it and/or modify
##    it under the terms of the GNU General Public License as published by
##    the Free Software Foundation; either version 2 of the License, or
##    (at your option) any later version.
##
##    This program is distributed in the hope that it will be useful,
##    but WITHOUT ANY WARRANTY; without even the implied warranty of
##    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
##    GNU General Public License for more details.
##
##    You should have received a copy of the GNU General Public License along
##    with this program; if not, write to the Free Software Foundation, Inc.,
##    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
import re
from libprs500.ebooks.lrf.web.profiles import DefaultProfile
from libprs500.ebooks.BeautifulSoup import BeautifulSoup

class Atlantic(DefaultProfile):
    
    title = 'The Atlantic'
    max_recursions = 2
    INDEX = 'http://www.theatlantic.com/doc/current'
    
    preprocess_regexps = [
                          (re.compile(r'<body.*?<div id="storytop"', re.DOTALL|re.IGNORECASE), 
                           lambda m: '<body><div id="storytop"')
                          ]
    
    def parse_feeds(self):
        articles = []
        
        src = self.browser.open(self.INDEX).read()
        soup = BeautifulSoup(src)
        
        issue = soup.find('span', attrs={'class':'issue'})
        if issue:
            self.timefmt = ' [%s]'%self.tag_to_string(issue).rpartition('|')[-1].strip().replace('/', '-')
        
        for item in soup.findAll('div', attrs={'class':'item'}):
            a = item.find('a')
            if a and a.has_key('href'):
                url = a['href']
                url = 'http://www.theatlantic.com/'+url.replace('/doc', 'doc/print')
                title = self.tag_to_string(a)
                byline = item.find(attrs={'class':'byline'})
                date = self.tag_to_string(byline) if byline else ''
                description = ''
                articles.append({
                                 'title':title,
                                 'date':date,
                                 'url':url,
                                 'description':description
                                })
                
        
        return {'Current Issue' : articles }
Old 01-31-2008, 04:35 AM   #8
alexxxm
Addict
Thank you for the help!
Unfortunately, it dies at once, with this error:

File "C:\Programmi\libprs500\atlantic.py", line 42, in parse_feeds
self.timefmt = ' [%s]'%self.tag_to_string(issue).rpartition('|')[-1].strip().replace('/', '-')
AttributeError: 'Atlantic' object has no attribute 'tag_to_string'

what do you think?

Alessandro
Old 01-31-2008, 12:25 PM   #9
kovidgoyal
creator of calibre
Upgrade to the latest version of libprs500 (it's a builtin feed there)