01-30-2008, 05:50 AM   #5
alexxxm
Addict
alexxxm has a complete set of Star Wars action figures.
 
Posts: 223
Karma: 356
Join Date: Aug 2007
Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ...
Thanks, secretsubscribe,
I'm beginning to see the light...
Now I can download a couple of MB of The Atlantic, but I still have one problem:
The text of each article is split into several parts, and at the end of each one there is the usual line reading: "Pages: 1 2 3 next>".
The URLs those numbers point to are relative, e.g.:

<span class="hankpym">
<span class="safaritime">1</span>
<a href="/doc/200801/miller-education/2">2</a>
<a href="/doc/200801/miller-education/3">3</a>
</span>

<a href="/doc/200801/miller-education/2">next&gt;</a>

so I'd like to replace those, but if I add this:
preprocess_regexps = \
    [ (re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in
        [
            (r'<a href="/', lambda match: match.group().replace(match.group(1), '<a href="http://www.theatlantic.com')),
            # ....
        ]
    ]

in addition to your (modified) def parse_feeds, it can no longer find any links.
So, how can I convert the relative links in the individual articles to absolute ones?
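
For reference, here is a minimal standalone sketch of the substitution I am after, using only Python's re module. One thing it makes visible: the pattern r'<a href="/' contains no capture group, so calling match.group(1) on it raises IndexError ("no such group"); whether that is what stops the link detection I cannot say. The names BASE_URL, RELATIVE_HREF and absolutize_links are made up for this illustration and are not part of the recipe API.

```python
import re

# Assumption: all root-relative links belong to this site.
BASE_URL = 'http://www.theatlantic.com'

# Match href="/..." but leave protocol-relative href="//..." untouched.
RELATIVE_HREF = re.compile(r'href="/(?!/)', re.IGNORECASE)

def absolutize_links(html, base=BASE_URL):
    """Rewrite root-relative href attributes into absolute URLs."""
    # The replacement is built directly; no capture group is needed.
    return RELATIVE_HREF.sub(lambda match: 'href="' + base + '/', html)

sample = '<a href="/doc/200801/miller-education/2">2</a>'
print(absolutize_links(sample))
# prints: <a href="http://www.theatlantic.com/doc/200801/miller-education/2">2</a>
```

If the same idea were used inside preprocess_regexps, the tuple would presumably be (r'href="/(?!/)', lambda match: 'href="http://www.theatlantic.com/'), but I have not verified that against the recipe code.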

Any hints appreciated...


Alessandro