Quote:
Originally Posted by Mixx
Hi,
I am trying out a number of recipes and wonder if we should have a sticky to identify those recipes that no longer work properly and need an update.
I'd like to add
Code:
Business Week
CIO (CIO Magazin)
Inquirer.net (The Inquirer)
My hope is that fellow forum readers with the right skills might want to fix one or two, if their time permits.
We could even have a voting list to see which are (or rather would be) the most popular ones.
Regards, Mixx
Regarding Business Week, here is what I have done:
I have reviewed the Business Week recipe (the one from Kovid Goyal and Darko Miletic, which is the one that works).
It fails because Business Week is not consistent about how it structures its pages.
I've had to change two things:
At line 44, replace:
keep_only_tags = [dict(name='div', attrs={'id':['story-body','storyBody']})]
with:
keep_only_tags = [dict(name='div', attrs={'id':['story-body','storyBody','article_body','articleBody']})]
Some of its pages have the main article under the "storyBody" DIV, but others have it under the "article_body" DIV (that is why they don't work).
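To check which container id a given page uses before editing the recipe, something like the following could help. It is only a rough sketch using the Python standard library, not part of the recipe itself; the names ACCEPTED_IDS, DivIdCollector and article_container_ids, as well as the two page fragments, are made up for illustration.

```python
from html.parser import HTMLParser

# IDs the updated keep_only_tags line accepts for the article container.
ACCEPTED_IDS = ['story-body', 'storyBody', 'article_body', 'articleBody']


class DivIdCollector(HTMLParser):
    """Collect the id attribute of every <div> start tag in a page."""

    def __init__(self):
        super().__init__()
        self.div_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            div_id = dict(attrs).get('id')
            if div_id:
                self.div_ids.append(div_id)


def article_container_ids(html):
    """Return the accepted container ids found in the page, in page order."""
    collector = DivIdCollector()
    collector.feed(html)
    return [i for i in collector.div_ids if i in ACCEPTED_IDS]


# Hypothetical page fragments, invented for this example.
old_style = '<html><body><div id="storyBody"><p>text</p></div></body></html>'
new_style = '<html><body><div id="article_body"><p>text</p></div></body></html>'

print(article_container_ids(old_style))
print(article_container_ids(new_style))
```

An empty list for a saved page would suggest the site is using yet another id for that article, which would then need to be added to the list.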
At line 92, replace:
rurl = url.replace('http://www.businessweek.com/','http://www.businessweek.com/print/')
with:
if '/magazine' in url:
    rurl = url.replace('http://www.businessweek.com/','http://www.businessweek.com/printer/')
else:
    rurl = url.replace('http://www.businessweek.com/','http://www.businessweek.com/print/')
Some of its articles (those under /magazine) have a printer page whose URL is built in a different way: /printer/ instead of /print/.
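The URL rule can be tried on its own, outside calibre, with a small stand-alone function. This is just a sketch of the same logic; the function name print_version_url, the BASE constant and the example URLs are assumptions for illustration and are not taken from the recipe.

```python
BASE = 'http://www.businessweek.com/'


def print_version_url(url):
    """Build the printer-friendly URL for a Business Week article URL."""
    if '/magazine' in url:
        # Magazine articles use the /printer/ prefix.
        return url.replace(BASE, BASE + 'printer/')
    # Everything else keeps the original /print/ prefix.
    return url.replace(BASE, BASE + 'print/')


# Hypothetical article URLs, invented for this example.
print(print_version_url('http://www.businessweek.com/magazine/content/example.htm'))
print(print_version_url('http://www.businessweek.com/technology/content/example.htm'))
```

Keeping the rule in one small function like this makes it easy to add another branch if a third URL pattern turns up.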
With these changes the recipe still pulls in some extra content (which needs to be removed), but at least it works.
Hope that someone can make a better correction. For the moment I attach my changes just in case someone may find them useful.
Best regards.