MobileRead Forums - View Single Post - Custom recipes (archive, read-only)

motdiem · 04-08-2009, 05:48 PM

So, I was looking to build a recipe for NY Magazine - I've tried building my own, but my Python/BeautifulSoup skills are not so great...

The TOC is here: http://nymag.com/includes/tableofcontents.htm but it gets redirected to a different page each week, with the format http://nymag.com/nymag/toc/YYYYMMDD/ (where YYYYMMDD is the date of the next Monday...)
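
A minimal sketch of how the weekly TOC address could be computed, assuming the date in the URL is always the first Monday strictly after today (the post does not say what happens when today is itself a Monday). The names `TOC_TEMPLATE`, `next_monday` and `toc_url` are made up for illustration. Simply fetching the tableofcontents.htm address and letting the redirect happen may make this unnecessary.

```python
from datetime import date, timedelta

# URL pattern quoted in the post; YYYYMMDD is filled in below.
TOC_TEMPLATE = "http://nymag.com/nymag/toc/%s/"


def next_monday(today):
    """Return the first Monday strictly after `today` (assumed rule)."""
    days_ahead = 7 - today.weekday()  # Monday is weekday 0, so this is 1..7
    return today + timedelta(days=days_ahead)


def toc_url(today=None):
    """Build the weekly TOC URL for the issue dated next Monday."""
    today = today or date.today()
    return TOC_TEMPLATE % next_monday(today).strftime("%Y%m%d")


print(toc_url(date(2009, 4, 8)))
```

For the date of this post (a Wednesday) this gives the issue dated 13 April 2009.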

I've managed to strip the page down to what I need, but I don't understand how to fetch the articles, etc...

Basically:

Code:

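# Drop everything before the TOC container
remove_tags_before = dict(id='magazine-toc')
# ...and everything after the first element with class "attention"
remove_tags_after  = dict(attrs={'class':['attention']})
# Also drop the cover block and all <h2> headings
remove_tags = [dict(attrs={'class':['cover']}),
               dict(name=['h2'])]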

That leaves me with a page where the structure of an article is:

Code:

<h5><a href="link_to_article">article title</a></h5>
<p>article blurb</p>

(but no enclosing div or anything) - so I'm unsure how to map each article's title, link, etc. to the right keys
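
One possible way to pair each title with its blurb when there is no enclosing element: walk the tags in document order and attach the first `<p>` that follows an `<h5><a>` to that article. The sketch below uses only the standard library's `html.parser` so it runs on its own; `TocParser` and the sample markup are invented for illustration. Inside a recipe the same pairing could be done on the soup of the TOC page. calibre's `parse_index` expects a list of `(section title, list of article dicts)` tuples, with keys such as `title`, `url` and `description`.

```python
from html.parser import HTMLParser


class TocParser(HTMLParser):
    """Collect title/url/description from <h5><a>..</a></h5><p>..</p> runs."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._in_h5 = False
        self._in_link = False
        self._in_blurb = False

    def handle_starttag(self, tag, attrs):
        if tag == "h5":
            self._in_h5 = True
        elif tag == "a" and self._in_h5:
            # A link inside an <h5> starts a new article entry.
            href = dict(attrs).get("href", "")
            self.articles.append({"title": "", "url": href, "description": ""})
            self._in_link = True
        elif tag == "p" and self.articles and not self.articles[-1]["description"]:
            # The first <p> after an article heading is taken as its blurb.
            self._in_blurb = True

    def handle_endtag(self, tag):
        if tag == "h5":
            self._in_h5 = False
        elif tag == "a":
            self._in_link = False
        elif tag == "p":
            self._in_blurb = False

    def handle_data(self, data):
        if self._in_link:
            self.articles[-1]["title"] += data
        elif self._in_blurb:
            self.articles[-1]["description"] += data


# Hypothetical sample in the same shape as the stripped TOC page.
sample = (
    '<h5><a href="/news/one/">First title</a></h5><p>First blurb</p>'
    '<h5><a href="/news/two/">Second title</a></h5><p>Second blurb</p>'
)
parser = TocParser()
parser.feed(sample)

# Shape that parse_index would return: one section holding all articles.
index = [("NY Magazine", parser.articles)]
print(index)
```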

... I then want to replace the article URL with the print version, which is basically:

Code:

http://www.printthis.clickability.com/pt/cpt?action=cpt&title=ARTICLE-TITLE&expire=&urlID=STRANGE-NUMBER&fb=Y&url=ARTICLE-URL

I can't figure out where the STRANGE-NUMBER comes from in the article page either...
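
A sketch of how the print address could be assembled once the number is known. Where the `urlID` value comes from is still the open question, so it is left as a plain parameter here; `PRINT_BASE`, `print_url` and the sample values are hypothetical. In a recipe this logic would probably live in `print_version`, which only receives the article URL, so the title and ID would have to be looked up separately.

```python
from urllib.parse import urlencode

# Base of the print-view address quoted in the post.
PRINT_BASE = "http://www.printthis.clickability.com/pt/cpt"


def print_url(article_url, title, url_id):
    """Build the print-view URL; url_id must be supplied by the caller."""
    params = [
        ("action", "cpt"),
        ("title", title),
        ("expire", ""),
        ("urlID", url_id),  # the STRANGE-NUMBER, origin unknown
        ("fb", "Y"),
        ("url", article_url),
    ]
    # urlencode escapes the title and the nested article URL.
    return PRINT_BASE + "?" + urlencode(params)


# Made-up values, only to show the shape of the result.
print(print_url("http://nymag.com/news/example/", "Some Title", "12345"))
```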

Hope this makes sense - Thanks for your help

Post #430 · NYMag.com recipe help · Last edited by motdiem; 04-08-2009 at 05:56 PM.