Old 10-10-2011, 04:00 PM   #5
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by badhaggis View Post
OK, I spent a morning looking through this and am really not making much progress. I've narrowed what I need more information on down to the section quoted. I assume the parsing would go into the "for id in soup.findAll" loop below.
I was thinking of two options. One was in get_feeds. You'd grab what you needed while the feeds were being worked on.

The other option was to do it at the article stage. You quoted my "Otherwise", which was the article-stage option, so that loop is not the right place for it. You want to be in preprocess_html, which works on the articles as they are fetched.

To do it there: after the article has been fetched, you can modify it, either before it's processed or after (using preprocess_html or postprocess_html). I was thinking you would re-grab the RSS feed page (yes, at this point it's already been processed, the articles have been identified, etc., but that's OK).

You are just going to grab the RSS feed page again (you'd do it multiple times, once for each article) and grab some parts from it. So how do you do this? At the pre/postprocess stage you know the Article Title. It's part of the "soup" of the article page. (You need to use BeautifulSoup to find it there so you can use it.)

You want something from the feed page. That "something" is associated with the matching Article Title on the feed page, so while you are in preprocess_html (or postprocess_html - it doesn't matter) you use index_to_soup to grab a second soup: the soup of the feed page. As I posted, you would "parse it (the second soup, from the feed page) to find what you want (e.g. search for the article title and grab the other elements/text you want once it's found)."

It would basically be a loop that looks through the feed page for the article title tag that matches the current article being worked on in pre/postprocess_html, then grabs whatever you need for the current article from the second soup (the RSS feed page). Then use BeautifulSoup to stick it into the first soup (the article being worked on).
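
To make the matching loop concrete, here is a rough sketch of just that step. It is only an illustration under some assumptions: the sample feed, the field names (author, pubDate) and the helper names normalize and find_feed_extras are all made up, and it parses the feed with Python's standard xml.etree.ElementTree rather than BeautifulSoup so that it runs outside calibre. In a real recipe the feed would come from index_to_soup inside preprocess_html and you would walk the second soup the same way.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real recipe would re-fetch the live feed page instead.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First Article</title>
      <author>Alice</author>
      <pubDate>Mon, 10 Oct 2011 12:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second Article</title>
      <author>Bob</author>
      <pubDate>Mon, 10 Oct 2011 13:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def normalize(text):
    """Collapse whitespace and case so small formatting differences still match."""
    return " ".join((text or "").split()).lower()


def find_feed_extras(feed_xml, article_title, wanted=("author", "pubDate")):
    """Loop over the feed items and return the wanted fields for the matching title."""
    root = ET.fromstring(feed_xml)
    target = normalize(article_title)
    for item in root.iter("item"):
        if normalize(item.findtext("title")) == target:
            # Found the feed entry for the article currently being processed.
            return {name: item.findtext(name) for name in wanted}
    return None  # no item in the feed matched this article


# The title here stands in for the one you would pull out of the article soup.
extras = find_feed_extras(SAMPLE_FEED, "  second   article ")
print(extras)
```

Whatever comes back (or nothing, if no title matched) is what you would then insert into the article soup. Comparing normalized titles is a guess at what is robust enough; if the feed and the article page word their titles differently, you would need a looser match.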

Last edited by Starson17; 10-10-2011 at 04:04 PM.