Quote:
Originally Posted by Starson17
I did.
It looks like it's an <a> tag inside a <div class="artIntroShort"> tag. Correct? Then just do the same thing you did with an <a> tag inside an <h2> tag, but use div instead of h2 and specify the class. That should do it.
Nope. I can stop at any time. I like to see others with the same interest I have.
Edit: looking back at your code, I see that's sort of what you did, but you have an extra for loop layer at the leftcontent that I don't think you need.
Alright, as you mentioned, I had an extra for loop layer. I removed it and modified the code a little, and it works fine with one exception. The reason for the extra for loop was to snag the title, which is the date at the top, inside the leftcontent div as an <h1> tag. My logic was this: get the title in the first for loop, then start another for loop to get the content for that title. So would I get the title first in a single for loop and append it, then run the for loop for the content and append that, then do the other for loop that looks for the <h2> stuff?
Basically, this works fine for the non-<h2> stuff, with the exception of the title.
My question is: how would I get something like this to work?
Code:
#-------------------------------------------------------
# this for loop is trying to get the title
for t_item in soup.findAll('div', {"class":"leftcontent"}):
    print 't_item is: ', t_item
    rawh1 = t_item.find('h1')
    title = self.tag_to_string(rawh1)
    print 'rawh1 title is: ', title
# indent might not show right on here but this should be
# an independent for loop
#-------------------------------------------------------
#-------------------next get the non <h2> content this works
for content in soup.findAll('div', {"class":"artIntroShort"}):
    print 'The content is: ', content
    art_text = content.find('p')
    print 'Art_text is :', art_text
    link = content.find('a')
    print 'The link is :', link
    url = self.INDEX + link['href']
    print 'The URL is :', url
    current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
#------------------------------------------------------------------------
Of course, I have the return statements and everything else, but this is the block I'm concerned about. Thanks.
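One possible arrangement, as a rough sketch only: read the <h1> once with find() before the content loop, then reuse that title for every entry, so no outer for loop is needed. This assumes the standalone bs4 package rather than the soup a calibre recipe hands you, and the sample HTML, the INDEX value, and the collect_articles name are all made up for illustration.

```python
from bs4 import BeautifulSoup

INDEX = 'http://example.com'  # hypothetical base URL, stands in for self.INDEX

# Made-up sample page, shaped like the structure described in the post
SAMPLE = '''
<div class="leftcontent">
  <h1>Monday, January 4</h1>
  <div class="artIntroShort"><p>First story <a href="/story1.html">more</a></p></div>
  <div class="artIntroShort"><p>Second story <a href="/story2.html">more</a></p></div>
</div>
'''

def collect_articles(soup):
    # Step 1: read the date heading once, outside any loop
    title = ''
    left = soup.find('div', {'class': 'leftcontent'})
    if left is not None and left.find('h1') is not None:
        title = left.find('h1').get_text().strip()

    # Step 2: independent loop over the article blocks, reusing the title
    articles = []
    for content in soup.find_all('div', {'class': 'artIntroShort'}):
        link = content.find('a')
        if link is None or not link.get('href'):
            continue  # skip blocks without a usable link
        articles.append({'title': title, 'url': INDEX + link['href'],
                         'description': '', 'date': ''})
    return articles

articles = collect_articles(BeautifulSoup(SAMPLE, 'html.parser'))
for article in articles:
    print(article['title'], article['url'])
```

Inside a recipe, the same shape would presumably use self.tag_to_string and self.INDEX as in the block above; the None checks are there so a page without the <h1> or without a link does not raise.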
Also, I'm noticing that there are <span> tags inside the <p> tags, so when I search for the <a> inside the <p>, I get the dang links for the ads instead of the last <a> tag. This one is really working the brain; it'll be interesting to see how it works out. I looked at the output log and, like I said, 'The URL is :' keeps coming out as the ad.doubleclick link that is inside the <span>. I tried doing a remove_tags on that tag, but apparently it doesn't remove the tag until after the parsing is done.
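Since remove_tags does not seem to kick in while the index is being parsed, one thing that might work is pulling the <span> tags out of the soup by hand before looking for the link. The sketch below assumes the standalone bs4 package; the sample HTML and the article_link helper are invented for illustration, not taken from the actual site or from calibre.

```python
from bs4 import BeautifulSoup

# Made-up sample: an ad link inside a <span>, then the real article link
SAMPLE = '''
<div class="artIntroShort">
  <p><span><a href="http://ad.doubleclick.net/click">ad</a></span>
  Story text <a href="/real-story.html">read more</a></p>
</div>
'''

def article_link(content):
    # Drop every <span> (and the ad link inside it) from the parsed tree
    for span in content.find_all('span'):
        span.extract()
    # Whatever <a> tags remain belong to the article; take the last one
    links = content.find_all('a')
    return links[-1] if links else None

soup = BeautifulSoup(SAMPLE, 'html.parser')
content = soup.find('div', {'class': 'artIntroShort'})
link = article_link(content)
print(link['href'])
```

Taking the last remaining <a> matches the observation that the wanted link comes last; if the ads ever sit outside a <span>, filtering on the href (for example skipping anything containing ad.doubleclick) would be another option.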