Nfl Recipe -- Almost done need a little help

TonytheBookworm · 09-26-2010, 12:37 AM

Starson,
If you get a few minutes, could you look at this code and maybe explain to me why I never get the pcard content (the photo with the player's stats)? I don't see where I'm removing it anywhere, and I'm parsing the */printable/* link, and that page has the pcard.
Thanks.

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile

class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'NFL'
    language = 'en'
    __author__ = 'TonytheBookworm'
    description = 'National FootBall League Coverage'
    publisher = 'Tonythebookworm'
    category = 'sports, football, USA'
    oldest_article = 10
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    
    extra_css = '''
                    article-hdr{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    article-hdr-meta{text-align:right; font-size:small;margin-top:0px;margin-bottom: 0px;}
                    
                    article-hdr-meta-pub{text-align:right; font-size:small;margin-top:0px;margin-bottom: 0px;}
                    article-hdr-meta-updated{text-align:right; font-size:small;margin-top:0px;margin-bottom: 0px;}
                    
                    
                    
                    p{font-family:Helvetica,Arial,sans-serif;font-size:small;}
                '''
    
    #masthead_url = 'http://gawand.org/wp-content/uploads/2010/06/ajc-logo.gif'
    #keep_only_tags    = [
     #                     dict(name='div', attrs={'id':['col1','article-hdr']})
      #                   ,dict(attrs={'class':['articleText']})
       #                 ]
                        
    remove_tags = [{'id':['print-ribbon']},
                   
                  ]     
    #remove_tags_after = dict(name='div', attrs={'style':['margin']})                 
    feeds          = [
                      ('NFL NEWS', 'http://www.nfl.com/rss/rsslanding?searchString=home'),
                      #('ARZ Cardinals', 'http://www.nfl.com/rss/rsslanding?searchString=team&abbr=ARZ'),
                      ('ATL Falcons',  'http://www.nfl.com/rss/rsslanding?searchString=team&abbr=ATL'),
                      
                     ]
    temp_files = []
    articles_are_obfuscated = True

    def get_article_url(self, article):
        return article.get('link', None)

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open(url)
        response = br.follow_link(url_regex = r'/printable/', nr = 0)
        html = response.read()
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name
        
    def postprocess_html(self, soup, first):
        for tag in soup.findAll(name=['li']):
            tag.name = 'div'
        return soup
        
    def preprocess_html(self, soup):
        for item in soup.findAll(attrs={'style':True}):
            del item['style']
        return soup

I see that iframe is turned off by default. How do I turn it back on?

Starson17 · 09-26-2010, 08:36 AM

Quote:

Originally Posted by TonytheBookworm

Starson,
If you get a few minutes could you look at this code and maybe explain to me why I never get the pcard content (the photo with the players stats). I don't see where I'm removing it anywhere and I'm parsing the */printable/* link and that page has the pcard.

What makes you think the page the NFL sends you has pcard or iframe?
edit: Hint - the page that FireFox or IE gets sent is not necessarily the same as what Calibre is sent. It's time to get out TamperData.

TonytheBookworm · 09-26-2010, 12:36 PM

Quote:

Originally Posted by Starson17

What makes you think the page the NFL sends you has pcard or iframe?
edit: Hint - the page that FireFox or IE gets sent is not necessarily the same as what Calibre is sent. It's time to get out TamperData.

Okay, I downloaded TamperData (I'm not 100 percent certain how to use it yet), but when I clicked the print button for the article I saw an entry that lists a referer:

Referer=http://www.nfl.com/news/story/09000d5d81acc392/article/broncos-rb-moreno-out-vs-colts-buckhalter-expected-to-start

That referer looks like nothing more than the current URL. I then tried to figure this out and noticed you had a conversation with Kovid about this. So could you help me, or maybe explain how to go about using this (or whether I should)?

Spoiler:

My first thought was to simply take the line
req = mechanize.Request(url, headers = {'Referer':'http://referer_site.com/'})
and change it to:
req = mechanize.Request(url, headers = {'Referer':url})
but I don't think that is right.

thanks by the way.
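For reference, the idea above (send the article URL as the Referer when requesting the printable page) can be sketched with the standard library's urllib.request.Request, used here only as a stand-in for mechanize.Request. The function name and both URLs are hypothetical, and nothing in this thread establishes that NFL.com actually checks the Referer.

```python
from urllib.request import Request


def request_with_referer(url, referer):
    """Build (but do not send) a request that carries a Referer header."""
    return Request(url, headers={'Referer': referer})


# Hypothetical URLs, for illustration only.
article_url = 'http://www.nfl.com/news/story/EXAMPLE/article/example-story'
printable_url = 'http://www.nfl.com/news/story/EXAMPLE/printable/example-story'

req = request_with_referer(printable_url, article_url)
print(req.get_header('Referer'))
```

Note that the Referer is the page you came from (the article), while the URL being requested is the page you are going to (the printable version), so passing the same URL for both is usually not what a browser would send.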

Starson17 · 09-26-2010, 12:50 PM

Quote:

Originally Posted by TonytheBookworm

Okay, I downloaded Tamperdata

1) You start with your recipe and make sure you're removing nothing, then print what the site sends you. You compare that to what you see in FireFox. If there's something that FireFox gets, but Calibre does not, then it's time to figure out why. That's where you are now (I assume you are sure that your recipe does not receive pcard, even with everything turned on).

2) Once you're sure that you're getting different things, you start tracking down how the site knows the difference between Calibre's request and FF's request. It could be useragent, headers, cookies, etc. TamperData (or Live HTTP Headers) will tell you what FireFox sends.
These commands inside get_browser will show you what Calibre sends:

Code:

        # Print HTTP headers.
        br.set_debug_http(True)
        br.set_debug_responses(True)
        br.set_debug_redirects(True)

Your job is to compare FF to C. TamperData will let you change what FF sends. It's easier to get FF to mimic Calibre because TD will let you change headers just before they are sent, but you can go either way. Eventually, you'll figure out why there's a difference between what the site sends FF and what it sends C.

As an example, I ran into this problem with a Skeptic Blog recipe: I got a Bad Behavior error, and it turned out the site wanted an Accept header. I also ran into it with a comic recipe, where it turned out the site wanted a Referer header.
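The comparison step can be sketched in a few lines of plain Python. This is only an illustration: the helper name diff_headers and the sample header values are made up, and in practice you would paste in what TamperData shows for Firefox and what the debug output shows for Calibre.

```python
def diff_headers(browser_headers, recipe_headers):
    """Return headers the browser sent that the recipe omitted or sent differently.

    The result maps header name -> (browser value, recipe value or None).
    """
    # HTTP header names are case-insensitive, so normalise before comparing.
    recipe = {name.lower(): value for name, value in recipe_headers.items()}
    differences = {}
    for name, value in browser_headers.items():
        other = recipe.get(name.lower())
        if other != value:
            differences[name] = (value, other)
    return differences


# Hypothetical header sets, for illustration only.
firefox = {
    'User-Agent': 'Mozilla/5.0 (hypothetical Firefox UA)',
    'Accept': 'text/html',
    'Referer': 'http://www.nfl.com/news/story/EXAMPLE/article/example-story',
}
calibre = {
    'user-agent': 'Mozilla/5.0 (hypothetical Firefox UA)',
    'Accept': 'text/html',
}

for name, (from_browser, from_recipe) in diff_headers(firefox, calibre).items():
    print(name, '| browser:', from_browser, '| recipe:', from_recipe)
```

Whatever shows up in the result is a candidate for the thing the site is keying on; change one header at a time to find out which one matters.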

TonytheBookworm · 09-26-2010, 10:36 PM

Quote:

Originally Posted by Starson17

1) You start with your recipe and make sure you're removing nothing, then print what the site sends you. You compare that to what you see in FireFox. If there's something that FireFox gets, but Calibre does not, then it's time to figure out why. That's where you are now (I assume you are sure that your recipe does not receive pcard, even with everything turned on).

2) Once you're sure that you're getting different things, you start tracking down how the site knows the difference between Calibre's request and FF's request. It could be useragent, headers, cookies, etc. TamperData (or Live HTTP Headers) will tell you what FireFox sends.
These commands inside get_browser will show you what Calibre sends:

Code:

        # Print HTTP headers.
        br.set_debug_http(True)
        br.set_debug_responses(True)
        br.set_debug_redirects(True)

Your job is to compare FF to C. TamperData will let you change what FF sends. It's easier to get FF to mimic Calibre because TD will let you change headers just before they are sent, but you can go either way. Eventually, you'll figure out why there's a difference between what the site sends FF and what it sends C.

As an example, I ran into this problem with a Skeptic Blog - I got a Bad Behavior error. It turned out the site wanted an Accept header. I also ran into it with a Comic recipe. That turned out that it wanted a referer header, etc.

Alright, this one is kicking my rear. I have a question about the "# Print HTTP headers" settings. Where exactly will that information be output? Do I need to add print statements? Will it be inside myrecipe.txt, or do I need to do something special?

Starson17 · 09-27-2010, 08:23 AM

Quote:

Originally Posted by TonytheBookworm

Alright, this one is kicking my rear. I have a question about the #print http headers. Where exactly will that information be output? do i need to make print statements? Will it be inside the myrecipe.txt or do i need to do something special ?

IIRC, those settings will automatically print the headers and responses to stdout, and your redirect (>test.txt) on the ebook-convert command line will capture them. That shows you the headers/handshaking, which you can compare to TamperData or Live HTTP Headers.

BTW, I'm not saying that the headers are definitely your problem. For all I know the missing part is built by script or flash, or Ajax, etc. It's up to you to find out where the missing stuff is coming from. It's just that after everything else is eliminated, when you see one thing in FF and another in your printed soup, it's often because the site is actually sending two different things, and that's usually due to a diff in the headers sent by FF vs. Calibre.

TonytheBookworm · 09-27-2010, 06:43 PM

Okay, after messing with this for a while I finally figured out why the pcard is not showing up. Yet I don't know exactly how to fix it. So could you hook the jumper cables to me and give me a jump-start, please?

When using Live HTTP Headers and TamperData I noticed that a request is sent out for
http://www.nfl.com/widget/playercard...n=2010&gameId= (which turns out to be the pcard data)

So my question is: do I add that as an addheader, or is it a br.open('http://www.nfl.com/widget/playercard?esbId=EDW720778&season=2010&gameId=')?

Sorry for all the questions, but I'm totally in the dark on this one.

Starson17 · 09-27-2010, 07:50 PM

Quote:

Originally Posted by TonytheBookworm

Okay after messing with this for a while I finally figured out why the pcard is not showing up. Yet, I don't know how exactly to fix it. So could you hook the jumper cables to me and give me a jump-start please ?

When using liveHttp and tamperData i noticed that a request is sent out for
http://www.nfl.com/widget/playercard...n=2010&gameId= (which turns out to be the pcard data)

So my question is: do i add that as an addheader? or is it a br.open('http://www.nfl.com/widget/playercard?esbId=EDW720778&season=2010&gameId=') ?

Sorry for all the questions but i'm totally in the dark on this one

You're close. Note that "esbId=EDW720778" is the player ID. The player ID is in the iframe part of the page you're scraping. Here's code grabbed from a print in the recipe:

Code:

<iframe src="/widget/playercard?esbId=NOR780922&amp;season=2010&amp;gameId=" id="pcard-EOCVFPSS" frameborder="0"></iframe>

You just build the URL, grab the soup with:

Code:

soup = self.index_to_soup(URL)

then put it into your soup of the page where you want it.
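Building the URL from the iframe's src can be sketched with the standard library alone. The helper names playercard_url and player_id are hypothetical, and the base URL is assumed to be http://www.nfl.com/ because the src in the iframe is site-relative. The unescape step covers the case where the parser hands back the attribute with "&amp;" still encoded, as in the printed soup.

```python
from html import unescape
from urllib.parse import parse_qs, urljoin, urlparse


def playercard_url(iframe_src, page_url='http://www.nfl.com/'):
    """Turn the iframe's relative src into an absolute URL that can be fetched."""
    return urljoin(page_url, unescape(iframe_src))


def player_id(iframe_src):
    """Pull the esbId (player ID) out of the iframe's src, or None if absent."""
    query = urlparse(unescape(iframe_src)).query
    return parse_qs(query).get('esbId', [None])[0]


src = '/widget/playercard?esbId=NOR780922&amp;season=2010&amp;gameId='
print(playercard_url(src))
print(player_id(src))
```

Inside the recipe, the result of playercard_url(src) is what would be passed to self.index_to_soup.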

TonytheBookworm · 09-27-2010, 10:25 PM

Quote:

Originally Posted by Starson17

You're close. Note that "esbId=EDW720778" is the player ID. The player ID is in the iframe part of the page you're scraping. Here's code grabbed from a print in the recipe:

Code:

<iframe src="/widget/playercard?esbId=NOR780922&amp;season=2010&amp;gameId=" id="pcard-EOCVFPSS" frameborder="0"></iframe>

You just build the URL, grab the soup with:

Code:

soup = self.index_to_soup(URL)

then put it into your soup of the page where you want it.

One question:
1) I found the iframe:

Code:

<div class="articleText"> <p>CHICAGO -- The Bears say they will hold defensive tackle <a href="/players/tommieharris/profile?id=HAR548445">Tommie Harris</a> out of Monday night's game against the <a href="/teams/greenbaypackers/profile?team=GB">Green Bay Packers</a> on a coach's decision.</p> <p>
<div class="pcard-wrapper  nfl-tag-right" id="pcard-JMEDKDWV-wrapper">
<iframe src="/widget/playercard?esbId=HAR548445&amp;season=2010&amp;gameId=" id="pcard-JMEDKDWV" frameborder="0"></iframe>
</div>

2) You said to build the URL, then put it into the soup wherever I want it. Can you point me to a recipe that does this, or enlighten me? I might even have done it in the past, but I'm having a memory lapse if I have. Thanks.

Something like this, maybe?

Spoiler:

Just not grasping this one yet

Starson17 · 09-28-2010, 09:24 AM

Quote:

Originally Posted by TonytheBookworm

something like this maybe? :confused
...
Just not grasping this one yet

Without setting up your recipe and running it, I can't be sure of the details, but yes, that's the basic idea. You found the pcard part of the URL in the main soup and constructed the link (URL) that you needed. You used soup = self.index_to_soup(URL) to grab that page and turn it into a soup. Now you have to extract() the tag_from_newsoup for whatever you need/want.

You don't want the <head>, etc. I haven't looked at that page, so I can't tell you exactly what or how much you'll want in tag_from_newsoup, but you know how to do that.

Once tag_from_newsoup is extracted, you can either soup.insert(wherever, tag_from_newsoup) or use replaceWith. I know you've used both of them previously. You might just use replaceWith on the <iframe> tag.
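Put together, the replaceWith route might look roughly like the sketch below. It is untested against the live site and rests on several assumptions: the function name inline_player_cards is made up, the soup objects are assumed to offer the BeautifulSoup-style findAll, find, get and replaceWith calls that recipes already use, and keeping the whole <body> of the pcard page is a guess, since the thread does not say which part of that page is wanted.

```python
from html import unescape
from urllib.parse import urljoin


def inline_player_cards(soup, fetch_soup, page_url):
    """Replace each playercard <iframe> in soup with the content it points to.

    soup       -- parsed article page (BeautifulSoup-style object)
    fetch_soup -- callable that takes a URL and returns a parsed page;
                  inside a recipe this would be self.index_to_soup
    page_url   -- URL of the article, used to resolve the relative src
    Returns the number of iframes replaced.
    """
    replaced = 0
    for iframe in soup.findAll('iframe'):
        src = iframe.get('src') or ''
        if '/widget/playercard' not in src:
            continue  # leave unrelated iframes alone
        card_soup = fetch_soup(urljoin(page_url, unescape(src)))
        # Assumption: everything worth keeping is in <body>; the <head> is dropped.
        card = card_soup.find('body')
        if card is None:
            continue
        card.name = 'div'  # a nested <body> is invalid, so rename the tag
        iframe.replaceWith(card)
        replaced += 1
    return replaced
```

In a recipe this could be called from preprocess_html, for example as inline_player_cards(soup, self.index_to_soup, 'http://www.nfl.com/'), before returning the soup. If only part of the pcard page is wanted, narrow the find('body') call to the specific tag.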

So you lied when you said "#no clue on this"

You've got most of it; it's just putting it all together. (Do I get partial author credit on this? Writing all these posts is harder than just writing the recipe.)

TonytheBookworm · 09-28-2010, 11:48 AM

Quote:

Originally Posted by Starson17

Without setting up your recipe and running it, I

You've got most of it; it's just putting it all together. (Do I get partial author credit on this? Writing all these posts is harder than just writing the recipe.)

Man, I'll give you full credit if you want. It doesn't matter to me, because without your help I wouldn't be doing this.

Also, how do you get the src from a tag?

I've also been wondering whether this recipe is worth all the trouble, so it might be a while before it's complete.

Starson17 · 09-28-2010, 12:01 PM

Quote:

Originally Posted by TonytheBookworm

Man, I'll give you full credit if you want.

Quote:

Also, how do you get the src from a tag?

As in the src link from an <img> tag inside another tag called item in the soup?
Do this: item.img['src']

Quote:

I also been thinking is this recipe worth all the trouble, so it might be a while before it gets complete.

I wondered when you were going to get to that point!

I find it more fun to figure out how to do the recipe than to actually write it. It's your recipe, not mine, so you're the author (if you ever finish the grunt work and get it functioning).
