Old 09-26-2010, 12:37 AM   #1
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
NFL Recipe -- Almost done, need a little help

Starson,
If you get a few minutes, could you look at this code and maybe explain to me why I never get the pcard content (the photo with the player's stats)? I don't see where I'm removing it anywhere, and I'm parsing the */printable/* link, and that page has the pcard.
Thanks.
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile

class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'NFL'
    language = 'en'
    __author__ = 'TonytheBookworm'
    description = 'National FootBall League Coverage'
    publisher = 'Tonythebookworm'
    category = 'sports, football, USA'
    oldest_article = 10
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    
    extra_css = '''
                    article-hdr{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    article-hdr-meta{text-align:right; font-size:small;margin-top:0px;margin-bottom: 0px;}
                    
                    article-hdr-meta-pub{text-align:right; font-size:small;margin-top:0px;margin-bottom: 0px;}
                    article-hdr-meta-updated{text-align:right; font-size:small;margin-top:0px;margin-bottom: 0px;}
                    
                    
                    
                    p{font-family:Helvetica,Arial,sans-serif;font-size:small;}
		        '''
    
    #masthead_url = 'http://gawand.org/wp-content/uploads/2010/06/ajc-logo.gif'
    #keep_only_tags    = [
     #                     dict(name='div', attrs={'id':['col1','article-hdr']})
      #                   ,dict(attrs={'class':['articleText']})
       #                 ]
                        
    remove_tags = [{'id':['print-ribbon']},
                   
                  ]     
    #remove_tags_after = dict(name='div', attrs={'style':['margin']})                 
    feeds          = [
                      ('NFL NEWS', 'http://www.nfl.com/rss/rsslanding?searchString=home'),
                      #('ARZ Cardinals', 'http://www.nfl.com/rss/rsslanding?searchString=team&abbr=ARZ'),
                      ('ATL Falcons',  'http://www.nfl.com/rss/rsslanding?searchString=team&abbr=ATL'),
                      
                     ]
    temp_files = []
    articles_are_obfuscated = True

    def get_article_url(self, article):
        return article.get('link', None)

    def get_obfuscated_article(self, url):
        # Open the article, then follow its first "/printable/" link.
        br = self.get_browser()
        br.open(url)
        response = br.follow_link(url_regex=r'/printable/', nr=0)
        html = response.read()
        # Save the printable page to a temp file and hand its path to Calibre.
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name
        
    def postprocess_html(self, soup, first):
        for tag in soup.findAll(name=['li']):
            tag.name = 'div'
        return soup
        
    def preprocess_html(self, soup):
        for item in soup.findAll(attrs={'style':True}):
            del item['style']
        return soup

I see that iframes are turned off by default. How do I turn them back on?

Last edited by TonytheBookworm; 09-26-2010 at 12:48 AM.
Old 09-26-2010, 08:36 AM   #2
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
What makes you think the page the NFL sends you has the pcard or the iframe?
Edit: Hint - the page that Firefox or IE is sent is not necessarily the same as the one Calibre is sent. It's time to get out TamperData.
Old 09-26-2010, 12:36 PM   #3
TonytheBookworm
Okay, I downloaded TamperData (I'm not 100 percent certain how to use it yet), but when I clicked the print button for the article I saw an entry that lists a referer:

Referer=http://www.nfl.com/news/story/09000d5d81acc392/article/broncos-rb-moreno-out-vs-colts-buckhalter-expected-to-start

That referer looks like nothing more than the current URL. I then tried to figure this out and noticed you had a conversation with Kovid about this. So could you help me, or maybe explain to me how to go about using this (or whether I should)?

Spoiler:

Code:
import mechanize  # needed for mechanize.Request below

def get_browser(self):
    br = BasicNewsRecipe.get_browser(self)
    orig_open_novisit = br.open_novisit

    def my_open_no_visit(url, **kwargs):
        # Wrap the URL in a Request that carries a Referer header.
        req = mechanize.Request(url, headers={'Referer': 'http://referer_site.com/'})
        return orig_open_novisit(req)

    br.open_novisit = my_open_no_visit
    return br


My first thought was to simply take the line
req = mechanize.Request(url, headers = {'Referer':'http://referer_site.com/'})
and change it to:
req = mechanize.Request(url, headers = {'Referer':url})
but I don't think that is right.

Thanks, by the way.
Old 09-26-2010, 12:50 PM   #4
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
Okay, I downloaded Tamperdata
1) You start with your recipe and make sure you're removing nothing, then print what the site sends you. You compare that to what you see in Firefox. If there's something that Firefox gets but Calibre does not, then it's time to figure out why. That's where you are now (I assume you are sure that your recipe does not receive the pcard, even with everything turned on).

2) Once you're sure that you're getting different things, you start tracking down how the site tells Calibre's request from FF's request. It could be the user agent, other headers, cookies, etc. TamperData (or Live HTTP Headers) will tell you what Firefox sends.
These commands inside get_browser will show you what Calibre sends:
Code:
        # Print HTTP headers.
        br.set_debug_http(True)
        br.set_debug_responses(True)
        br.set_debug_redirects(True)
Your job is to compare FF to C (Calibre). TamperData will let you change what FF sends. It's easier to get FF to mimic Calibre, because TD will let you change headers just before they are sent, but you can go either way. Eventually, you'll figure out why there's a difference between what the site sends FF and what it sends C.

As an example, I ran into this problem with a Skeptic Blog: I got a Bad Behavior error, and it turned out the site wanted an Accept header. I also ran into it with a Comic recipe; that site turned out to want a Referer header.
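
Once both sets of headers are written down, the FF-vs-Calibre comparison can be done mechanically. Here is a rough sketch in plain Python (nothing Calibre-specific), assuming the headers were copied by hand from TamperData and from the recipe's debug output into two dicts. The name diff_headers and all the sample header values are made up for illustration.

```python
def diff_headers(browser_headers, calibre_headers):
    """Return the browser headers that Calibre lacks or sends differently.

    Header names are compared case-insensitively, since HTTP header
    names are not case-sensitive.
    """
    calibre = {name.lower(): value for name, value in calibre_headers.items()}
    report = {}
    for name, value in browser_headers.items():
        other = calibre.get(name.lower())
        if other is None:
            report[name] = ('missing', value, None)
        elif other != value:
            report[name] = ('differs', value, other)
    return report


# Made-up sample values, for illustration only.
firefox = {
    'User-Agent': 'Mozilla/5.0 (example browser)',
    'Accept': 'text/html',
    'Referer': 'http://www.nfl.com/news/story/example',
}
calibre = {
    'user-agent': 'Mozilla/5.0 (example recipe)',
    'accept': 'text/html',
}

for name, (status, ff_value, c_value) in sorted(diff_headers(firefox, calibre).items()):
    print(name, status, ff_value, c_value)
```

Any header reported as missing or different is a candidate to change in TamperData, one at a time, until the site's response changes.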
Old 09-26-2010, 10:36 PM   #5
TonytheBookworm
Alright, this one is kicking my rear. I have a question about the "# Print HTTP headers" settings. Where exactly will that information be output? Do I need to add print statements? Will it be inside myrecipe.txt, or do I need to do something special?
Old 09-27-2010, 08:23 AM   #6
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
Alright, this one is kicking my rear. I have a question about the #print http headers. Where exactly will that information be output? do i need to make print statements? Will it be inside the myrecipe.txt or do i need to do something special ?
IIRC, those settings will automatically print the headers and responses to stdout, and your >test.txt redirect on the ebook-convert command line will capture them. That shows you the header handshaking, which you can compare to TamperData or LiveHTTPHeaders.

BTW, I'm not saying that the headers are definitely your problem. For all I know, the missing part is built by script, Flash, Ajax, etc. It's up to you to find out where the missing stuff is coming from. It's just that after everything else is eliminated, when you see one thing in FF and another in your printed soup, it's often because the site is actually sending two different things, and that's usually due to a difference in the headers sent by FF vs. Calibre.
Old 09-27-2010, 06:43 PM   #7
TonytheBookworm
Okay, after messing with this for a while I finally figured out why the pcard is not showing up. Yet I don't know exactly how to fix it. So could you hook the jumper cables to me and give me a jump-start, please?

When using LiveHTTPHeaders and TamperData, I noticed that a request is sent out for
http://www.nfl.com/widget/playercard...n=2010&gameId= (which turns out to be the pcard data).

So my question is: do I add that as an addheader? Or is it a br.open('http://www.nfl.com/widget/playercard?esbId=EDW720778&season=2010&gameId=')?

Sorry for all the questions, but I'm totally in the dark on this one.
Old 09-27-2010, 07:50 PM   #8
Starson17
You're close. Note that "esbId=EDW720778" is the player ID. The player ID is in the iframe part of the page you're scraping. Here's code grabbed from a print in the recipe:
Code:
<iframe src="/widget/playercard?esbId=NOR780922&amp;season=2010&amp;gameId=" id="pcard-EOCVFPSS" frameborder="0"></iframe>
You just build the URL, grab the soup with:
Code:
soup = self.index_to_soup(URL)
then put it into your soup of the page where you want it.
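
For the "build the URL" step, here is a minimal sketch using only the Python standard library so it can be tried outside Calibre. It pulls the src out of each iframe and joins it onto the page's address; in the recipe itself you would read the src from the soup and pass the result to self.index_to_soup(URL). The names IframeSrcCollector and pcard_urls, and the article URL passed in, are placeholders invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcCollector(HTMLParser):
    """Collect the src attribute of every <iframe> in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'iframe':
            # HTMLParser has already turned &amp; into & in attribute values.
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def pcard_urls(page_html, page_url):
    """Return an absolute URL for each iframe found in page_html."""
    collector = IframeSrcCollector()
    collector.feed(page_html)
    return [urljoin(page_url, src) for src in collector.sources]


# The iframe printed from the recipe; the article URL is a placeholder.
snippet = ('<iframe src="/widget/playercard?esbId=NOR780922&amp;season=2010'
           '&amp;gameId=" id="pcard-EOCVFPSS" frameborder="0"></iframe>')
urls = pcard_urls(snippet, 'http://www.nfl.com/news/story/example')
print(urls[0])
```

Note that the src in the page is relative and HTML-escaped, so both the host and the plain & separators have to be restored before the URL can be fetched.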
Old 09-27-2010, 10:25 PM   #9
TonytheBookworm
One question. I found the iframe:
Code:
<div class="articleText"> <p>CHICAGO -- The Bears say they will hold defensive tackle <a href="/players/tommieharris/profile?id=HAR548445">Tommie Harris</a> out of Monday night's game against the <a href="/teams/greenbaypackers/profile?team=GB">Green Bay Packers</a> on a coach's decision.</p> <p>
<div class="pcard-wrapper  nfl-tag-right" id="pcard-JMEDKDWV-wrapper">
<iframe src="/widget/playercard?esbId=HAR548445&amp;season=2010&amp;gameId=" id="pcard-JMEDKDWV" frameborder="0"></iframe>
</div>
You said to build the URL, then put it into the soup wherever I want it. Can you point me to a recipe that does this, or enlighten me? I might even have done it in the past, but I'm having a memory lapse if I have. Thanks.

Something like this, maybe?
Spoiler:

Code:
def preprocess_html(self, soup):
        for item in soup.findAll(attrs={'style':True}):
            del item['style']
        print ' FIRST SOUP is: ', soup
        for pcard in soup.findAll(name='div', attrs={'class':'pcard-wrapper  nfl-tag-right'}):
            widget = pcard.find('iframe')
            print 'HEY W: ', widget
            pcard_url = widget.src
            print 'HERES THE PCARD_URL', pcard_url
            URL = 'http://www.nfl.com' + pcard_url
            newsoup = self.index_to_soup(URL)
            print 'here is the new soup: ', newsoup
            soup.insert(0, newsoup) #no clue on this but maybe 
       
        return soup


Just not grasping this one yet

Last edited by TonytheBookworm; 09-27-2010 at 11:21 PM. Reason: still pluggin
Old 09-28-2010, 09:24 AM   #10
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
something like this maybe? :confused
...
Just not grasping this one yet
Without setting up your recipe and running it, I can't be sure of the details, but yes, that's the basic idea. You found the pcard part of the URL in the main soup and constructed the link (URL) that you needed. You used soup = self.index_to_soup(URL) to grab that page and turn it into a soup. Now you have to extract() the tag_from_newsoup for whatever you need/want.

You don't want the <head>, etc. I haven't looked at that page, so I can't tell you exactly what or how much you'll want in tag_from_newsoup, but you know how to do that.

Once tag_from_newsoup is extracted, you can either soup.insert(wherever, tag_from_newsoup) or use replaceWith. I know you've used both of them previously. You might just use replaceWith on the <iframe> tag.
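
The overall shape of that last step (fetch the widget page, keep only what's inside the body, swap it in where the iframe was) can be sketched as follows. To keep it runnable outside Calibre, this uses standard-library regular expressions instead of BeautifulSoup's extract()/replaceWith, and a stand-in fetch function instead of self.index_to_soup; in the recipe you would do the same thing on the soup. The names inline_iframes, body_only and fake_fetch, and the widget markup, are invented for the example.

```python
import html
import re

IFRAME_RE = re.compile(r'<iframe\b[^>]*\bsrc="([^"]*)"[^>]*>\s*</iframe>', re.IGNORECASE)
BODY_RE = re.compile(r'<body\b[^>]*>(.*?)</body>', re.IGNORECASE | re.DOTALL)


def body_only(widget_html):
    """Keep what is inside <body>; drop <head> and everything else."""
    match = BODY_RE.search(widget_html)
    return match.group(1).strip() if match else widget_html


def inline_iframes(page_html, fetch):
    """Replace each <iframe> with the body of the page it points to.

    fetch is any callable that takes the iframe's src and returns HTML text.
    """
    def swap(match):
        src = html.unescape(match.group(1))  # &amp; -> &
        return body_only(fetch(src))
    return IFRAME_RE.sub(swap, page_html)


requested = []


def fake_fetch(src):
    # Stand-in for the real download; the returned markup is invented.
    requested.append(src)
    return ('<html><head><title>card</title></head>'
            '<body><div class="card">player stats here</div></body></html>')


page = ('<div class="pcard-wrapper">'
        '<iframe src="/widget/playercard?esbId=HAR548445&amp;season=2010&amp;gameId="'
        ' frameborder="0"></iframe></div>')
merged = inline_iframes(page, fake_fetch)
print(merged)
```

Replacing the iframe in place, rather than inserting at position 0 of the soup, keeps the player card inside its pcard-wrapper div where the article expects it.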

So you lied when you said "#no clue on this"

You've got most of it; it's just a matter of putting it all together. (Do I get partial author credit on this? Writing all these posts is harder than just writing the recipe.)
Old 09-28-2010, 11:48 AM   #11
TonytheBookworm
Quote:
Originally Posted by Starson17 View Post
Without setting up your recipe and running it, I

You've got most of it, it's just putting it all together (Do I get partial author credit on this - writing all these posts is harder than just writing the recipe )
Man, I'll give you full credit if you want. It doesn't matter to me, because without your help I wouldn't be doing this.

Also, how do you get the src from a tag?

I've also been wondering whether this recipe is worth all the trouble, so it might be a while before it gets completed.
Old 09-28-2010, 12:01 PM   #12
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
Man, I'll give you full credit if you want.


Quote:
Also, how do you get the src from a tag?
As in the src link from an <img> tag inside another tag called item in the soup?
Do this: item.img['src']

Quote:
I've also been wondering whether this recipe is worth all the trouble, so it might be a while before it gets completed.
I wondered when you were going to get to that point!

I find it more fun to figure out how to do the recipe than to actually write it. It's your recipe, not mine, so you're the author (if you ever finish the grunt work and get it functioning).