ESPN recipe fails

NSILMike · 12-10-2018, 12:12 PM

(the one by Kovid and Raman)

Trying to get latest version of recipe: espn
Python function terminated unexpectedly
HTTP Error 401: Unauthorized (Error Code: 1)
Traceback (most recent call last):
File "site.py", line 101, in main
File "site.py", line 78, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 199, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 35, in gui_convert_recipe
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 27, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 1106, in run
File "site-packages\calibre\customize\conversion.py", line 244, in __call__
File "site-packages\calibre\ebooks\conversion\plugins\recipe_input.py", line 135, in convert
File "site-packages\calibre\web\feeds\news.py", line 901, in __init__
File "<string>", line 82, in get_browser
File "site-packages\mechanize\_mechanize.py", line 254, in open
File "site-packages\mechanize\_mechanize.py", line 310, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 401: Unauthorized

kovidgoyal · 12-11-2018, 01:18 AM

I need ESPN account credentials to look at that.

NSILMike · 12-11-2018, 08:32 AM

You can create a login, or log in with Facebook. And if you don't have a login how did you originally create the recipe?

NSILMike · 12-20-2018, 12:06 PM

Quote:

Originally Posted by kovidgoyal

I need ESPN account credentials to look at that.

See my prior reply.

NSILMike · 12-21-2018, 10:49 AM

Just downloaded Calibre 3.36 which says ESPN recipe is improved. Now it doesn't fail, but it downloads only links...

kovidgoyal · 12-21-2018, 01:14 PM

Yeah, I looked at it briefly. ESPN uses a complicated JavaScript-based mechanism to log in, which I don't have the time or interest to reverse engineer.

NSILMike · 12-21-2018, 01:18 PM

Quote:

Originally Posted by kovidgoyal

Yeah, I looked at it briefly. ESPN uses a complicated JavaScript-based mechanism to log in, which I don't have the time or interest to reverse engineer.

Thanks, I appreciate your efforts.

biffhero · 08-06-2020, 11:45 PM

Quote:

Originally Posted by kovidgoyal

Yeah, I looked at it briefly. ESPN uses a complicated JavaScript-based mechanism to log in, which I don't have the time or interest to reverse engineer.

I think we can get a little closer if we don't try to log in to the site.

I don't know anything about calibre, but I did find some information that I think might be of help.

In the file ./resources/builtin_recipes.zip I found a file called espn.recipe.

Looking at it, and looking at the web site, I tried this web page:

http://sports.espn.go.com/espn/rss/nfl/news

which gave me a bunch of stuff, including a URL that looked like this:

https://www.espn.com/nfl/story/_/id/...king-full-list

On line 109 of espn.recipe I saw something that looked interesting, so I tried this page to get the story:

http://sports.espn.go.com/espn/print?id=29533526 which seems to work pretty well.

Line 115 suggested that this sort of URL might also be worth trying:

https://www.espn.com/espn/print?id=29533526&type=story

And that one works as well.

Maybe this is all that needs to be changed?

Thanks,
Rob
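A minimal sketch of the mapping described above, from a story link in the feed to the print page. The `/id/<digits>` pattern is an assumption based on the shape of the truncated story URL in this post, and the helper names `story_id` and `print_url` are hypothetical, not part of espn.recipe.

```python
import re

# Print-page form that worked in the post above.
PRINT_URL = 'https://www.espn.com/espn/print?id={}&type=story'


def story_id(url):
    """Return the numeric article id from an ESPN URL, or None if absent."""
    # Newer links carry the id in the path (/id/29533526/...);
    # older links carry it in the query string (?id=29533526).
    match = re.search(r'/id/(\d+)', url) or re.search(r'[?&]id=(\d+)', url)
    return match.group(1) if match else None


def print_url(url):
    """Map a story URL to its print page; return unknown URLs unchanged."""
    article_id = story_id(url)
    return PRINT_URL.format(article_id) if article_id else url
```

For example, `print_url('https://www.espn.com/nfl/story/_/id/29533526/example-slug')` (the slug here is made up) would yield the `print?id=29533526&type=story` address tried above.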

biffhero · 08-07-2020, 12:00 AM

Well, I can't tell whether it is working or not. Nothing is downloading because of an article-age limit that I don't understand.

I am getting this message a lot:

Skipping article Bubbles are working for other sports. Why did the NFL decide against one? (Tue, 28 Jul, 2020 11:04) from feed www.espn.com - NFL as it is too old.

I'll keep poking around for a cache somewhere.

Thanks,
Rob

kovidgoyal · 08-07-2020, 12:12 AM

Set oldest_article in the recipe to control that. And you don't need to look in builtin_recipes.zip to edit recipes, calibre has a UI for that. https://manual.calibre-ebook.com/news.html
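To illustrate why the "too old" message appears, here is a rough sketch of an age cutoff in days. It is not calibre's actual implementation; the function name `is_too_old` and its parameters are hypothetical, and only the idea (articles older than `oldest_article` days are skipped) comes from this thread.

```python
from datetime import datetime, timedelta


def is_too_old(published, oldest_article_days, now=None):
    """Return True if an article is older than the cutoff, given in days."""
    now = now or datetime.now()
    return now - published > timedelta(days=oldest_article_days)


# The article from the log message, checked on the day of the post.
published = datetime(2020, 7, 28, 11, 4)
checked_at = datetime(2020, 8, 7, 0, 0)
skipped_with_7_days = is_too_old(published, 7, now=checked_at)
skipped_with_14_days = is_too_old(published, 14, now=checked_at)
```

Under these assumptions the article is skipped with a 7-day cutoff and kept with a 14-day one, so raising oldest_article in the recipe is the relevant change.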

biffhero · 08-07-2020, 03:10 AM

Thank you!

That was exactly where I needed to start. I copied some things from the other ESPN script and left out others, because I don't yet understand what they do well enough to carry them over. Here's my script for now, in case anyone else wants to use it.

Code:

#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1596778396(BasicNewsRecipe):
    title          = 'espn_modified'
    description = 'Sports news'
    __author__ = 'Rob Walker'
    language = 'en'
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True
    encoding = 'ISO-8859-1'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup   = True

    remove_tags_before = dict(name='font', attrs={'class': 'date'})
    remove_tags = [
        dict(name='font', attrs={'class': 'footer'}), dict(
            name='hr', noshade='noshade'),
        dict(name='img', src='/winnercomm/horseracing/DRF.jpg')
    ]

    extra_css = '''
                body{font-family:Verdana,Arial,Helvetica,sans-serif; font-size:x-small; font-weight:normal;}
                .subhead{color:#666666;font-family:Verdana,sans-serif; font-size:x-small; font-weight:bold;}
                .clearfix{font-family:Verdana,sans-serif; font-size:xx-small; }
                .date{ font-family:Verdana,Arial,Helvetica,sans-serif ; font-size:xx-small;color:#7A7A7A;}
                .byline{ font-family:Verdana,Arial,Helvetica,sans-serif ; font-size:xx-small;color:#666666;}
                .headline{font-family:Verdana,Arial,Helvetica,sans-serif ; font-size:large; font-weight:bold;}
                '''
    
    feeds          = [
        ('Top Headlines', 'https://www.espn.com/espn/rss/news'),
        ('NFL', 'https://www.espn.com/espn/rss/nfl/news'),
        ('NBA', 'https://www.espn.com/espn/rss/nba/news'),
        ('MLB', 'https://www.espn.com/espn/rss/mlb/news'),
        ('NHL', 'https://www.espn.com/espn/rss/nhl/news'),
        ('Golf', 'https://www.espn.com/espn/rss/golf/news'),
        ('RPM', 'https://www.espn.com/espn/rss/rpm/news'),
        ('Boxing', 'https://www.espn.com/espn/rss/boxing/news'),
        ('Soccer', 'https://www.espn.com/espn/rss/soccer/news'),
        ('NCB', 'https://www.espn.com/espn/rss/ncb/news'),
        ('NCF', 'https://www.espn.com/espn/rss/ncf/news'),
        ('NCAA', 'https://www.espn.com/espn/rss/ncaa/news'),
        ('Olympics', 'https://www.espn.com/espn/rss/oly/news'),
        ('Equestrian', 'https://www.espn.com/espn/rss/horse/news'),
    ]
    
    def preprocess_html(self, soup):
        for div in soup.findAll('div', style=True):
            if 'px' in div['style']:
                div['style'] = ''

        return soup

    def postprocess_html(self, soup, first_fetch):
        for div in soup.findAll('div', style=True):
            div['style'] = div['style'].replace('center', 'left')

        return soup

biffhero · 08-07-2020, 03:02 PM

OK, I'm starting to understand how this stuff works. I think I'm making progress, but I'm not sure.

The base URL has changed.

feeds = [
    ('Top Headlines', 'http://sports.espn.go.com/espn/rss/news'),
    'http://sports.espn.go.com/espn/rss/nfl/news',
    'http://sports.espn.go.com/espn/rss/nba/news',
    'http://sports.espn.go.com/espn/rss/mlb/news',
    'http://sports.espn.go.com/espn/rss/nhl/news',
    'http://sports.espn.go.com/espn/rss/golf/news',
    'http://sports.espn.go.com/espn/rss/rpm/news',
    'http://sports.espn.go.com/espn/rss/tennis/news',
    'http://sports.espn.go.com/espn/rss/boxing/news',
    'http://soccernet.espn.go.com/rss/news',
    'http://sports.espn.go.com/espn/rss/ncb/news',
    'http://sports.espn.go.com/espn/rss/ncf/news',
    'http://sports.espn.go.com/espn/rss/ncaa/news',
    'http://sports.espn.go.com/espn/rss/outdoors/news',
    # 'http://sports.espn.go.com/espn/rss/bassmaster/news',
    'http://sports.espn.go.com/espn/rss/oly/news',
    'http://sports.espn.go.com/espn/rss/horse/news'
]

Therefore, in print_version() we need

return 'http://sports.espn.go.com/espn/print?' + match.group(1) + '&type=story'

However, what confuses me is where "match" gets set up.

When we land inside print_version, the variable "url" holds only the article number. For instance, this is a good URL: https://www.espn.com/espn/print?id=29581539&type=story. But the 'url' variable comes in as '29581539', and the 'match' variable is completely empty.

My current attempt has this in print_version(), which isn't working.

def print_version(self, url):
    if 'eticket' in url:
        return url.partition('&')[0].replace('story?', 'print?')
    match = re.search(r'story\?(id=\d+)', url)
    self.log.debug('url: %s' % (url))
    self.log.debug('match: %s' % (match.group(1)))
    match = 1
    articleId = url
    if match and 'soccernet' not in url and 'bassmaster' not in url:
        # return 'http://sports.espn.go.com/espn/print?' + match.group(1) + '&type=story'

        self.log.debug('i: %s' % (match.group(1)))

        # https://www.espn.com/espn/print?id=29581539&type=story
        # return 'http://www.espn.com/espn/print?' + match.group(1) + '&type=story'

I'll keep applying head to wall, but if this helps someone else get closer, that's good.

biffhero · 08-07-2020, 03:15 PM

Sigh, that was completely wrong. Here is the correct information.

-------------

OK, I'm starting to understand how this stuff works. I think I'm making progress, but I'm not sure.

The base URL has changed.

feeds = [
    ('Top Headlines', 'https://www.espn.com/espn/rss/news'),
    'https://www.espn.com/espn/rss/nfl/news',
    'https://www.espn.com/espn/rss/nba/news',
    'https://www.espn.com/espn/rss/mlb/news',
    'https://www.espn.com/espn/rss/nhl/news',
    'https://www.espn.com/espn/rss/golf/news',
    'https://www.espn.com/espn/rss/rpm/news',
    'https://www.espn.com/espn/rss/tennis/news',
    'https://www.espn.com/espn/rss/boxing/news',
    'https://www.espn.com/espn/rss/soccer/news',
    # 'http://soccernet.espn.go.com/rss/news',
    'https://www.espn.com/espn/rss/ncb/news',
    'https://www.espn.com/espn/rss/ncf/news',
    'https://www.espn.com/espn/rss/ncaa/news',
    # 'https://www.espn.com/espn/rss/outdoors/news',
    # 'http://sports.espn.go.com/espn/rss/bassmaster/news',
    'https://www.espn.com/espn/rss/oly/news',
    'https://www.espn.com/espn/rss/horse/news'
]

Therefore, in print_version() we need

return 'http://www.espn.com/espn/print?id=' + articleId + '&type=story'

However, what confuses me is where "match" gets set up.

When we land inside print_version, the variable "url" holds only the article number. For instance, this is a good URL: https://www.espn.com/espn/print?id=29581539&type=story. But the 'url' variable comes in as '29581539', and the 'match' variable is completely empty.

My current attempt has this in print_version(), which isn't working.

def print_version(self, url):
    if 'eticket' in url:
        return url.partition('&')[0].replace('story?', 'print?')
    match = re.search(r'story\?(id=\d+)', url)
    self.log.debug('url: %s' % (url))
    self.log.debug('match: %s' % (match.group(1)))
    match = 1
    articleId = url
    if match and 'soccernet' not in url and 'bassmaster' not in url:
        # return 'http://sports.espn.go.com/espn/print?' + match.group(1) + '&type=story'

        self.log.debug('i: %s' % (match.group(1)))

        # https://www.espn.com/espn/print?id=29581539&type=story
        # return 'http://www.espn.com/espn/print?' + match.group(1) + '&type=story'
        return 'http://www.espn.com/espn/print?id=' + articleId + '&type=story'

I'll keep applying head to wall, but if this helps someone else get closer, that's good.
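One likely reason the attempt above fails: when 'url' is just the article number, re.search() returns None, so match.group(1) raises an AttributeError before any URL is built (and after "match = 1" the name no longer refers to a match object at all). A hedged sketch of a version that guards against both cases follows. It is written as a plain function for testing; in a recipe it would take "self" as its first argument. This is not the fix Kovid committed, whose link is truncated in this thread.

```python
import re


def print_version(url):
    """Build the ESPN print URL from a bare article id or a story?id= URL."""
    # eticket pages: keep the URL, only swap story? for print?
    if 'eticket' in url:
        return url.partition('&')[0].replace('story?', 'print?')
    if url.isdigit():
        # The feed handed us the bare article number, e.g. '29581539'.
        article_id = url
    else:
        match = re.search(r'story\?id=(\d+)', url)
        if match is None:
            # Unknown URL shape: fall back to the original page.
            return url
        article_id = match.group(1)
    return 'https://www.espn.com/espn/print?id=' + article_id + '&type=story'
```

With the bare id from the post, `print_version('29581539')` gives the "good URL" quoted above.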

kovidgoyal · 08-08-2020, 07:33 AM

There you go: https://github.com/kovidgoyal/calibr...1b1879f8d2d13f

biffhero · 08-20-2020, 09:48 PM

I was out of town for a week, and I'm just getting back to this.

This works perfectly, thank you! I have imported it to ESPN_master, and it works great.

Thank you again,
Rob
