Help Fixing sltrib Recipe

cfholbert · 02-07-2011, 09:28 PM

I need help fixing a custom recipe for the SLC Tribune. The recipe mostly works, but about every third news article comes out as garbage. Also, when I tried to add the Technology section, it did not pull any articles; I'm not sure why, since the other sections work. The recipe is given below. Many thanks!

SLTRIB RECIPE:

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1278347258(BasicNewsRecipe):
    title      = u'Salt Lake City Tribune'
    __author__ = 'Charles Holbert'
    oldest_article = 1
    max_articles_per_feed = 100

    description           = '''Utah's independent news source since 1871'''
    publisher             = 'http://www.sltrib.com/'
    category              = 'news, Utah, SLC'
    language              = 'en'
    encoding              = 'utf-8'
    remove_javascript     = True
    use_embedded_content  = False
    no_stylesheets        = True

    remove_tags = [dict(name='div',attrs={'id':['teaser','adCol', 'keywordStories']})
                  ,dict(name='div',attrs={'class':'tripleWide datos'})]


    keep_only_tags = [dict(name='div',attrs={'class':'theImage'})
                      ,dict(name='div',attrs={'id':'topImageCaption'})
                      ,dict(name='div',attrs={'class':'theHeadline entry-title'})
                      ,dict(name='div',attrs={'class':'byline'})
                      ,dict(name='div',attrs={'id':'storytext'})]

    feeds = [(u'SL Tribune Today', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=All'),
             (u'Utah News', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=UtahNews'),
             (u'Business News', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=Money'),
             (u'Most Popular', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rsspopular.csp'),
             (u'Sports', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=Sports')]

    extra_css = '''
                .theHeadline{font-family:Arial,Helvetica,sans-serif; font-size:xx-large; font-weight: bold; color:#0E5398;}
                .byline{font-family:Arial,Helvetica,sans-serif; color:#333333; font-size:xx-small;}
                .storytext{font-family:Arial,Helvetica,sans-serif; font-size:medium;}
                .articleText{font-family:Arial,Helvetica,sans-serif; font-size:medium;}
                .caption{font-family:Arial,Helvetica,sans-serif; font-size:xx-small; margin-bottom: 1em;}
                '''
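
One thing worth ruling out for the empty Technology section: `oldest_article = 1` skips feed items older than one day, so a section whose feed updates slowly can come back empty even though the feed URL is fine. The sketch below is a standalone check, not part of the recipe. `summarize_feed` is a hypothetical helper, and it assumes an RSS 2.0 document with `pubDate` elements. It counts how many items a feed contains and how many fall inside the cutoff.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET


def summarize_feed(rss_xml, oldest_article_days=1.0, now=None):
    """Return (total_items, items_newer_than_cutoff) for an RSS 2.0 document."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=oldest_article_days)
    total = recent = 0
    for item in ET.fromstring(rss_xml).iter('item'):
        total += 1
        pub = item.findtext('pubDate')
        try:
            when = parsedate_to_datetime(pub) if pub else None
        except (TypeError, ValueError):
            when = None
        if when is None:
            # Missing or unparseable date: counted as recent.
            # This is an assumption made for the sketch only.
            recent += 1
            continue
        if when.tzinfo is None:
            # Treat naive timestamps as UTC (also an assumption).
            when = when.replace(tzinfo=timezone.utc)
        if when >= cutoff:
            recent += 1
    return total, recent
```

If the first number is zero, the feed URL itself is the problem; if only the second is zero, raising `oldest_article` for a test run would be the next thing to try.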

cfholbert · 02-10-2011, 12:39 PM

Added delay = 1, and that seemed to reduce the number of articles showing up as garbage. Also added code to download a cover image. If anyone is interested, the recipe is posted below. Still looking for advice on how to solve the remaining problems.

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1278347258(BasicNewsRecipe):
    title      = u'Salt Lake City Tribune'
    __author__ = 'Charles Holbert'
    oldest_article = 1
    max_articles_per_feed = 100

    description           = '''Utah's independent news source since 1871'''
    publisher             = 'http://www.sltrib.com/'
    category              = 'news, Utah, SLC'
    language              = 'en'
    encoding              = 'utf-8'
    delay                 = 1
    #simultaneous_downloads = 1
    remove_javascript     = True
    use_embedded_content  = False
    no_stylesheets        = True

    #masthead_url = 'http://www.sltrib.com/csp/cms/sites/sltrib/assets/images/logo_main.png'
    #cover_url = 'http://webmedia.newseum.org/newseum-multimedia/dfp/jpg9/lg/UT_SLT.jpg'

    remove_tags = [dict(name='div',attrs={'id':['teaser','adCol', 'keywordStories']})
                  ,dict(name='div',attrs={'class':'tripleWide datos'})]

    keep_only_tags = [dict(name='div',attrs={'class':'theImage'})
                      ,dict(name='div',attrs={'id':'topImageCaption'})
                      ,dict(name='div',attrs={'class':'theHeadline entry-title'})
                      ,dict(name='div',attrs={'class':'byline'})
                      ,dict(name='div',attrs={'id':'storytext'})]

    feeds = [(u'SL Tribune Today', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=All'),
             (u'Utah News', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=UtahNews'),
             (u'Business News', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=Money'),
             (u'Most Popular', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rsspopular.csp'),
             (u'Sports', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=Sports')]

    extra_css = '''
                .theHeadline{font-family:Arial,Helvetica,sans-serif; font-size:xx-large; font-weight: bold; color:#0E5398;}
                .byline{font-family:Arial,Helvetica,sans-serif; color:#333333; font-size:xx-small;}
                .storytext{font-family:Arial,Helvetica,sans-serif; font-size:medium;}
                .articleText{font-family:Arial,Helvetica,sans-serif; font-size:medium;}
                .caption{font-family:Arial,Helvetica,sans-serif; font-size:xx-small; margin-bottom: 1em;}
                '''

    def get_cover_url(self):
        # Scrape today's front-page image from the Newseum page.
        cover_url = None
        href = 'http://www.newseum.org/todaysfrontpages/hr.asp?fpVname=UT_SLT&ref_pge=lst'
        soup = self.index_to_soup(href)
        div = soup.find('div',attrs={'class':'tfpLrgView_container'})
        # Guard against a missing div or a div without an <img>.
        if div is not None and div.img is not None:
            cover_url = div.img['src']
        return cover_url
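
The cover lookup can be tried outside calibre before putting it in the recipe. The following is a minimal sketch using only the standard library; `FirstImageInDiv` and `find_cover_url` are hypothetical names, and the class name `tfpLrgView_container` is written as one word, which is an assumption about the page markup. It returns `None` instead of raising when the div or the image is missing.

```python
from html.parser import HTMLParser


class FirstImageInDiv(HTMLParser):
    """Record the src of the first <img> nested inside a <div> of a given class."""

    def __init__(self, div_class):
        super().__init__()
        self.div_class = div_class
        self.depth = 0    # > 0 while the parser is inside the target div
        self.src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self.depth:
                # Nested div inside the target: track it so the
                # matching </div> does not end the target early.
                self.depth += 1
            elif self.div_class in (attrs.get('class') or '').split():
                self.depth = 1
        elif tag == 'img' and self.depth and self.src is None:
            self.src = attrs.get('src')

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1


def find_cover_url(html, div_class='tfpLrgView_container'):
    """Return the cover image URL found in html, or None if there is none."""
    parser = FirstImageInDiv(div_class)
    parser.feed(html)
    return parser.src
```

Feeding it a saved copy of the front-page HTML shows quickly whether the selector still matches.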

Starson17 · 02-10-2011, 02:07 PM

Quote:

Originally Posted by cfholbert

Added delay=1 and that seemed to help

Is the same article always the problem, or does it vary? Did simultaneous_downloads = 1 help at all? Is it just your connection, or do others see it?

cfholbert · 02-10-2011, 09:44 PM

The problem does not stick to the same articles; different articles are affected on each re-download. Setting simultaneous_downloads = 1 did not help. However, I think it defaults to 1 when delay > 0, so it is no surprise that it did not help.

Starson17 · 02-11-2011, 10:23 AM

Quote:

Originally Posted by cfholbert

The problem does not stick to the same articles; different articles are affected on each re-download. Setting simultaneous_downloads = 1 did not help. However, I think it defaults to 1 when delay > 0, so it is no surprise that it did not help.

It sounds sort of like a problem I encountered with advertisements that were randomly inserted in front of articles in an RSS feed. I had a solution (that I don't have time to search out, but which I posted here) and IIRC, Kovid posted a better alternative here, as well. Some searching should eventually find it.

Or it might be a bad connection, or server-side limits/issues. Capturing the raw HTML and the live HTTP headers may help track it down. (I was wondering whether simultaneous_downloads = 1 helped when you weren't also setting delay > 0.)
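
A rough way to capture the raw HTML and headers outside calibre, as a sketch using only the standard library. `capture` is a hypothetical helper, and the `id="storytext"` marker is taken from the recipe's `keep_only_tags`; the plain substring test is an assumption about how the attribute is written in the page. Running it several times against one article URL and comparing the saved files should show whether the server sometimes returns something other than the story.

```python
import urllib.request
from pathlib import Path


def capture(url, out_dir, label, marker='id="storytext"'):
    """Save the raw body and response headers for url.

    Writes <label>.html and <label>.headers.txt into out_dir and
    returns True if marker occurs in the raw body.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read()
        headers = str(resp.headers)
    (out / f'{label}.html').write_bytes(body)
    (out / f'{label}.headers.txt').write_text(headers)
    return marker.encode('utf-8') in body
```

A `False` result points at the saved `.html` file worth opening first; the matching headers file may show a redirect target, a cache hit, or an error status page.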

cfholbert · 02-12-2011, 11:07 AM

Implemented the suggested fix based on Kovid's code:

def get_browser(self):
    br = BasicNewsRecipe.get_browser(self)
    br.set_handle_refresh(False)
    return br

But this did not solve the problem. So I decided to grab the print version of the articles and parse those, and now everything works. If anyone is interested, I can post the recipe.
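
For readers who have not used the print-version approach: calibre calls a recipe's `print_version(self, url)` method with each article URL and downloads whatever URL it returns. The actual sltrib.com print URL scheme is not given in this thread, so the sketch below uses a HYPOTHETICAL `print=1` query parameter purely to show the shape of the URL rewrite; `to_print_url` is likewise a made-up helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_print_url(url, param='print', value='1'):
    """Return url with param=value set in its query string.

    HYPOTHETICAL: the parameter name and value are placeholders,
    not the site's real print-view switch.
    """
    parts = urlsplit(url)
    # Drop any existing copy of the parameter, keep everything else in order.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Inside a recipe this would be wired in roughly as:
#
#     def print_version(self, url):
#         return to_print_url(url)
```

Because the print page usually has different markup, `keep_only_tags` and `remove_tags` would most likely need adjusting to match it.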

tcfpcm · 03-07-2011, 03:28 PM

cfholbert, please post the recipe. I'm interested.

cfholbert · 03-11-2011, 02:26 AM

Recipe is attached.

