02-07-2011, 09:28 PM   #1
cfholbert
Junior Member
cfholbert began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Feb 2011
Device: kindle, nook, nookcolor, PDN
Help Fixing sltrib Recipe

I need help fixing a custom recipe for the Salt Lake Tribune. The recipe mostly works, but about every third news article comes out as garbage. Also, when I tried to add the Technology section, it did not pull any of the articles. I'm not sure why, since the other sections work. The recipe is given below. Many thanks!

SLTRIB RECIPE:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1278347258(BasicNewsRecipe):
    title      = u'Salt Lake City Tribune'
    __author__ = 'Charles Holbert'
    oldest_article = 1
    max_articles_per_feed = 100

    description           = '''Utah's independent news source since 1871'''
    publisher             = 'http://www.sltrib.com/'
    category              = 'news, Utah, SLC'
    language              = 'en'
    encoding              = 'utf-8'
    remove_javascript     = True
    use_embedded_content  = False
    no_stylesheets        = True

    remove_tags = [dict(name='div',attrs={'id':['teaser','adCol', 'keywordStories']})
                  ,dict(name='div',attrs={'class':'tripleWide datos'})]


    keep_only_tags = [dict(name='div',attrs={'class':'theImage'})
                      ,dict(name='div',attrs={'id':'topImageCaption'})
                      ,dict(name='div',attrs={'class':'theHeadline entry-title'})
                      ,dict(name='div',attrs={'class':'byline'})
                      ,dict(name='div',attrs={'id':'storytext'})]

    feeds = [(u'SL Tribune Today', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=All'),
             (u'Utah News', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=UtahNews'),
             (u'Business News', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=Money'),
             (u'Most Popular', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rsspopular.csp'),
             (u'Sports', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=Sports')]

    extra_css = '''
                .theHeadline{font-family:Arial,Helvetica,sans-serif; font-size:xx-large; font-weight: bold; color:#0E5398;}
                .byline{font-family:Arial,Helvetica,sans-serif; color:#333333; font-size:xx-small;}
                .storytext{font-family:Arial,Helvetica,sans-serif; font-size:medium;}
                .articleText{font-family:Arial,Helvetica,sans-serif; font-size:medium;}
                .caption{font-family:Arial,Helvetica,sans-serif; font-size:xx-small; margin-bottom: 1em;}
                '''

Last edited by kovidgoyal; 02-07-2011 at 10:06 PM.
02-10-2011, 12:39 PM   #2
cfholbert
Added delay=1 and that seemed to help reduce the number of articles showing up as garbage. I also added code to download a cover page. If anyone is interested, the recipe is posted below. I'm still looking for advice on how to solve the remaining problems.


Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1278347258(BasicNewsRecipe):
    title = u'Salt Lake City Tribune'
    __author__ = 'Charles Holbert'
    oldest_article = 1
    max_articles_per_feed = 100

    description = '''Utah's independent news source since 1871'''
    publisher = 'http://www.sltrib.com/'
    category = 'news, Utah, SLC'
    language = 'en'
    encoding = 'utf-8'
    delay = 1
    #simultaneous_downloads = 1
    remove_javascript = True
    use_embedded_content = False
    no_stylesheets = True

    #masthead_url = 'http://www.sltrib.com/csp/cms/sites/sltrib/assets/images/logo_main.png'
    #cover_url = 'http://webmedia.newseum.org/newseum-multimedia/dfp/jpg9/lg/UT_SLT.jpg'

    remove_tags = [dict(name='div',attrs={'id':['teaser','adCol', 'keywordStories']})
                  ,dict(name='div',attrs={'class':'tripleWide datos'})]

    keep_only_tags = [dict(name='div',attrs={'class':'theImage'})
                      ,dict(name='div',attrs={'id':'topImageCaption'})
                      ,dict(name='div',attrs={'class':'theHeadline entry-title'})
                      ,dict(name='div',attrs={'class':'byline'})
                      ,dict(name='div',attrs={'id':'storytext'})]

    feeds = [(u'SL Tribune Today', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=All'),
             (u'Utah News', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=UtahNews'),
             (u'Business News', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=Money'),
             (u'Most Popular', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rsspopular.csp'),
             (u'Sports', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=Sports')]

    extra_css = '''
                .theHeadline{font-family:Arial,Helvetica,sans-serif; font-size:xx-large; font-weight: bold; color:#0E5398;}
                .byline{font-family:Arial,Helvetica,sans-serif; color:#333333; font-size:xx-small;}
                .storytext{font-family:Arial,Helvetica,sans-serif; font-size:medium;}
                .articleText{font-family:Arial,Helvetica,sans-serif; font-size:medium;}
                .caption{font-family:Arial,Helvetica,sans-serif; font-size:xx-small; margin-bottom: 1em;}
                '''

    def get_cover_url(self):
        cover_url = None
        href = 'http://www.newseum.org/todaysfrontpages/hr.asp?fpVname=UT_SLT&ref_pge=lst'
        soup = self.index_to_soup(href)
        div = soup.find('div',attrs={'class':'tfpLrgView_container'})
        if div:
            cover_url = div.img['src']
        return cover_url
02-10-2011, 02:07 PM   #3
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by cfholbert
Added delay=1 and that seemed to help
Is the same article always the problem, or does it vary? Did simultaneous_downloads = 1 help at all? Is it just your connection, or do others see it?
02-10-2011, 09:44 PM   #4
cfholbert
The problem does not occur with the same article; it moves to different articles on each re-download. Setting simultaneous_downloads = 1 did not help. However, I think it defaults to 1 when delay > 0, so it's no surprise that it made no difference.
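
For reference, the calibre documentation does say that simultaneous_downloads is automatically reduced to 1 whenever delay > 0, which would explain why setting it by hand changed nothing. Below is a toy model of that rule, not calibre's actual code: the class name RecipeSettings is made up for illustration, and the default of 5 is what I believe calibre uses.

```python
class RecipeSettings:
    """Toy stand-in for two BasicNewsRecipe attributes (hypothetical class)."""

    def __init__(self, delay=0, simultaneous_downloads=5):
        self.delay = delay
        self.simultaneous_downloads = simultaneous_downloads
        # Documented calibre behaviour: any positive delay forces serial downloads.
        if self.delay > 0:
            self.simultaneous_downloads = 1


# With delay = 1 the explicit setting is irrelevant; both end up serial.
with_delay = RecipeSettings(delay=1)
with_both = RecipeSettings(delay=1, simultaneous_downloads=1)

# To test simultaneous_downloads = 1 on its own, delay has to stay at 0.
serial_only = RecipeSettings(delay=0, simultaneous_downloads=1)

print(with_delay.simultaneous_downloads,
      with_both.simultaneous_downloads,
      serial_only.simultaneous_downloads,
      serial_only.delay)
```

So a fair test of simultaneous_downloads = 1 by itself would need delay left at 0 (or removed) for that run.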
02-11-2011, 10:23 AM   #5
Starson17
Quote:
Originally Posted by cfholbert
The problem does not occur with the same article; it moves to different articles on each re-download. Setting simultaneous_downloads = 1 did not help. However, I think it defaults to 1 when delay > 0, so it's no surprise that it made no difference.
It sounds sort of like a problem I encountered with advertisements that were randomly inserted in front of articles in an RSS feed. I had a solution (that I don't have time to search out, but which I posted here) and IIRC, Kovid posted a better alternative here, as well. Some searching should eventually find it.

Or it might be a bad connection, or server-side limits/issues. Capturing the raw HTML and the live HTTP headers may help track it down. (I was wondering whether simultaneous_downloads = 1 helped when you weren't also setting delay > 0.)
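
A minimal sketch of capturing the headers and raw HTML outside calibre, using only the Python standard library. The function names capture, looks_like_article and save_capture are made up for this example, and the id="storytext" marker is simply borrowed from the recipe's keep_only_tags; it is a crude byte-string check, not a parser, and it assumes the site writes the attribute exactly that way.

```python
import urllib.request


def capture(url, opener=urllib.request.urlopen, timeout=30):
    """Fetch url and return (status, headers, raw_bytes) for inspection."""
    with opener(url, timeout=timeout) as resp:
        return resp.status, dict(resp.headers.items()), resp.read()


def looks_like_article(raw, marker=b'id="storytext"'):
    """Crude check: does the page contain the story container at all?"""
    return marker in raw


def save_capture(path, status, headers, raw):
    """Write the status, the headers, a blank line, then the raw body."""
    with open(path, "wb") as fh:
        fh.write(("HTTP status: %d\n" % status).encode("utf-8"))
        for name, value in headers.items():
            fh.write(("%s: %s\n" % (name, value)).encode("utf-8"))
        fh.write(b"\n")
        fh.write(raw)
```

The idea would be to run capture() on one article URL several times; whenever looks_like_article() returns False, save that response and compare its headers and body with a good one. An ad page or a Refresh header should be visible there.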
02-12-2011, 11:07 AM   #6
cfholbert
I implemented the suggested fix based on Kovid's code:

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        br.set_handle_refresh(False)
        return br

But this did not solve the problem. So I decided to grab the print version of each article and parse that instead, and now everything works. If anyone is interested, I can post the recipe.
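
For anyone wanting to try the same approach: calibre recipes can override print_version(self, url) to return the printer-friendly URL for each article. The sketch below only shows the URL rewriting as a plain function. The print=1 query parameter is a made-up placeholder, not the Tribune's real print URL scheme, and to_print_url is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_print_url(url, flag=("print", "1")):
    """Return url with a (hypothetical) print flag appended to its query string."""
    parts = urlsplit(url)
    # Keep any existing query parameters and add the flag at the end.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(flag)
    return urlunsplit(parts._replace(query=urlencode(query)))


print(to_print_url("http://www.example.com/story?id=5"))
```

Inside a recipe, the same logic would sit in a print_version(self, url) method that returns to_print_url(url), and keep_only_tags would probably need adjusting to the markup of the print page.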
03-07-2011, 03:28 PM   #7
tcfpcm
Junior Member
tcfpcm began at the beginning.
 
Posts: 1
Karma: 10
Join Date: Mar 2011
Device: kindle
cfholbert, please post the recipe. I'm interested.
03-11-2011, 02:26 AM   #8
cfholbert
Recipe is attached.
Attached Files
File Type: zip sltrib.zip (1.1 KB, 154 views)
All times are GMT -4.

