02-07-2011, 09:28 PM | #1 |
Junior Member
Posts: 5
Karma: 10
Join Date: Feb 2011
Device: kindle, nook, nookcolor, PDN
|
Help Fixing sltrib Recipe
I need help fixing a custom recipe for the SLC Tribune. The recipe partly works, but about every third news article comes out as garbage. Also, when I tried to add the Technology section, it did not pull any of the articles; I'm not sure why, since the other sections work. The recipe is given below. Many thanks!
SLTRIB RECIPE: Code:
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1278347258(BasicNewsRecipe):
    title = u'Salt Lake City Tribune'
    __author__ = 'Charles Holbert'
    oldest_article = 1
    max_articles_per_feed = 100
    description = '''Utah's independent news source since 1871'''
    publisher = 'http://www.sltrib.com/'
    category = 'news, Utah, SLC'
    language = 'en'
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    no_stylesheets = True

    remove_tags = [
        dict(name='div', attrs={'id': ['teaser', 'adCol', 'keywordStories']}),
        dict(name='div', attrs={'class': 'tripleWide datos'}),
    ]

    keep_only_tags = [
        dict(name='div', attrs={'class': 'theImage'}),
        dict(name='div', attrs={'id': 'topImageCaption'}),
        dict(name='div', attrs={'class': 'theHeadline entry-title'}),
        dict(name='div', attrs={'class': 'byline'}),
        dict(name='div', attrs={'id': 'storytext'}),
    ]

    feeds = [
        (u'SL Tribune Today', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=All'),
        (u'Utah News', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=UtahNews'),
        (u'Business News', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=Money'),
        (u'Most Popular', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rsspopular.csp'),
        (u'Sports', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=Sports'),
    ]

    extra_css = '''
        .theHeadline{font-family:Arial,Helvetica,sans-serif; font-size:xx-large; font-weight: bold; color:#0E5398;}
        .byline{font-family:Arial,Helvetica,sans-serif; color:#333333; font-size:xx-small;}
        .storytext{font-family:Arial,Helvetica,sans-serif; font-size:medium;}
        .articleText{font-family:Arial,Helvetica,sans-serif; font-size:medium;}
        .caption{font-family:Arial,Helvetica,sans-serif; font-size:xx-small; margin-bottom: 1em;}
    '''

Last edited by kovidgoyal; 02-07-2011 at 10:06 PM. |
02-10-2011, 12:39 PM | #2 |
Junior Member
Posts: 5
Karma: 10
Join Date: Feb 2011
Device: kindle, nook, nookcolor, PDN
|
Added delay = 1, which seemed to reduce the number of articles showing up as garbage. Also added code to download a cover image. If anyone is interested, the recipe is posted below. Still looking for advice on how to solve the remaining problems.
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1278347258(BasicNewsRecipe):
    title = u'Salt Lake City Tribune'
    __author__ = 'Charles Holbert'
    oldest_article = 1
    max_articles_per_feed = 100
    description = '''Utah's independent news source since 1871'''
    publisher = 'http://www.sltrib.com/'
    category = 'news, Utah, SLC'
    language = 'en'
    encoding = 'utf-8'
    delay = 1
    #simultaneous_downloads = 1
    remove_javascript = True
    use_embedded_content = False
    no_stylesheets = True
    #masthead_url = 'http://www.sltrib.com/csp/cms/sites/sltrib/assets/images/logo_main.png'
    #cover_url = 'http://webmedia.newseum.org/newseum-multimedia/dfp/jpg9/lg/UT_SLT.jpg'

    remove_tags = [
        dict(name='div', attrs={'id': ['teaser', 'adCol', 'keywordStories']}),
        dict(name='div', attrs={'class': 'tripleWide datos'}),
    ]

    keep_only_tags = [
        dict(name='div', attrs={'class': 'theImage'}),
        dict(name='div', attrs={'id': 'topImageCaption'}),
        dict(name='div', attrs={'class': 'theHeadline entry-title'}),
        dict(name='div', attrs={'class': 'byline'}),
        dict(name='div', attrs={'id': 'storytext'}),
    ]

    feeds = [
        (u'SL Tribune Today', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=All'),
        (u'Utah News', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=UtahNews'),
        (u'Business News', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=Money'),
        (u'Most Popular', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rsspopular.csp'),
        (u'Sports', u'http://www.sltrib.com/csp/cms/sites/sltrib/RSS/rss.csp?cat=Sports'),
    ]

    extra_css = '''
        .theHeadline{font-family:Arial,Helvetica,sans-serif; font-size:xx-large; font-weight: bold; color:#0E5398;}
        .byline{font-family:Arial,Helvetica,sans-serif; color:#333333; font-size:xx-small;}
        .storytext{font-family:Arial,Helvetica,sans-serif; font-size:medium;}
        .articleText{font-family:Arial,Helvetica,sans-serif; font-size:medium;}
        .caption{font-family:Arial,Helvetica,sans-serif; font-size:xx-small; margin-bottom: 1em;}
    '''

    def get_cover_url(self):
        cover_url = None
        href = 'http://www.newseum.org/todaysfrontpages/hr.asp?fpVname=UT_SLT&ref_pge=lst'
        soup = self.index_to_soup(href)
        div = soup.find('div', attrs={'class': 'tfpLrgView_container'})
        if div:
            cover_url = div.img['src']
        return cover_url
|
02-10-2011, 02:07 PM | #3 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
02-10-2011, 09:44 PM | #4 |
Junior Member
Posts: 5
Karma: 10
Join Date: Feb 2011
Device: kindle, nook, nookcolor, PDN
|
The problem does not occur with the same articles each time; different articles are affected on each re-download. Setting simultaneous_downloads = 1 did not help. However, I think it is automatically reduced to 1 when delay > 0, so it is no surprise that it made no difference.
|
02-11-2011, 10:23 AM | #5 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Or it might be a bad connection, or server-side limits or other server-side issues. Capturing the raw HTML and the live HTTP headers may help track it down. (I was wondering whether simultaneous_downloads = 1 helped when you weren't also setting delay > 0.) |
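One way to capture the raw HTML and the HTTP headers outside of calibre is a small standalone script. This is only a rough sketch using Python's standard urllib, not anything taken from the recipe; the names capture_response and save_capture are made up for illustration, and the opener argument exists only so the functions can be tried without a network connection.

```python
import urllib.request


def capture_response(url, opener=urllib.request.urlopen, timeout=30):
    """Fetch url and return (status, headers, raw_bytes) without parsing anything."""
    response = opener(url, timeout=timeout)
    try:
        status = response.status
        # Collapses repeated header names into one entry, which is
        # usually good enough for a quick comparison.
        headers = dict(response.headers.items())
        body = response.read()
    finally:
        response.close()
    return status, headers, body


def save_capture(url, path_prefix, opener=urllib.request.urlopen):
    """Write the raw body to <prefix>.html and the headers to <prefix>.headers.txt."""
    status, headers, body = capture_response(url, opener=opener)
    with open(path_prefix + '.html', 'wb') as f:
        f.write(body)
    with open(path_prefix + '.headers.txt', 'w', encoding='utf-8') as f:
        f.write('HTTP status: %s\n' % status)
        for name, value in sorted(headers.items()):
            f.write('%s: %s\n' % (name, value))
    return status, headers, len(body)
```

Running save_capture on the same article URL a few times and comparing a good download against a garbage one might show whether the server is sending something different (a redirect, a refresh, a different encoding) or whether the page itself is fine and the parsing is at fault.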
|
02-12-2011, 11:07 AM | #6 |
Junior Member
Posts: 5
Karma: 10
Join Date: Feb 2011
Device: kindle, nook, nookcolor, PDN
|
Implemented the suggested fix, based on Kovid's code:
    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        br.set_handle_refresh(False)
        return br

But this did not solve the problem, so I decided to grab the print version of the articles and parse those instead, and now everything works. If anyone is interested, I can post the recipe. |
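For anyone wanting to try the print-version approach in their own recipe: calibre recipes can override print_version(self, url) to return the URL of the printer-friendly page for each article. The sketch below shows only the URL-rewriting part as a plain function. The 'print=1' query flag is purely a placeholder assumption and is NOT the Tribune's real print URL scheme (that is in the attached recipe, not in this thread); to_print_url is a made-up name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_print_url(url, flag=('print', '1')):
    """Return url with a (hypothetical) print flag added to its query string.

    The flag name and value are placeholders; substitute whatever
    pattern the target site actually uses for its print pages.
    """
    parts = urlsplit(url)
    # Drop any existing copy of the flag so the function is idempotent.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != flag[0]]
    query.append(flag)
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))


# Inside a recipe class this would be wired up roughly as:
#
#     def print_version(self, url):
#         return to_print_url(url)
```

Note that the print page will most likely use different markup from the normal article page, so keep_only_tags and remove_tags would probably need to be adjusted to match it.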
03-07-2011, 03:28 PM | #7 |
Junior Member
Posts: 1
Karma: 10
Join Date: Mar 2011
Device: kindle
|
cfholbert, please post the recipe. I'm interested.
|
03-11-2011, 02:26 AM | #8 |
Junior Member
Posts: 5
Karma: 10
Join Date: Feb 2011
Device: kindle, nook, nookcolor, PDN
|
Recipe is attached.
|