MobileRead Forums - View Single Post - How do I get rid of this duplicate content?

kbookie · 07-15-2011, 11:44 PM

Thanks for the speedy reply, Starson17.

I looked at it a couple more times and finally figured it out:

I was asking for a div with the class fullstory, but on the page fullstory is the div's ID, not its class.
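
To show why that one word matters, here is a minimal standalone sketch. It is not calibre's own matching code, just a simplified stand-in built on Python's standard html.parser; SAMPLE_HTML, TagCollector and count_matches are hypothetical names made up for the illustration, and the sample markup is assumed rather than copied from the real site.

```python
from html.parser import HTMLParser

# Assumed page fragment: the story container carries an id, not a class.
SAMPLE_HTML = (
    '<div class="story_headline">Title</div>'
    '<div id="fullstory">Body text</div>'
)


class TagCollector(HTMLParser):
    """Collect a (tag name, attribute dict) pair for every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def count_matches(html, spec):
    """Count tags matching a keep_only_tags-style spec.

    Simplified: an attribute matches only if its whole value is one of
    the wanted strings (multi-valued class attributes are not split).
    """
    collector = TagCollector()
    collector.feed(html)
    count = 0
    for tag, attrs in collector.tags:
        if tag != spec['name']:
            continue
        if all(attrs.get(key) in wanted
               for key, wanted in spec['attrs'].items()):
            count += 1
    return count


# The spec I had first (class) versus the one that works (id).
wrong = dict(name='div', attrs={'class': ['fullstory']})
right = dict(name='div', attrs={'id': ['fullstory']})

print(count_matches(SAMPLE_HTML, wrong))
print(count_matches(SAMPLE_HTML, right))
```

With the class-based spec nothing on the sample page matches, so the story body never gets kept; the id-based spec picks up the one div that holds it.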

Now the FULL recipe code looks like this and works like a charm. Not yet fancy, but it gets me the stories, anyway...

Code:

__license__   = 'GPL v3'
__copyright__ = '2011, Roger Easlick <roger.easlick at gmail.com>'
'''
theoaklandpress.com
'''
import re
from calibre.web.feeds.recipes import BasicNewsRecipe

class Oakland_Press(BasicNewsRecipe):
    title                  = 'The Oakland Press'
    __author__             = 'Roger Easlick'
    description            = 'Oakland County News'
    oldest_article         = 2
    max_articles_per_feed  = 100
    no_stylesheets         = True
    #delay                  = 1
    use_embedded_content   = False
    encoding               = 'utf8'
    publisher              = 'The Oakland Press'
    category               = 'news'
    language               = 'en_US'
    publication_type       = 'newsportal'
    extra_css              = ' body{ font-family: Verdana,Helvetica,Arial,sans-serif } .introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
    preprocess_regexps     = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
    conversion_options = {
                             'comments'        : description
                            ,'tags'            : category
                            ,'language'        : language
                            ,'publisher'       : publisher
                            ,'linearize_tables': True
                         }

    keep_only_tags    = [
                       dict(name='div', attrs={'class':['story_headline']})
                       ,dict(name='div', attrs={'class':['story_timestamp']})
                       ,dict(name='div', attrs={'id':['fullstory']})
                        ]

    remove_tags = [
                       dict(name='div', attrs={'class':['comments-link-block']})
                       ,dict(name='ul', attrs={'id':['paging']})
                        ]


    remove_attributes = ['width','height']

    feeds          = [
                      ('News', 'http://www.theoaklandpress.com/?rss=news'),
                      ('Sports', 'http://www.theoaklandpress.com/?rss=sports'),
                      ('Business', 'http://business-news.thestreet.com/the-oakland-press/rss/109411'),
                      ('Personal Finance', 'http://business-news.thestreet.com/the-oakland-press/rss/627'),
                      ('Investing Tips', 'http://business-news.thestreet.com/the-oakland-press/rss/117429'),
                      ('Mobile & Gadgets', 'http://business-news.thestreet.com/the-oakland-press/rss/115115'),
                      ('Energy & Green', 'http://business-news.thestreet.com/the-oakland-press/rss/117435'),
                      ('Opinion', 'http://www.theoaklandpress.com/?rss=opinion'),
                      ('Entertainment', 'http://www.theoaklandpress.com/?rss=entertainment'),
                      ('Life', 'http://www.theoaklandpress.com/?rss=life'),
                      ('Luxury & Leisure', 'http://business-news.thestreet.com/the-oakland-press/rss/68877'),
                      ('Obituaries', 'http://www.legacy.com/obituaries/theoaklandpress/services/rss.ashx'),
                    ]

    def print_version(self, url):
        # Ask the site for the full-story view of each article.
        return url + '?viewmode=fullstory'
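
One thing to watch with print_version: appending '?viewmode=fullstory' assumes the article URL has no query string of its own. I have not checked whether any of the feeds above hand out URLs that already contain a '?'. If they do, something along these lines could be used instead. This is only a sketch using the Python 3 standard library; with_fullstory is a hypothetical helper name and the example.com URLs are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_fullstory(url):
    """Return url with viewmode=fullstory added to its query string."""
    parts = urlsplit(url)
    # Keep whatever parameters are already there, then add ours.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(('viewmode', 'fullstory'))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_fullstory('http://example.com/story'))
print(with_fullstory('http://example.com/story?page=2'))
```

Inside the recipe, print_version would then just return with_fullstory(url).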