How do I get rid of this duplicate content?

kbookie · 07-15-2011, 04:06 AM

I modified Darko's BBC script to get the full story version of The Oakland Press (Oakland County Michigan).

In the debug output, it seems to be fetching the fullstory version, but the HTML that ends up in the book is from the paged version.

Can anyone tell me how to get and keep only the full version so that I don't have any duplicate content?

Code:

'''
theoaklandpress.com
'''
import re 
from calibre.web.feeds.recipes import BasicNewsRecipe

class Oakland_Press(BasicNewsRecipe):
    title                  = 'The Oakland Press'
    __author__             = 'Roger Easlick'
    description            = 'Oakland County News '
    oldest_article         = 2
    max_articles_per_feed  = 100
    no_stylesheets         = True
    #delay                  = 1
    use_embedded_content   = False
    encoding               = 'utf8'
    publisher              = 'The Oakland Press'
    category               = 'news'
    language               = 'en_US'
    publication_type       = 'newsportal'
    extra_css              = ' body{ font-family: Verdana,Helvetica,Arial,sans-serif } .introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
    preprocess_regexps     = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
    conversion_options = {
                             'comments'        : description
                            ,'tags'            : category
                            ,'language'        : language
                            ,'publisher'       : publisher
                            ,'linearize_tables': True
                         }

    keep_only_tags    = [
                        dict(name='div', attrs={'class':['story_headline']})
                       ,dict(name='div', attrs={'class':['story_timestamp']})
                       ,dict(name='p', attrs={'class':['byline']})
                       ,dict(name='div', attrs={'class':['story_body clear']})
                        ]

    remove_tags = [
                        dict(name='div', attrs={'class':['comments-link-block']})
                       ,dict(name='ul', attrs={'id':['paging']})
                        ]


    remove_attributes = ['width','height']

    feeds          = [
                      ('News', 'http://www.theoaklandpress.com/?rss=news'),
                    ]

    def print_version(self, url):
      return url + '?viewmode=fullstory'

Any help would be greatly appreciated!

Starson17 · 07-15-2011, 09:21 AM

It's not clear to me what the problem is. At first glance, the code looks correct to me. It's possible that the site won't let you grab the full story link until after you've gotten the paged version. I'd print out the entire soup to see what the site is sending you (if you haven't already done that). One alternative solution would be to write a multipage version of the recipe.
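For anyone who wants to try the "see what the site is sending you" step, here is a minimal sketch. It assumes a current calibre release, where recipes can override preprocess_raw_html(raw_html, url) to look at each page before any tag filtering happens. The mixin name RawHtmlDumpMixin and the 2000-character cutoff are made up for illustration.

```python
class RawHtmlDumpMixin(object):
    """Hypothetical helper: list it before BasicNewsRecipe in the recipe's
    base classes to print what the server returned for each article."""

    def preprocess_raw_html(self, raw_html, url):
        # Show which URL was fetched and how much markup came back.
        print('Fetched %s (%d characters)' % (url, len(raw_html)))
        # Print only the start of the page to keep the debug log readable.
        print(raw_html[:2000])
        # Return the markup unchanged so the recipe behaves as before.
        return raw_html
```

In the recipe this would be used as class Oakland_Press(RawHtmlDumpMixin, BasicNewsRecipe), and the output should show up when the recipe is run with verbose logging, for example ebook-convert myrecipe.recipe out.epub --test -vv.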

kbookie · 07-15-2011, 11:44 PM

Thanks for the speedy reply, Starson17.

I looked at it a couple more times and finally figured it out:

I was asking for the class called fullstory instead of the ID called fullstory, so the filter never matched the div that holds the full article.
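To illustrate the class-versus-ID mix-up outside of calibre, here is a small standard-library sketch (Python 3). The SAMPLE markup is invented for the example and is not copied from the site; DivFinder and find_divs are hypothetical helper names.

```python
from html.parser import HTMLParser

# Invented markup: one div selected by class, one selected by id.
SAMPLE = ('<div class="story_body clear">first page only</div>'
          '<div id="fullstory">complete article text</div>')


class DivFinder(HTMLParser):
    """Collect the text of every <div> whose attribute `key` equals `value`."""

    def __init__(self, key, value):
        super().__init__()
        self.key, self.value = key, value
        self.depth = 0      # > 0 while inside a matching div
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth:
            self.depth += 1                 # nested div inside a match
        elif dict(attrs).get(self.key) == self.value:
            self.depth = 1                  # start of a matching div
            self.found.append('')

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found[-1] += data


def find_divs(html, key, value):
    parser = DivFinder(key, value)
    parser.feed(html)
    parser.close()
    return parser.found


print(find_divs(SAMPLE, 'class', 'fullstory'))  # no div has that class
print(find_divs(SAMPLE, 'id', 'fullstory'))     # the id lookup finds it
```

The same distinction applies to keep_only_tags: attrs={'class': ...} and attrs={'id': ...} look at different attributes, so a filter on the wrong one silently matches nothing.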

Now the FULL recipe code looks like this and works like a charm. Not yet fancy, but it gets me the stories, anyway...

Code:

__license__   = 'GPL v3'
__copyright__ = '2011, Roger Easlick <roger.easlick at gmail.com>'
'''
theoaklandpress.com
'''
import re 
from calibre.web.feeds.recipes import BasicNewsRecipe

class Oakland_Press(BasicNewsRecipe):
    title                  = 'The Oakland Press'
    __author__             = 'Roger Easlick'
    description            = 'Oakland County News '
    oldest_article         = 2
    max_articles_per_feed  = 100
    no_stylesheets         = True
    #delay                  = 1
    use_embedded_content   = False
    encoding               = 'utf8'
    publisher              = 'The Oakland Press'
    category               = 'news'
    language               = 'en_US'
    publication_type       = 'newsportal'
    extra_css              = ' body{ font-family: Verdana,Helvetica,Arial,sans-serif } .introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
    preprocess_regexps     = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
    conversion_options = {
                             'comments'        : description
                            ,'tags'            : category
                            ,'language'        : language
                            ,'publisher'       : publisher
                            ,'linearize_tables': True
                         }

    keep_only_tags    = [
                        dict(name='div', attrs={'class':['story_headline']})
                       ,dict(name='div', attrs={'class':['story_timestamp']})
                       ,dict(name='div', attrs={'id':['fullstory']})
                        ]

    remove_tags = [
                        dict(name='div', attrs={'class':['comments-link-block']})
                       ,dict(name='ul', attrs={'id':['paging']})
                        ]


    remove_attributes = ['width','height']

    feeds          = [
                      ('News', 'http://www.theoaklandpress.com/?rss=news'),
                      ('Sports', 'http://www.theoaklandpress.com/?rss=sports'),
                      ('Business', 'http://business-news.thestreet.com/the-oakland-press/rss/109411'),
                      ('Personal Finance', 'http://business-news.thestreet.com/the-oakland-press/rss/627'),
                      ('Investing Tips', 'http://business-news.thestreet.com/the-oakland-press/rss/117429'),
                      ('Mobile & Gadgets', 'http://business-news.thestreet.com/the-oakland-press/rss/115115'),
                      ('Energy & Green', 'http://business-news.thestreet.com/the-oakland-press/rss/117435'),
                      ('Opinion', 'http://www.theoaklandpress.com/?rss=opinion'),
                      ('Entertainment', 'http://www.theoaklandpress.com/?rss=entertainment'),
                      ('Life', 'http://www.theoaklandpress.com/?rss=life'),
                      ('Luxury & Leisure', 'http://business-news.thestreet.com/the-oakland-press/rss/68877'),
                      ('Obituaries', 'http://www.legacy.com/obituaries/theoaklandpress/services/rss.ashx'),
                    ]

    def print_version(self, url):
      return url + '?viewmode=fullstory'
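One possible refinement of print_version, offered only as a sketch: plain string concatenation produces a malformed URL if a feed link ever arrives with its own query string. Whether any of these feeds actually do that is not known from this thread. The helper name with_fullstory is hypothetical, and the code assumes Python 3 (urllib.parse).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_fullstory(url):
    """Return `url` with viewmode=fullstory set in its query string."""
    parts = urlsplit(url)
    # Keep existing parameters, dropping any earlier viewmode value.
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != 'viewmode']
    query.append(('viewmode', 'fullstory'))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Inside the recipe, print_version could then simply return with_fullstory(url). For links without a query string the result is the same as the original url + '?viewmode=fullstory'.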

Starson17 · 07-16-2011, 09:54 AM

Quote:

Originally Posted by kbookie

Thanks for the speedy reply, Starson17.

I looked at it a couple more times and finally figured it out:

Great

Quote:

Not yet fancy, but it gets me the stories, anyway...

There's always some way to make it better. At some point you just say it's good enough and start another project.
