07-15-2011, 05:06 AM   #1
kbookie
Junior Member
kbookie began at the beginning.
 
Posts: 2
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
How do I get rid of this duplicate content?

I modified Darko's BBC recipe to get the full-story version of The Oakland Press (Oakland County, Michigan).

In the debug output, it seems to be fetching the full-story version, but the resulting HTML is the paged version.

Can anyone tell me how to get and keep only the full version so that I don't have any duplicate content?

Code:
'''
theoaklandpress.com
'''
import re 
from calibre.web.feeds.recipes import BasicNewsRecipe

class Oakland_Press(BasicNewsRecipe):
    title                  = 'The Oakland Press'
    __author__             = 'Roger Easlick'
    description            = 'Oakland County News '
    oldest_article         = 2
    max_articles_per_feed  = 100
    no_stylesheets         = True
    #delay                  = 1
    use_embedded_content   = False
    encoding               = 'utf8'
    publisher              = 'The Oakland Press'
    category               = 'news'
    language               = 'en_US'
    publication_type       = 'newsportal'
    extra_css              = ' body{ font-family: Verdana,Helvetica,Arial,sans-serif } .introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
    preprocess_regexps     = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
    conversion_options = {
                             'comments'        : description
                            ,'tags'            : category
                            ,'language'        : language
                            ,'publisher'       : publisher
                            ,'linearize_tables': True
                         }

    keep_only_tags    = [
                       dict(name='div', attrs={'class':['story_headline']})
                       ,dict(name='div', attrs={'class':['story_timestamp']})
                       ,dict(name='p', attrs={'class':['byline']})
                       ,dict(name='div', attrs={'class':['story_body clear']})
                        ]

    remove_tags = [
                       dict(name='div', attrs={'class':['comments-link-block']})
                       ,dict(name='ul', attrs={'id':['paging']})
                        ]


    remove_attributes = ['width','height']

    feeds          = [
                      ('News', 'http://www.theoaklandpress.com/?rss=news'),
                    ]

    def print_version(self, url):
        return url + '?viewmode=fullstory'
Any help would be greatly appreciated!
07-15-2011, 10:21 AM   #2
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
It's not clear to me what the problem is. At first glance, the code looks correct. It's possible that the site won't serve the full-story link until after you've fetched the paged version. I'd print out the entire soup to see what the site is sending you (if you haven't already done that). One alternative would be to write a multipage version of the recipe.
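
One way to check what the site is sending is to look at the raw markup before any keep_only_tags or remove_tags cleanup runs. The following is only a minimal sketch, assuming a calibre version that provides the preprocess_raw_html(raw_html, url) hook. The class name OaklandPressDebug, the helper describe_page, and the stand-in base class are hypothetical; the stand-in exists only so the sketch can run outside calibre. The two markers checked are the ids mentioned in this thread (fullstory and paging), and the regex is a rough check, not a real HTML parse.

```python
import re

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    class BasicNewsRecipe(object):
        """Hypothetical stand-in so this sketch can run outside calibre."""


def describe_page(raw_html):
    """Report which of the thread's id markers appear in the raw markup."""
    def has_id(name):
        # Rough textual check for id="name" or id='name'.
        pattern = r"""\bid\s*=\s*["']%s["']""" % re.escape(name)
        return re.search(pattern, raw_html, re.IGNORECASE) is not None

    return {'fullstory': has_id('fullstory'), 'paging': has_id('paging')}


class OaklandPressDebug(BasicNewsRecipe):
    title = 'The Oakland Press (debug)'

    def preprocess_raw_html(self, raw_html, url):
        # Print a short report for every downloaded page, then hand the
        # markup back unchanged so the rest of the recipe behaves as before.
        print(url, describe_page(raw_html))
        return raw_html
```

In practice the method would be pasted into the real recipe class rather than used on its own. If the report shows paging but not fullstory, the site is probably serving the paged view for that URL.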
07-16-2011, 12:44 AM   #3
kbookie
Thanks for the speedy reply, Starson17.

I looked at it a couple more times and finally figured it out:

I was asking for the class called fullstory instead of the id called fullstory.
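
To illustrate the difference between the two lookups, here is a small standalone sketch using only the Python standard library. It is not calibre's matching code, and the SAMPLE fragment is hypothetical rather than copied from the site; it only shows that the same value can be absent as a class and present as an id on the same tag.

```python
from html.parser import HTMLParser


class AttrCounter(HTMLParser):
    """Count <div> tags whose given attribute contains the given value."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr = attr
        self.value = value
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        # An attribute may hold several space-separated tokens, as class does.
        tokens = (dict(attrs).get(self.attr) or '').split()
        if self.value in tokens:
            self.count += 1


def count_divs(html, attr, value):
    """Return how many <div> tags carry `value` in attribute `attr`."""
    parser = AttrCounter(attr, value)
    parser.feed(html)
    return parser.count


# Hypothetical page fragment, not copied from the real site.
SAMPLE = '<div class="story_body clear" id="fullstory"><p>Full text</p></div>'

print(count_divs(SAMPLE, 'class', 'fullstory'))  # prints 0
print(count_divs(SAMPLE, 'id', 'fullstory'))     # prints 1
```

In recipe terms, that is the difference between attrs={'class':['fullstory']} and attrs={'id':['fullstory']} in keep_only_tags.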

Now the FULL recipe code looks like this and works like a charm. Not yet fancy, but it gets me the stories, anyway...
Code:
__license__   = 'GPL v3'
__copyright__ = '2011, Roger Easlick <roger.easlick at gmail.com>'
'''
theoaklandpress.com
'''
import re 
from calibre.web.feeds.recipes import BasicNewsRecipe

class Oakland_Press(BasicNewsRecipe):
    title                  = 'The Oakland Press'
    __author__             = 'Roger Easlick'
    description            = 'Oakland County News '
    oldest_article         = 2
    max_articles_per_feed  = 100
    no_stylesheets         = True
    #delay                  = 1
    use_embedded_content   = False
    encoding               = 'utf8'
    publisher              = 'The Oakland Press'
    category               = 'news'
    language               = 'en_US'
    publication_type       = 'newsportal'
    extra_css              = ' body{ font-family: Verdana,Helvetica,Arial,sans-serif } .introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
    preprocess_regexps     = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
    conversion_options = {
                             'comments'        : description
                            ,'tags'            : category
                            ,'language'        : language
                            ,'publisher'       : publisher
                            ,'linearize_tables': True
                         }

    keep_only_tags    = [
                       dict(name='div', attrs={'class':['story_headline']})
                       ,dict(name='div', attrs={'class':['story_timestamp']})
                       ,dict(name='div', attrs={'id':['fullstory']})
                        ]

    remove_tags = [
                       dict(name='div', attrs={'class':['comments-link-block']})
                       ,dict(name='ul', attrs={'id':['paging']})
                        ]


    remove_attributes = ['width','height']

    feeds          = [
                      ('News', 'http://www.theoaklandpress.com/?rss=news'),
                      ('Sports', 'http://www.theoaklandpress.com/?rss=sports'),
                      ('Business', 'http://business-news.thestreet.com/the-oakland-press/rss/109411'),
                      ('Personal Finance', 'http://business-news.thestreet.com/the-oakland-press/rss/627'),
                      ('Investing Tips', 'http://business-news.thestreet.com/the-oakland-press/rss/117429'),
                      ('Mobile & Gadgets', 'http://business-news.thestreet.com/the-oakland-press/rss/115115'),
                      ('Energy & Green', 'http://business-news.thestreet.com/the-oakland-press/rss/117435'),
                      ('Opinion', 'http://www.theoaklandpress.com/?rss=opinion'),
                      ('Entertainment', 'http://www.theoaklandpress.com/?rss=entertainment'),
                      ('Life', 'http://www.theoaklandpress.com/?rss=life'),
                      ('Luxury & Leisure', 'http://business-news.thestreet.com/the-oakland-press/rss/68877'),
                      ('Obituaries', 'http://www.legacy.com/obituaries/theoaklandpress/services/rss.ashx'),
                    ]

    def print_version(self, url):
        return url + '?viewmode=fullstory'
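
The print_version above appends '?viewmode=fullstory' directly, which assumes the article URL has no query string of its own. If some feed links did carry one, a more defensive variant might look like the sketch below. This is an assumption about possible URL shapes, not something observed on the site; the helper name fullstory_url is hypothetical, and it uses Python 3's urllib.parse.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fullstory_url(url):
    """Return `url` with viewmode=fullstory set, keeping other parameters."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous viewmode value.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != 'viewmode']
    query.append(('viewmode', 'fullstory'))
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))


# example.com URLs are placeholders, not real article links.
print(fullstory_url('http://example.com/story'))
print(fullstory_url('http://example.com/story?page=2'))
```

Inside the recipe, print_version could then simply return fullstory_url(url).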
07-16-2011, 10:54 AM   #4
Starson17
Quote:
Originally Posted by kbookie
Thanks for the speedy reply, Starson17.

I looked at it a couple more times and finally figured it out:
Great

Quote:
Not yet fancy, but it gets me the stories, anyway...
There's always some way to make it better. At some point you just say it's good enough and start another project.