07-15-2011, 04:06 AM | #1 |
Junior Member
Posts: 2
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
How do I get rid of this duplicate content?
I modified Darko's BBC script to get the full-story version of The Oakland Press (Oakland County, Michigan).
In the debug output, the recipe seems to be fetching the fullstory URL, but the HTML I end up with is from the paged version. Can anyone tell me how to get and keep only the full version so that I don't have any duplicate content? Code:
'''
theoaklandpress.com
'''
import re
from calibre.web.feeds.recipes import BasicNewsRecipe

class Oakland_Press(BasicNewsRecipe):
    title                 = 'The Oakland Press'
    __author__            = 'Roger Easlick'
    description           = 'Oakland County News '
    oldest_article        = 2
    max_articles_per_feed = 100
    no_stylesheets        = True
    #delay                = 1
    use_embedded_content  = False
    encoding              = 'utf8'
    publisher             = 'The Oakland Press'
    category              = 'news'
    language              = 'en_US'
    publication_type      = 'newsportal'
    extra_css             = (
        ' body{ font-family: Verdana,Helvetica,Arial,sans-serif }'
        ' .introduction{font-weight: bold}'
        ' .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small}'
        ' .story-feature h2{text-align: center; text-transform: uppercase} '
    )
    preprocess_regexps    = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
    conversion_options    = {
        'comments'        : description,
        'tags'            : category,
        'language'        : language,
        'publisher'       : publisher,
        'linearize_tables': True,
    }

    keep_only_tags = [
        dict(name='div', attrs={'class':['story_headline']}),
        dict(name='div', attrs={'class':['story_timestamp']}),
        dict(name='p',   attrs={'class':['byline']}),
        dict(name='div', attrs={'class':['story_body clear']}),
    ]

    remove_tags = [
        dict(name='div', attrs={'class':['comments-link-block']}),
        dict(name='ul',  attrs={'id':['paging']}),
    ]

    remove_attributes = ['width','height']

    feeds = [
        ('News', 'http://www.theoaklandpress.com/?rss=news'),
    ]

    def print_version(self, url):
        return url + '?viewmode=fullstory'
|
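One thing that may be worth ruling out with print_version: appending '?viewmode=fullstory' only produces a valid URL when the feed link has no query string of its own. Below is a rough sketch of a more defensive way to build the URL with the standard library. The helper name with_fullstory is made up for illustration, and whether the site's feed links ever carry a query string is an assumption, not something checked against the site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_fullstory(url):
    """Return url with viewmode=fullstory set, keeping any other query parameters.

    Hypothetical helper: the parameter name 'viewmode' comes from the recipe
    above; everything else here is plain standard-library URL handling.
    """
    parts = urlsplit(url)
    # Keep every existing parameter except an old viewmode value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != 'viewmode']
    query.append(('viewmode', 'fullstory'))
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))
```

Inside the recipe, print_version could then return with_fullstory(url) instead of concatenating strings. For a link without a query string the result is the same as before.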
07-15-2011, 09:21 AM | #2 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
It's not clear to me what the problem is. At first glance, the code looks correct. It's possible that the site won't let you grab the full story link until after you've fetched the paged version. I'd print out the entire soup to see what the site is actually sending you (if you haven't already done that). An alternative would be to write a multipage version of the recipe.
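If dumping the whole soup is too noisy, a quicker check is to list only the ids and classes of the container tags in the page you actually received. This is a minimal sketch using Python's standard html.parser; the names ContainerLister and list_containers are invented for the example, and the sample markup is a guess built from the names used in the recipe, not the site's real HTML.

```python
from html.parser import HTMLParser

class ContainerLister(HTMLParser):
    """Collect (tag, id, class) for every div, p and ul that has an id or class."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in ('div', 'p', 'ul'):
            a = dict(attrs)
            if 'id' in a or 'class' in a:
                self.found.append((tag, a.get('id'), a.get('class')))

def list_containers(html):
    """Return the containers found in an HTML string, in document order."""
    parser = ContainerLister()
    parser.feed(html)
    parser.close()
    return parser.found

# Made-up sample page, only for demonstration.
sample = ('<div class="story_headline">Headline</div>'
          '<div id="fullstory"><p class="byline">By Someone</p></div>'
          '<ul id="paging"></ul>')

for tag, tag_id, tag_class in list_containers(sample):
    print(tag, 'id=%r' % tag_id, 'class=%r' % tag_class)
```

Running this over the HTML saved from a debug run shows at a glance whether the story container is identified by an id or by a class, which is what keep_only_tags has to match.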
|
07-15-2011, 11:44 PM | #3 |
Junior Member
Posts: 2
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
|
Thanks for the speedy reply, Starson17.
I looked at it a couple more times and finally figured it out: I was asking for the class called fullstory instead of the ID called fullstory. Now the FULL recipe code looks like this and works like a charm. Not yet fancy, but it gets me the stories, anyway... Code:
__license__   = 'GPL v3'
__copyright__ = '2011, Roger Easlick <roger.easlick at gmail.com>'
'''
theoaklandpress.com
'''
import re
from calibre.web.feeds.recipes import BasicNewsRecipe

class Oakland_Press(BasicNewsRecipe):
    title                 = 'The Oakland Press'
    __author__            = 'Roger Easlick'
    description           = 'Oakland County News '
    oldest_article        = 2
    max_articles_per_feed = 100
    no_stylesheets        = True
    #delay                = 1
    use_embedded_content  = False
    encoding              = 'utf8'
    publisher             = 'The Oakland Press'
    category              = 'news'
    language              = 'en_US'
    publication_type      = 'newsportal'
    extra_css             = (
        ' body{ font-family: Verdana,Helvetica,Arial,sans-serif }'
        ' .introduction{font-weight: bold}'
        ' .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small}'
        ' .story-feature h2{text-align: center; text-transform: uppercase} '
    )
    preprocess_regexps    = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
    conversion_options    = {
        'comments'        : description,
        'tags'            : category,
        'language'        : language,
        'publisher'       : publisher,
        'linearize_tables': True,
    }

    keep_only_tags = [
        dict(name='div', attrs={'class':['story_headline']}),
        dict(name='div', attrs={'class':['story_timestamp']}),
        dict(name='div', attrs={'id':['fullstory']}),
    ]

    remove_tags = [
        dict(name='div', attrs={'class':['comments-link-block']}),
        dict(name='ul',  attrs={'id':['paging']}),
    ]

    remove_attributes = ['width','height']

    feeds = [
        ('News', 'http://www.theoaklandpress.com/?rss=news'),
        ('Sports', 'http://www.theoaklandpress.com/?rss=sports'),
        ('Business', 'http://business-news.thestreet.com/the-oakland-press/rss/109411'),
        ('Personal Finance', 'http://business-news.thestreet.com/the-oakland-press/rss/627'),
        ('Investing Tips', 'http://business-news.thestreet.com/the-oakland-press/rss/117429'),
        ('Mobile & Gadgets', 'http://business-news.thestreet.com/the-oakland-press/rss/115115'),
        ('Energy & Green', 'http://business-news.thestreet.com/the-oakland-press/rss/117435'),
        ('Opinion', 'http://www.theoaklandpress.com/?rss=opinion'),
        ('Entertainment', 'http://www.theoaklandpress.com/?rss=entertainment'),
        ('Life', 'http://www.theoaklandpress.com/?rss=life'),
        ('Luxury & Leisure', 'http://business-news.thestreet.com/the-oakland-press/rss/68877'),
        ('Obituaries', 'http://www.legacy.com/obituaries/theoaklandpress/services/rss.ashx'),
    ]

    def print_version(self, url):
        return url + '?viewmode=fullstory'
|
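For anyone who hits the same class-versus-id mix-up, here is a toy illustration of why the wrong attribute key silently matches nothing. The matches function below is a simplified stand-in written for this post; it is not calibre's or BeautifulSoup's actual matching code (the real thing also handles regular expressions, multi-valued class attributes and more), and the element is a made-up example.

```python
def matches(tag_name, tag_attrs, spec):
    """Return True if an element satisfies a dict(name=..., attrs={...}) style spec.

    Simplified illustration only: exact string comparison, no regex support.
    """
    if spec.get('name') not in (None, tag_name):
        return False
    for key, wanted in spec.get('attrs', {}).items():
        have = tag_attrs.get(key)
        if have is None:
            # The element does not carry this attribute at all.
            return False
        wanted = wanted if isinstance(wanted, list) else [wanted]
        if have not in wanted:
            return False
    return True

# Hypothetical element: <div id="fullstory">
element = ('div', {'id': 'fullstory'})

by_class = dict(name='div', attrs={'class': ['fullstory']})
by_id    = dict(name='div', attrs={'id': ['fullstory']})

print(matches(*element, by_class))  # the div has no class attribute
print(matches(*element, by_id))     # the id matches
```

The spec keyed on 'class' fails because the div carries the value in its id attribute, so keep_only_tags never selects the full-story container and the recipe falls back to whatever else matched.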
07-16-2011, 09:54 AM | #4 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|