01-16-2011, 01:20 AM   #4
mufc
Connoisseur
 
Posts: 99
Karma: 170
Join Date: Nov 2010
Location: Airdrie Alberta
Device: Sony 650
More articles deleted

"Added code to find and eliminate duplicating articles across multiple feeds (i.e., the same article showing up in Business and Investing)"
Spoiler:
def parse_feeds(self, *args, **kwargs):
    parsed_feeds = BasicNewsRecipe.parse_feeds(self, *args, **kwargs)
    # Eliminate the duplicates
    urlSet = set()

    for feed in parsed_feeds:
        newArticles = []
        for article in feed:
            if article.url in urlSet:
                feed.articles.remove(article)
            else:
                urlSet.add(article.url)
                newArticles.append(article)

        feed.articles = newArticles

    return parsed_feeds



If you run the recipe with and without this code, you will notice that whenever it deletes a duplicate article it also deletes the article directly below it, even if that article does not occur anywhere else, so you actually lose articles.
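My guess at the cause: the loop calls feed.articles.remove() on the same list it is iterating over, so after each removal the iterator skips the next entry, and that entry never makes it into newArticles. Here is a minimal sketch that reproduces this outside calibre. The Article and Feed classes, the helper names, and the sample URLs are hypothetical stand-ins I made up for illustration (I am assuming the real feed object iterates over its articles list); the fixed version simply drops the remove() call.

```python
class Article:
    """Hypothetical stand-in for a calibre article; only `url` matters here."""
    def __init__(self, url):
        self.url = url


class Feed:
    """Hypothetical stand-in for a feed; iterating it walks `articles`."""
    def __init__(self, urls):
        self.articles = [Article(u) for u in urls]

    def __iter__(self):
        return iter(self.articles)


def dedupe_buggy(feeds):
    # Same logic as the quoted parse_feeds loop.
    seen = set()
    for feed in feeds:
        kept = []
        for article in feed:
            if article.url in seen:
                # Mutates the list being iterated, so the next item is skipped.
                feed.articles.remove(article)
            else:
                seen.add(article.url)
                kept.append(article)
        feed.articles = kept
    return feeds


def dedupe_fixed(feeds):
    # Only build the new list; never touch feed.articles while looping.
    seen = set()
    for feed in feeds:
        kept = []
        for article in feed:
            if article.url not in seen:
                seen.add(article.url)
                kept.append(article)
        feed.articles = kept
    return feeds


def urls(feeds):
    # Helper to show what survived in each feed.
    return [[a.url for a in f.articles] for f in feeds]


def sample():
    # "a" appears in both feeds; "c" and "d" are unique to the second feed.
    return [Feed(["a", "b"]), Feed(["a", "c", "d"])]


print(urls(dedupe_buggy(sample())))  # [['a', 'b'], ['d']]  ("c" is lost)
print(urls(dedupe_fixed(sample())))  # [['a', 'b'], ['c', 'd']]
```

In the buggy run the unique article "c" disappears along with the duplicate "a", which matches what I see in the real recipe.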

Also
Spoiler:
def postprocess_html(self, soup, first_fetch):
    # Find and preserve single page article layout, can be first or last
    allArts = soup.findAll(True, {'id': 'article'})
    if len(allArts) == 2:
        if len(allArts[0].contents) > len(allArts[1].contents):
            allArts[1].extract()
        else:
            allArts[0].extract()

    return soup



All this did was get rid of the links to the rest of the pages; it did not add the rest of the article from those other pages.