01-17-2011, 10:31 PM | #1 |
Member
Posts: 11
Karma: 10
Join Date: Jan 2011
Device: Kindle
|
Ars Technica recipe update
Hi,
Here is an update to the Ars Technica recipe. What is the proper way to handle authors and copyright? I have updated the copyright to include 2011 and added myself to the list of authors. Is this OK? Code:
__license__   = 'GPL v3'
__copyright__ = '2008-2011, Darko Miletic <darko.miletic at gmail.com>'
'''
arstechnica.com
'''

import re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag

class ArsTechnica(BasicNewsRecipe):
    title                 = u'Ars Technica'
    language              = 'en'
    __author__            = 'Darko Miletic, Sujata Raman, Alexis Rohou'
    description           = 'The art of technology'
    publisher             = 'Ars Technica'
    category              = 'news, IT, technology'
    oldest_article        = 5
    max_articles_per_feed = 100
    no_stylesheets        = True
    encoding              = 'utf-8'
    use_embedded_content  = False
    extra_css             = '''
        body {font-family: Arial,Helvetica,sans-serif}
        .title{text-align: left}
        .byline{font-weight: bold; line-height: 1em; font-size: 0.625em; text-decoration: none}
        .news-item-figure-caption-text{font-size:small; font-style:italic}
        .news-item-figure-caption-byline{font-size:small; font-style:italic; font-weight:bold}
        '''
    ignoreEtcArticles = True  # Etc feed items can be ignored, as they're not real stories

    conversion_options = {
        'comments'  : description,
        'tags'      : category,
        'language'  : language,
        'publisher' : publisher,
    }

    #preprocess_regexps = [
    #    (re.compile(r'<div class="news-item-figure', re.DOTALL|re.IGNORECASE), lambda match: '<div class="news-item-figure"'),
    #    (re.compile(r'</title>.*?</head>', re.DOTALL|re.IGNORECASE), lambda match: '</title></head>'),
    #]

    keep_only_tags = [dict(name='div', attrs={'id':['story','etc-story']})]

    remove_tags = [
        dict(name=['object','link','embed']),
        dict(name='div', attrs={'class':'read-more-link'}),
    ]
    #remove_attributes=['width','height']

    feeds = [
        (u'Infinite Loop (Apple content)'       , u'http://feeds.arstechnica.com/arstechnica/apple/'      ),
        (u'Opposable Thumbs (Gaming content)'   , u'http://feeds.arstechnica.com/arstechnica/gaming/'     ),
        (u'Gear and Gadgets'                    , u'http://feeds.arstechnica.com/arstechnica/gadgets/'    ),
        (u'Chipster (Hardware content)'         , u'http://feeds.arstechnica.com/arstechnica/hardware/'   ),
        (u'Uptime (IT content)'                 , u'http://feeds.arstechnica.com/arstechnica/business/'   ),
        (u'Open Ended (Open Source content)'    , u'http://feeds.arstechnica.com/arstechnica/open-source/'),
        (u'One Microsoft Way'                   , u'http://feeds.arstechnica.com/arstechnica/microsoft/'  ),
        (u'Nobel Intent (Science content)'      , u'http://feeds.arstechnica.com/arstechnica/science/'    ),
        (u'Law & Disorder (Tech policy content)', u'http://feeds.arstechnica.com/arstechnica/tech-policy/'),
    ]

    # This deals with multi-page stories
    def append_page(self, soup, appendtag, position):
        pager = soup.find('div', attrs={'class':'pager'})
        if pager:
            for atag in pager.findAll('a', href=True):
                link_text = self.tag_to_string(atag)
                if link_text.startswith('Next'):
                    nurl = 'http://arstechnica.com' + atag['href']
                    rawc = self.index_to_soup(nurl, True)
                    soup2 = BeautifulSoup(rawc, fromEncoding=self.encoding)

                    readmoretag = soup2.find('div', attrs={'class':'read-more-link'})
                    if readmoretag:
                        readmoretag.extract()
                    texttag = soup2.find('div', attrs={'class':'body'})
                    for it in texttag.findAll(style=True):
                        del it['style']

                    newpos = len(texttag.contents)
                    self.append_page(soup2, texttag, newpos)
                    texttag.extract()
                    pager.extract()
                    appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        # Adds line breaks near the byline (not sure why this is needed)
        ftag = soup.find('div', attrs={'class':'byline'})
        if ftag:
            brtag = Tag(soup, 'br')
            brtag2 = Tag(soup, 'br')
            ftag.insert(4, brtag)
            ftag.insert(5, brtag2)

        # Remove style items
        for item in soup.findAll(style=True):
            del item['style']

        # Remove id
        for item in soup.findAll(id=True):
            del item['id']

        # For some reason, links to authors don't have the domainname
        a_author = soup.find('a', {'href':re.compile("^/author")})
        if a_author:
            a_author['href'] = 'http://arstechnica.com' + a_author['href']

        # within div class news-item-figure, we need to grab images

        # Deal with multi-page stories
        self.append_page(soup, soup.body, 3)

        return soup

    def get_article_url(self, article):
        # If the article title starts with Etc:, don't return it
        if self.ignoreEtcArticles:
            article_title = article.get('title', None)
            if re.match('Etc: ', article_title) is not None:
                return None

        # The actual article is in a guid tag
        return article.get('guid', None).rpartition('?')[0]
|
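The URL selection at the end of the recipe (skip "Etc: " items, then strip the query string from the guid) can be tried outside calibre. What follows is only a minimal sketch, assuming a feed entry behaves like a plain dict with 'title' and 'guid' keys. The function name pick_article_url, its None guards, and the sample URLs in the usage lines are hypothetical and not part of the posted recipe. One edge case it illustrates: rpartition('?')[0] yields an empty string when the guid has no '?', so the sketch swaps in partition, which keeps the whole guid in that case.

```python
import re


def pick_article_url(article, ignore_etc=True):
    """Hypothetical standalone take on the recipe's get_article_url logic."""
    # Treat a missing title as an empty string so re.match never sees None
    title = article.get('title') or ''

    # Skip "Etc: " items, which the recipe treats as non-stories
    if ignore_etc and re.match('Etc: ', title):
        return None

    guid = article.get('guid')
    if not guid:
        return None

    # partition returns the full guid when there is no '?';
    # rpartition('?')[0] would return '' for such a guid
    return guid.partition('?')[0]


# Usage with made-up entries
print(pick_article_url({'title': 'Etc: link roundup', 'guid': 'http://arstechnica.com/a.ars?x=1'}))
print(pick_article_url({'title': 'A story', 'guid': 'http://arstechnica.com/a.ars?x=1'}))
print(pick_article_url({'title': 'A story', 'guid': 'http://arstechnica.com/a.ars'}))
```

Note that partition cuts at the first '?' while the recipe's rpartition cuts at the last one; the two agree whenever the guid contains exactly one '?'.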
01-17-2011, 10:43 PM | #2 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That's fine.
|
08-04-2011, 05:40 PM | #3 |
Junior Member
Posts: 7
Karma: 10
Join Date: Aug 2011
Device: Kindle 3
|
Just wanted to say many thanks for this; it's really helped me out.
|