Old 01-13-2013, 11:40 AM   #1
rainrdx
Connoisseur
 
Posts: 55
Karma: 13316
Join Date: Jul 2012
Device: iPad
Smithsonian Mag

I thought I had posted this already, but somehow I can't find the original post. Maybe my memory's failing me again. Anyway, this is a terribly written recipe for Smithsonian Mag, but it works.

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class Smithsonian(BasicNewsRecipe):

    title       = 'Smithsonian Magazine'
    __author__  = 'Rick Shang'

    description = 'This magazine chronicles the arts, environment, sciences and popular culture of the times. It is edited for modern, well-rounded individuals with diverse, general interests. With your order, you become a National Associate Member of the Smithsonian. Membership benefits include your subscription to Smithsonian magazine, a personalized membership card, discounts from the Smithsonian catalog, and more.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [dict(attrs={'id':['articleTitle', 'subHead', 'byLine', 'articleImage', 'article-text']})]
    remove_tags = [dict(attrs={'class':['related-articles-inpage', 'viewMorePhotos']})]
    no_javascript = True
    no_stylesheets = True

    def parse_index(self):

        # Go to the archive page and follow the link to the current issue
        soup0 = self.index_to_soup('http://www.smithsonianmag.com/issue/archive/')
        div = soup0.find('div', attrs={'id': 'archives'})
        issue = div.find('ul', attrs={'class': 'clear-both'})
        current_issue_url = issue.find('a', href=True)['href']
        soup = self.index_to_soup(current_issue_url)

        # Go to the main body
        div = soup.find('div', attrs={'id': 'content-inset'})

        # Find date (the text after the colon in the issue heading)
        date = re.sub(r'.*:\W*', '', self.tag_to_string(div.find('h2')).strip())
        self.timefmt = u' [%s]' % date

        # Find cover
        self.cover_url = div.find('img', src=True)['src']

        feeds = OrderedDict()
        section_title = ''
        subsection_title = ''
        for post in div.findAll('div', attrs={'class': ['plainModule', 'departments plainModule']}):
            articles = []
            prefix = ''
            h3 = post.find('h3')
            if h3 is not None:
                # A block with a heading starts a new section
                section_title = self.tag_to_string(h3)
            else:
                subsection = post.find('p', attrs={'class': 'article-cat'})
                link = post.find('a', href=True)
                url = link['href'] + '?c=y&story=fullstory'
                if subsection is not None:
                    subsection_title = self.tag_to_string(subsection).strip()
                    prefix = subsection_title + ': '
                    description = self.tag_to_string(post('p', limit=2)[1]).strip()
                else:
                    if post.find('img') is not None:
                        subsection_title = self.tag_to_string(post.findPrevious('div', attrs={'class': 'departments plainModule'}).find('p', attrs={'class': 'article-cat'})).strip()
                        prefix = subsection_title + ': '

                    description = self.tag_to_string(post.find('p')).strip()
                # Split "summary By author". flags must be a keyword argument:
                # the fourth positional argument of re.sub is count.
                desc = re.sub(r'\sBy\s.*', '', description, flags=re.DOTALL)
                author = re.sub(r'.*By\s', '', description, flags=re.DOTALL)
                title = prefix + self.tag_to_string(link).strip() + u' (%s)' % author
                articles.append({'title': title, 'url': url, 'description': desc, 'date': ''})

            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        # items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
        ans = [(key, val) for key, val in feeds.items()]
        return ans
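
A side note on the description/author split in the recipe: the fourth positional argument of re.sub is count, not flags, so re.DOTALL is safest passed as flags=. Here is a minimal standalone sketch, assuming blurbs of the form "summary By author". The split_description helper and the sample text are made up for illustration and are not part of the recipe or of calibre.

```python
import re


def split_description(text):
    """Split a blurb like 'Summary text By Author Name' into (summary, author).

    Hypothetical helper for illustration only.
    """
    # Passing re.DOTALL positionally would set count=16 instead of the flag,
    # so it is given by keyword here.
    summary = re.sub(r'\sBy\s.*', '', text, flags=re.DOTALL)
    author = re.sub(r'.*By\s', '', text, flags=re.DOTALL)
    return summary.strip(), author.strip()


# Made-up sample blurb spanning two lines
sample = 'A look at the Arctic fox\nin winter. By Jane Doe'
summary, author = split_description(sample)
print(summary)
print(author)
```

With the DOTALL flag actually in effect, the `.` in both patterns also matches newlines, so a blurb that wraps over several lines is still split at the "By".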
Old 04-26-2013, 07:53 PM   #2
rainrdx
Update: fixed missing articles, better processing of article categories, minor bug fixes.

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class Smithsonian(BasicNewsRecipe):

    title       = 'Smithsonian Magazine'
    __author__  = 'Rick Shang'

    description = 'This magazine chronicles the arts, environment, sciences and popular culture of the times. It is edited for modern, well-rounded individuals with diverse, general interests. With your order, you become a National Associate Member of the Smithsonian. Membership benefits include your subscription to Smithsonian magazine, a personalized membership card, discounts from the Smithsonian catalog, and more.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [dict(attrs={'id':['articleTitle', 'subHead', 'byLine', 'articleImage', 'article-text']})]
    remove_tags = [dict(attrs={'class':['related-articles-inpage', 'viewMorePhotos']})]
    no_javascript = True
    no_stylesheets = True

    def parse_index(self):
        #Go to the issue
        soup0 = self.index_to_soup('http://www.smithsonianmag.com/issue/archive/')
        div = soup0.find('div',attrs={'id':'archives'})
        issue = div.find('ul',attrs={'class':'clear-both'})
        current_issue_url = issue.find('a', href=True)['href']
        soup = self.index_to_soup(current_issue_url)

        #Go to the main body
        div = soup.find('div', attrs={'id': 'article-body'})

        # Find date (the text after the colon in the issue heading)
        date = re.sub(r'.*:\W*', '', self.tag_to_string(div.find('h2')).strip())
        self.timefmt = u' [%s]'%date

        #Find cover
        self.cover_url = div.find('img',src=True)['src']

        feeds = OrderedDict()
        section_title = ''
        articles = []
        for post in div.findAll('div', attrs={'class': ['plainModule', 'departments plainModule']}):
            h3 = post.find('h3')
            if h3 is not None:
                # New section: store the articles collected for the previous one
                if articles:
                    if section_title not in feeds:
                        feeds[section_title] = []
                    feeds[section_title] += articles
                section_title = self.tag_to_string(h3)
                articles = []
                self.log('Found section:', section_title)
            else:
                link = post.find('a', href=True)
                # Nearest preceding category label, if there is one
                article_cat = link.findPrevious('p', attrs={'class': 'article-cat'})
                url = link['href'] + '?c=y&story=fullstory'
                description = self.tag_to_string(post.findAll('p')[-1]).strip()
                title = self.tag_to_string(link).strip()
                if article_cat is not None:
                    title += u' (%s)' % self.tag_to_string(article_cat).strip()
                self.log('\tFound article:', title)
                articles.append({'title': title, 'url': url, 'description': description, 'date': ''})

        # Store the articles of the last section
        if articles:
            if section_title not in feeds:
                feeds[section_title] = []
            feeds[section_title] += articles

        # items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
        ans = [(key, val) for key, val in feeds.items()]
        return ans
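
For anyone adapting this: parse_index has to return a list of (section title, list of article dicts) pairs, which is what the OrderedDict bookkeeping in the recipe builds. Below is a rough standalone sketch of just that grouping step. The group_by_section helper and the example.com rows are hypothetical; only the dict keys match the recipe.

```python
from collections import OrderedDict


def group_by_section(rows):
    """Group (section_title, article_dict) pairs by section, keeping page order.

    Hypothetical helper for illustration; returns the list-of-tuples shape
    that parse_index is expected to return.
    """
    feeds = OrderedDict()
    for section, article in rows:
        feeds.setdefault(section, []).append(article)
    return list(feeds.items())


def make_article(title):
    # Same keys as the article dicts built in the recipe; URL is a placeholder
    return {'title': title, 'url': 'http://example.com/' + title.lower(),
            'description': '', 'date': ''}


rows = [
    ('Features', make_article('A')),
    ('Departments', make_article('B')),
    ('Features', make_article('C')),
]
index = group_by_section(rows)
for section, articles in index:
    print(section, [a['title'] for a in articles])
```

Sections come out in the order they were first seen, and articles that belong to a section seen earlier are appended to it rather than creating a second feed with the same name.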
All times are GMT -4. The time now is 04:28 AM.

