Old 08-25-2012, 01:14 AM   #1
rainrdx
Connoisseur
Posts: 55
Karma: 13316
Join Date: Jul 2012
Device: iPad
Business Week Magazine

Unlike the Business Week recipes already included with calibre, this one does not read from the Business Week news RSS feeds. It replicates the weekly magazine issue.

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class BusinessWeekMagazine(BasicNewsRecipe):

    title       = 'Business Week Magazine'
    __author__  = 'Rick Shang'

    description = 'A renowned business publication. Business news, trends and profiles of successful businesspeople.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [
			dict(name='div', attrs={'id':'article_body_container'}),
			]
    remove_tags = [dict(name='ui'),dict(name='li')]
    no_javascript = True
    no_stylesheets = True
	
    cover_url             = 'http://images.businessweek.com/mz/covers/current_120x160.jpg'

    def parse_index(self):

        #Go to the issue
        soup = self.index_to_soup('http://www.businessweek.com/magazine/news/articles/business_news.htm')

        #Find date
        mag=soup.find('h2',text='Magazine')
        self.log(mag)
        dates=self.tag_to_string(mag.findNext('h3'))
        self.timefmt = u' [%s]'%dates

        #Go to the main body
        div0 = soup.find ('div', attrs={'class':'column left'})
        section_title = ''
        feeds = OrderedDict()
        for div in div0.findAll('a'):
            articles = []
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div).strip()
            url=div['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('li', attrs={'class':'print'}).a['href']
            articles.append({'title':title, 'url':urlprint, 'description':'', 'date':''})

            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        div1 = soup.find ('div', attrs={'class':'column center'})
        section_title = ''
        for div in div1.findAll('a'):
            articles = []
            desc=self.tag_to_string(div.findNext('p')).strip()
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div).strip()
            url=div['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('li', attrs={'class':'print'}).a['href']
            articles.append({'title':title, 'url':urlprint, 'description':desc, 'date':''})

            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        ans = [(key, val) for key, val in feeds.iteritems()]
        return ans
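
For readers new to calibre recipes: parse_index is expected to return a list of (section title, list of article dicts) tuples, which is what the OrderedDict bookkeeping in the recipe builds up. A minimal standalone sketch of just that grouping step follows. The helper name group_by_section and the sample entries are made up for illustration and are not part of the recipe or of calibre.

```python
from collections import OrderedDict


def group_by_section(entries):
    """Group (section_title, article) pairs, keeping first-seen section order."""
    feeds = OrderedDict()
    for section_title, article in entries:
        feeds.setdefault(section_title, []).append(article)
    # Same shape parse_index returns: [(section_title, [article, ...]), ...]
    return list(feeds.items())


# Hypothetical entries; the real recipe scrapes these from the issue page.
entries = [
    ('Opening Remarks', {'title': 'First story', 'url': 'http://example.com/1',
                         'description': '', 'date': ''}),
    ('Features', {'title': 'Second story', 'url': 'http://example.com/2',
                  'description': 'A teaser', 'date': ''}),
    ('Opening Remarks', {'title': 'Third story', 'url': 'http://example.com/3',
                         'description': '', 'date': ''}),
]

index = group_by_section(entries)
for section_title, articles in index:
    print(section_title, [a['title'] for a in articles])
```

Using setdefault avoids the separate "if section_title not in feeds" check, but the result should be the same as what the recipe produces.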
Old 09-01-2012, 08:13 PM   #2
rainrdx
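Update: bug fix. The index loops now walk the h4 (left column) and h5 (center column) headline tags and take the link inside each one, instead of walking every a tag in the column.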
Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class BusinessWeekMagazine(BasicNewsRecipe):

    title       = 'Business Week Magazine'
    __author__  = 'Rick Shang'

    description = 'A renowned business publication. Business news, trends and profiles of successful businesspeople.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [
			dict(name='div', attrs={'id':'article_body_container'}),
			]
    remove_tags = [dict(name='ui'),dict(name='li')]
    no_javascript = True
    no_stylesheets = True
	
    cover_url             = 'http://images.businessweek.com/mz/covers/current_120x160.jpg'

    def parse_index(self):

        #Go to the issue
        soup = self.index_to_soup('http://www.businessweek.com/magazine/news/articles/business_news.htm')

        #Find date
        mag=soup.find('h2',text='Magazine')
        self.log(mag)
        dates=self.tag_to_string(mag.findNext('h3'))
        self.timefmt = u' [%s]'%dates

        #Go to the main body
        div0 = soup.find ('div', attrs={'class':'column left'})
        section_title = ''
        feeds = OrderedDict()
        for div in div0.findAll('h4'):
            articles = []
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div.a).strip()
            url=div.a['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('li', attrs={'class':'print'}).a['href']
            articles.append({'title':title, 'url':urlprint, 'description':'', 'date':''})

            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        div1 = soup.find ('div', attrs={'class':'column center'})
        section_title = ''
        for div in div1.findAll('h5'):
            articles = []
            desc=self.tag_to_string(div.findNext('p')).strip()
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div.a).strip()
            url=div.a['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('li', attrs={'class':'print'}).a['href']
            articles.append({'title':title, 'url':urlprint, 'description':desc, 'date':''})

            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        ans = [(key, val) for key, val in feeds.iteritems()]
        return ans
Old 01-11-2013, 01:48 AM   #3
rainrdx
Update: polished the article layout (the share-email div is now stripped from each article).

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class BusinessWeekMagazine(BasicNewsRecipe):

    title       = 'Business Week Magazine'
    __author__  = 'Rick Shang'

    description = 'A renowned business publication. Business news, trends and profiles of successful businesspeople.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [
			dict(name='div', attrs={'id':'article_body_container'}),
			]
    remove_tags = [dict(name='ui'),dict(name='li'),dict(name='div', attrs={'id':['share-email']})]
    no_javascript = True
    no_stylesheets = True
	
    cover_url             = 'http://images.businessweek.com/mz/covers/current_120x160.jpg'

    def parse_index(self):

        #Go to the issue
        soup = self.index_to_soup('http://www.businessweek.com/magazine/news/articles/business_news.htm')

        #Find date
        mag=soup.find('h2',text='Magazine')
        self.log(mag)
        dates=self.tag_to_string(mag.findNext('h3'))
        self.timefmt = u' [%s]'%dates

        #Go to the main body
        div0 = soup.find ('div', attrs={'class':'column left'})
        section_title = ''
        feeds = OrderedDict()
        for div in div0.findAll('h4'):
            articles = []
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div.a).strip()
            url=div.a['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('li', attrs={'class':'print'}).a['href']
            articles.append({'title':title, 'url':urlprint, 'description':'', 'date':''})

            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        div1 = soup.find ('div', attrs={'class':'column center'})
        section_title = ''
        for div in div1.findAll('h5'):
            articles = []
            desc=self.tag_to_string(div.findNext('p')).strip()
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div.a).strip()
            url=div.a['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('li', attrs={'class':'print'}).a['href']
            articles.append({'title':title, 'url':urlprint, 'description':desc, 'date':''})

            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        ans = [(key, val) for key, val in feeds.iteritems()]
        return ans
Old 01-17-2013, 09:28 PM   #4
dhiru
Connoisseur
Posts: 83
Karma: 10
Join Date: Aug 2009
Device: iphone, Irex iliad, sony prs950, kindle Dx, Ipad
How do I use it? It's giving an error and not downloading anything.
Thanks for your efforts.
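
On the "how to use it" part: one common way to run a custom recipe is to save the code to a file ending in .recipe and pass it to calibre's ebook-convert command-line tool (the GUI equivalent is adding it as a custom news source under Fetch news, though the exact menu wording varies by version). The sketch below only builds and prints the command. The file names are placeholders, and the helper names build_fetch_command and fetch are made up for illustration.

```python
import subprocess


def build_fetch_command(recipe_path, output_path, test=False):
    """Build an ebook-convert command line for a recipe file."""
    cmd = ['ebook-convert', recipe_path, output_path]
    if test:
        # --test fetches only a small sample of articles; -vv makes the log verbose.
        cmd += ['--test', '-vv']
    return cmd


def fetch(recipe_path, output_path, test=False):
    """Run the download. Requires calibre's command-line tools on PATH."""
    subprocess.check_call(build_fetch_command(recipe_path, output_path, test))


# Placeholder file names; adjust to wherever the recipe was saved.
command = build_fetch_command('business_week.recipe', 'business_week.epub', test=True)
print(' '.join(command))
```

Running with the test flag first is a quick way to see whether the recipe still matches the website before doing a full download.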
Old 01-17-2013, 09:54 PM   #5
rainrdx
Thanks for letting me know about the issue. BW changed the page a little. Here is the fix:

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class BusinessWeekMagazine(BasicNewsRecipe):

    title       = 'Business Week Magazine'
    __author__  = 'Rick Shang'

    description = 'A renowned business publication. Business news, trends and profiles of successful businesspeople.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [
			dict(name='div', attrs={'id':'article_body_container'}),
			]
    remove_tags = [dict(name='ui'),dict(name='li'),dict(name='div', attrs={'id':['share-email']})]
    no_javascript = True
    no_stylesheets = True
	
    cover_url             = 'http://images.businessweek.com/mz/covers/current_120x160.jpg'

    def parse_index(self):

        #Go to the issue
        soup = self.index_to_soup('http://www.businessweek.com/magazine/news/articles/business_news.htm')

        #Find date
        mag=soup.find('h2',text='Magazine')
        self.log(mag)
        dates=self.tag_to_string(mag.findNext('h3'))
        self.timefmt = u' [%s]'%dates

        #Go to the main body
        div0 = soup.find ('div', attrs={'class':'column left'})
        section_title = ''
        feeds = OrderedDict()
        for div in div0.findAll('h4'):
            articles = []
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div.a).strip()
            url=div.a['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('li', attrs={'class':'print tracked'}).a['href']
            articles.append({'title':title, 'url':urlprint, 'description':'', 'date':''})

            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        div1 = soup.find ('div', attrs={'class':'column center'})
        section_title = ''
        for div in div1.findAll('h5'):
            articles = []
            desc=self.tag_to_string(div.findNext('p')).strip()
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div.a).strip()
            url=div.a['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('li', attrs={'class':'print tracked'}).a['href']
            articles.append({'title':title, 'url':urlprint, 'description':desc, 'date':''})

            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        ans = [(key, val) for key, val in feeds.iteritems()]
        return ans
Old 01-18-2013, 02:49 AM   #6
dhiru
Thanks, it's working now.
Old 01-26-2013, 05:29 AM   #7
Mixx
Zealot
Posts: 143
Karma: 387
Join Date: Sep 2010
Device: Kindle 3
I am sorry, but it stopped working for me as of this week. I am getting this error log:

Spoiler:

Fetch news from Business Week Magazine
Failed to initialize plugin: u'C:\\Users\\Mixx\\AppData\\Roaming\\calibre\\plugins\\Amazon German.zip'
Resolved conversion options
calibre version: 0.9.14
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0,
'book_producer': None,
'change_justification': 'original',
'chapter': None,
'chapter_mark': 'pagebreak',
'comments': None,
'cover': None,
'debug_pipeline': None,
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': False,
'dont_compress': False,
'dont_download_recipe': False,
'duplicate_links_in_toc': False,
'embed_font_family': None,
'enable_heuristics': False,
'extra_css': None,
'extract_to': None,
'filter_css': None,
'fix_indents': True,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x0000000004F64B70>,
'insert_blank_line': False,
'insert_blank_line_size': 0.5,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0,
'linearize_tables': False,
'lrf': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': True,
'max_toc_links': 50,
'minimum_line_height': 120.0,
'mobi_file_type': 'old',
'mobi_ignore_margins': False,
'mobi_keep_original_images': False,
'mobi_toc_at_start': False,
'no_chapters_in_toc': False,
'no_inline_navbars': True,
'no_inline_toc': False,
'output_profile': <calibre.customize.profiles.KindleOutput object at 0x0000000004F67160>,
'page_breaks_before': None,
'personal_doc': '[PDOC]',
'prefer_author_sort': False,
'prefer_metadata_cover': False,
'pretty_print': False,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': None,
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': True,
'replace_scene_breaks': '',
'search_replace': None,
'series': None,
'series_index': None,
'share_not_sync': False,
'smarten_punctuation': False,
'sr1_replace': '',
'sr1_search': '',
'sr2_replace': '',
'sr2_search': '',
'sr3_replace': '',
'sr3_search': '',
'start_reading_at': None,
'subset_embedded_fonts': False,
'tags': None,
'test': False,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'toc_title': None,
'unsmarten_punctuation': False,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
InputFormatPlugin: Recipe Input running
Using custom recipe
Magazine
Python function terminated unexpectedly
'NoneType' object has no attribute 'a' (Error Code: 1)
Traceback (most recent call last):
File "site.py", line 132, in main
File "site.py", line 109, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 186, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 25, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 1009, in run
File "site-packages\calibre\customize\conversion.py", line 239, in __call__
File "site-packages\calibre\ebooks\conversion\plugins\recipe_input.py", line 109, in convert
File "site-packages\calibre\web\feeds\news.py", line 891, in download
File "site-packages\calibre\web\feeds\news.py", line 1058, in build_index
File "<string>", line 44, in parse_index
AttributeError: 'NoneType' object has no attribute 'a'


Thanks for looking at it.

Cheers, Mixx
Old 01-26-2013, 04:36 PM   #8
rainrdx
Quote:
Originally Posted by Mixx View Post
I am sorry, it stopped working for me as of this week, I am getting this error log

Thanks for looking at it.

Cheers, Mixx
I have just tried it myself and it seems to be working fine. Have you updated to the most recent version of calibre?
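
For anyone reading the traceback: line 44 of the recipe posted in #5 is the soup0.find('li', ...).a['href'] lookup, so the AttributeError means the find call returned None, i.e. no matching print link was found on that article page. Whatever the underlying cause, a lookup that falls back to the regular article URL avoids the crash. Below is a rough standalone sketch of that idea using only the standard library. The regex, the function name pick_article_url and the sample HTML are illustrative assumptions, not what the recipe itself does (the recipe works on a parsed soup, not raw HTML).

```python
import re

# Illustrative pattern: an <a> tag whose href contains "printer".
PRINTER_LINK = re.compile(r'<a[^>]+href="([^"]*printer[^"]*)"', re.IGNORECASE)


def pick_article_url(page_html, article_url):
    """Return the printer-friendly URL if the page has one, else article_url."""
    match = PRINTER_LINK.search(page_html)
    if match is None:
        # No print link on this page: keep the regular URL instead of failing.
        return article_url
    return match.group(1)


# Hypothetical sample pages.
with_link = '<li class="print"><a href="http://example.com/printer/articles/1">Print</a></li>'
without_link = '<p>No print link here</p>'

print(pick_article_url(with_link, 'http://example.com/articles/1'))
print(pick_article_url(without_link, 'http://example.com/articles/1'))
```

The trade-off is that an article without a print link is downloaded from its regular page, so its layout may be less clean than the printer-friendly version.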
Old 01-27-2013, 05:07 PM   #9
Mixx
Now I am and now it is working again. Thanks a million and I apologize for the bother. I should have tried the latest version first.

Thanxx, Mixx
Old 01-27-2013, 11:26 PM   #10
rainrdx
Quote:
Originally Posted by Mixx View Post
Now I am and now it is working again. Thanks a million and I apologize for the bother. I should have tried the latest version first.

Thanxx, Mixx
Not a problem at all. I'm glad it works, but do lemme know if it fails. I really wanna keep the recipe working
Old 01-28-2013, 04:35 PM   #11
Mixx
Thank you for that, Rainrdx, much appreciated!
I certainly enjoy this recipe very much!

Thanxx for making it available!

Cheers, Mixx
Old 03-25-2013, 05:26 PM   #12
rainrdx
Update: Fixes the missing article issue

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class BusinessWeekMagazine(BasicNewsRecipe):

    title       = 'Business Week Magazine'
    __author__  = 'Rick Shang'

    description = 'A renowned business publication. Business news, trends and profiles of successful businesspeople.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [
			dict(name='div', attrs={'id':'article_body_container'}),
			]
    remove_tags = [dict(name='ui'),dict(name='li'),dict(name='div', attrs={'id':['share-email']})]
    no_javascript = True
    no_stylesheets = True
	
    cover_url             = 'http://images.businessweek.com/mz/covers/current_120x160.jpg'

    def parse_index(self):

        #Go to the issue
        soup = self.index_to_soup('http://www.businessweek.com/magazine/news/articles/business_news.htm')

        #Find date
        mag=soup.find('h2',text='Magazine')
        self.log(mag)
        dates=self.tag_to_string(mag.findNext('h3'))
        self.timefmt = u' [%s]'%dates

        #Go to the main body
        div0 = soup.find ('div', attrs={'class':'column left'})
        section_title = ''
        feeds = OrderedDict()
        for div in div0.findAll(['h4','h5']):
            articles = []
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div.a).strip()
            url=div.a['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('li', attrs={'class':'print tracked'}).a['href']
            articles.append({'title':title, 'url':urlprint, 'description':'', 'date':''})

            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        div1 = soup.find ('div', attrs={'class':'column center'})
        section_title = ''
        for div in div1.findAll(['h4','h5']):
            articles = []
            desc=self.tag_to_string(div.findNext('p')).strip()
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div.a).strip()
            url=div.a['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('li', attrs={'class':'print tracked'}).a['href']
            articles.append({'title':title, 'url':urlprint, 'description':desc, 'date':''})

            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        ans = [(key, val) for key, val in feeds.iteritems()]
        return ans
Old 04-05-2013, 09:39 PM   #13
rainrdx
Update: Fixes due to minor changes in the website.
Now I've remembered to use the most recent code.

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class BusinessWeekMagazine(BasicNewsRecipe):

    title       = 'Business Week Magazine'
    __author__  = 'Rick Shang'

    description = 'A renowned business publication. Business news, trends and profiles of successful businesspeople.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [
            dict(name='div', attrs={'id':'article_body_container'}),
            ]
    remove_tags = [dict(name='ui'),dict(name='li'),dict(name='div', attrs={'id':['share-email']})]
    no_javascript = True
    no_stylesheets = True

    cover_url             = 'http://images.businessweek.com/mz/covers/current_120x160.jpg'

    def parse_index(self):
        #Go to the issue
        soup = self.index_to_soup('http://www.businessweek.com/magazine/news/articles/business_news.htm')

        #Find date
        mag=soup.find('h2',text='Magazine')
        self.log(mag)
        dates=self.tag_to_string(mag.findNext('h3'))
        self.timefmt = u' [%s]'%dates

        #Go to the main body
        div0 = soup.find ('div', attrs={'class':'column left'})
        section_title = ''
        feeds = OrderedDict()
        for div in div0.findAll(['h4','h5']):
            articles = []
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div.a).strip()
            url=div.a['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('a', attrs={'href':re.compile('.*printer.*')})['href']
            articles.append({'title':title, 'url':urlprint, 'description':'', 'date':''})


            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        div1 = soup.find ('div', attrs={'class':'column center'})
        section_title = ''
        for div in div1.findAll(['h4','h5']):
            articles = []
            desc=self.tag_to_string(div.findNext('p')).strip()
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div.a).strip()
            url=div.a['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('a', attrs={'href':re.compile('.*printer.*')})['href']
            articles.append({'title':title, 'url':urlprint, 'description':desc, 'date':''})

            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        ans = [(key, val) for key, val in feeds.iteritems()]
        return ans
Old 04-26-2013, 08:14 PM   #14
rainrdx
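Update: fixes for the stupid website changes. The recipe now also keeps the story_body div, and it falls back to the regular article URL when no printer-friendly link is found.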

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class BusinessWeekMagazine(BasicNewsRecipe):

    title       = 'Business Week Magazine'
    __author__  = 'Rick Shang'

    description = 'A renowned business publication. Business news, trends and profiles of successful businesspeople.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [
            dict(name='div', attrs={'id':['article_body_container','story_body']}),
            ]
    remove_tags = [dict(name='ui'),dict(name='li'),dict(name='div', attrs={'id':['share-email']})]
    no_javascript = True
    no_stylesheets = True

    cover_url             = 'http://images.businessweek.com/mz/covers/current_120x160.jpg'

    def parse_index(self):
        #Go to the issue
        soup = self.index_to_soup('http://www.businessweek.com/magazine/news/articles/business_news.htm')

        #Find date
        mag=soup.find('h2',text='Magazine')
        dates=self.tag_to_string(mag.findNext('h3'))
        self.timefmt = u' [%s]'%dates

        #Go to the main body
        div0 = soup.find ('div', attrs={'class':'column left'})
        section_title = ''
        feeds = OrderedDict()
        for div in div0.findAll('a', attrs={'class': None}):
            articles = []
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div).strip()
            url=div['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('a', attrs={'href':re.compile('.*printer.*')})
            if urlprint is not None:
                url=urlprint['href']
            articles.append({'title':title, 'url':url, 'description':'', 'date':''})


            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles
        div1 = soup.find ('div', attrs={'class':'column center'})
        section_title = ''
        for div in div1.findAll('a'):
            articles = []
            desc=self.tag_to_string(div.findNext('p')).strip()
            section_title = self.tag_to_string(div.findPrevious('h3')).strip()
            title=self.tag_to_string(div).strip()
            url=div['href']
            soup0 = self.index_to_soup(url)
            urlprint=soup0.find('a', attrs={'href':re.compile('.*printer.*')})
            if urlprint is not None:
                url=urlprint['href']
            articles.append({'title':title, 'url':url, 'description':desc, 'date':''})
            if articles:
                if section_title not in feeds:
                    feeds[section_title] = []
                feeds[section_title] += articles


        ans = [(key, val) for key, val in feeds.iteritems()]
        return ans
Old 09-06-2013, 06:48 AM   #15
garyzeb55
Junior Member
Posts: 6
Karma: 10
Join Date: Mar 2013
Device: Kindle Touch
Not working again. Any help is greatly appreciated. Thanks.