09-02-2012, 07:24 AM   #1
eroche
Device: sony ereader
Help to debug and download pages from the web

Hi, I am new to Calibre and am trying to set up a recipe that downloads the headlines from a news site, but the recipe is not working and I am at a loss as to how to debug it.
The logic in my code is to open a web page, get the headlines and their links using BeautifulSoup, and then have Calibre fetch the content from those headline pages.
It seems to be nearly working, but the output is a bit odd. The text on some pages of the generated ebook is cut off, and when I transfer it to my ereader all the articles are blank; it just shows the table of contents.
I don't know how to debug the output. If I use the --test option I get output, but I still don't understand what's wrong. Any help would be appreciated. My code is below, followed by the output from the command prompt:

Code:
from BeautifulSoup import BeautifulSoup
import urllib2
from calibre.web.feeds.news import BasicNewsRecipe

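# Module-level lists; a 'global' statement is not needed at module scope.
recipeUrl = []  # filled with (title, url) tuples while the class body runs
startUrl = []   # unused here; the class defines its own startUrl below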

class RTE(BasicNewsRecipe):
	title = 'RTE links in Ebook Format'
	description = 'Headlines from Ireland and International'
	__author__  = 'Edward Roche'
	language = 'en'
	
	startUrl = [('http://www.rte.ie/news')]
			
	for link in startUrl:
		soup = BeautifulSoup(urllib2.urlopen(link).read())
		address= soup.find('div', {'id' : "more-headlines"})
		while address.findNext('dd'):
			address = address.findNext('dd')
			if not address.text == 'Full News Index':
				recipeUrl.append( (address.text , 'http://www.rte.ie' + address.a['href']))
	
	def parse_index(self):
		feeds = []
		articles = self.RTE_parse_section('')
		feeds.append(('News headlines', articles))
		return feeds
		
	def RTE_parse_section(self, link):
		current_articles = []
		for file in recipeUrl:
			current_articles.append({'title': file[0], 'url': file[1], 'description':'', 'date':''})
		return current_articles
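
For reference, the headline-gathering step described above can be tried on its own, outside Calibre. This is only a minimal sketch under several assumptions: it uses the Python 3 standard library (`html.parser`) in place of BeautifulSoup, the sample HTML and its paths are made up for illustration, and `HeadlineParser` and `headlines_to_articles` are hypothetical names, not part of Calibre. Unlike the `findNext('dd')` loop above, which as far as I can tell keeps walking past the end of the `more-headlines` div, this version only collects `dd` entries inside that container.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HeadlineParser(HTMLParser):
    """Collect (title, href) pairs from <dd><a> entries inside one div."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0       # > 0 while inside the container div
        self.in_dd = False   # True while inside a <dd> in the container
        self.href = None     # first link seen in the current <dd>
        self.text = []       # text fragments of the current <dd>
        self.links = []      # collected (title, href) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self.depth:
                self.depth += 1          # nested div inside the container
            elif attrs.get('id') == self.container_id:
                self.depth = 1           # entering the container
        elif self.depth and tag == 'dd':
            self.in_dd, self.href, self.text = True, None, []
        elif self.in_dd and tag == 'a' and self.href is None:
            self.href = attrs.get('href')

    def handle_data(self, data):
        if self.in_dd:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1
            if not self.depth:
                self.in_dd = False       # left the container
        elif tag == 'dd' and self.in_dd:
            title = ''.join(self.text).strip()
            if title and self.href:
                self.links.append((title, self.href))
            self.in_dd = False


def headlines_to_articles(html, base_url, container_id='more-headlines',
                          skip=('Full News Index',)):
    """Return article dicts shaped like the ones built in RTE_parse_section."""
    parser = HeadlineParser(container_id)
    parser.feed(html)
    return [
        {'title': title, 'url': urljoin(base_url, href),
         'description': '', 'date': ''}
        for title, href in parser.links
        if title not in skip
    ]


# Made-up sample page; the real markup on the site may differ.
SAMPLE_HTML = '''
<div id="more-headlines">
  <dl>
    <dd><a href="/news/example-one.html">First headline</a></dd>
    <dd><a href="/news/example-two.html">Second headline</a></dd>
    <dd><a href="/news/index.html">Full News Index</a></dd>
  </dl>
</div>
<dl><dd><a href="/sport">Sport</a></dd></dl>
'''

articles = headlines_to_articles(SAMPLE_HTML, 'http://www.rte.ie')
for article in articles:
    print(article['title'], article['url'])
```

Running the list-building step like this makes it possible to check which titles and URLs would be handed to Calibre before any article download starts.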



Code:
D:\My Python Sample Code\Calibre Recipes>ebook-convert rte2.recipe rte.epub --test
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
1% Fetching feeds...
1% Got feeds from index page
1% Trying to download cover...
1% Generating masthead...
Synthesizing mastheadImage
1% Starting download [4 thread(s)]...
Traceback (most recent call last):
Traceback (most recent call last):
  File "site-packages\calibre\web\fetch\simple.py", line 371, in process_images
  File "site-packages\calibre\web\fetch\simple.py", line 371, in process_images
  File "site-packages\PIL\Image.py", line 1982, in open
  File "site-packages\PIL\Image.py", line 1982, in open
IOError: cannot identify image file
IOError: cannot identify image file
17% Article downloaded: Taoiseach congratulates Paralympic winners
34% Article downloaded: Daly leaves Socialist Party after Wallace row
34% Feeds downloaded to C:\Users\Edward\AppData\Local\Temp\calibre_0.8.50_tmp_rkz6ug\dkga2y_plumber\index.html
34% Download finished
Parsing all content...
Forcing index.html into XHTML namespace
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:small;'. [6:2: *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font:x-small;'. [8:2: *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:108%;'. [34:2: *]
Forcing feed_0/index.html into XHTML namespace
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:small;'. [6:2: *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font:x-small;'. [8:2: *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:108%;'. [34:2: *]
Referenced file u'feed_0/article_0/images2010/bg_article_tab_group.jpg' not found
Referenced file '/rtejr' not found
Referenced file u'feed_0/article_0/images2010/sprite_main.png' not found
Referenced file u'feed_0/article_1/images2010/bg_subhead_presidential_election_2011.png' not found
Referenced file '/lifestyle/fashion' not found
Referenced file u'feed_0/article_1/recommend' not found
Referenced file '/news/ireland.html' not found
Referenced file '/about/en/policies-and-reports/policies-guidelines/2012/0417/317412-rte-data-protection-' not found
Referenced file u'feed_1/index.html' not found
Referenced file '/live' not found
Referenced file '/news/2012/0901/paralympics-gold.html' not found
Referenced file '/lotto' not found
Referenced file '/news/politics.html' not found
Referenced file u'feed_0/article_0/images2010/bg_audio_list_item.jpg' not found
Referenced file u'feed_0/article_0/images2010/bg_subhead_vote_2011.png' not found
Referenced file '/news/business.html' not found
Referenced file u'feed_0/article_0/images2010/header_frontline.jpg' not found
Referenced file u'/news/images2010/news-top-referendum.png' not found
Referenced file '/lifestyle/motors' not found
Referenced file '/news/2012/0901/missing-sisters.html' not found
Referenced file '/about/en/information-and-feedback/contact-rte' not found
Referenced file '/about/en/information-and-feedback/complaints' not found
Referenced file '/news/election2011' not found
Referenced file '/ten/guide.html' not found
Referenced file '/news/player_live.html' not found
Referenced file '/player' not found
Referenced file '/aertel' not found
Referenced file '/lifestyle/food' not found
Referenced file '/shop' not found
Referenced file '/news/2012/0901/clare-daly-socialist-party.html' not found
Referenced file '/lifestyle' not found
Referenced file '/lifestyle/homes' not found
Referenced file '/news' not found
Referenced file '/news/index.html' not found
Referenced file '/dating' not found
Referenced file '/about/en/serving-our-audience/2012/0815/333706-terms-and-conditions-for-rte-ie' not found
Referenced file '/news/player.html' not found
Referenced file '/news/fiscal-treaty.html' not found
Referenced file u'feed_0/article_1/images2010/article_video_thumbnail_overlay.png' not found
Referenced file '/news/2012/0901/mary-coughlan-husband-death.html' not found
Referenced file '/' not found
Referenced file u'feed_0/article_1/images2010/header_frontline.jpg' not found
Referenced file u'/news/images2010/sprite_main.png' not found
Referenced file '/business' not found
Referenced file u'feed_0/article_0/recommend' not found
Referenced file '/ten' not found
Referenced file u'/news/images2010/bg_video_container.gif' not found
Referenced file u'feed_0/article_0/images2010/article_audio_thumbnail_overlay.png' not found
Referenced file '/sport' not found
Referenced file '/news/911' not found
Referenced file u'feed_0/article_0/images2010/bg_subhead_presidential_election_2011.png' not found
Referenced file '/about/en/working-with-rte/vacancies' not found
Referenced file '/news/presidentialelection.html' not found
Referenced file '/about/en/policies-and-reports/policies-guidelines/2012/0417/317440-rte-privacy-statement' not found
Referenced file '/news/galleries.html' not found
Referenced file u'feed_0/article_1/stylesheets/bg_filter_icon_l.gif' not found
Referenced file '/about/en/how-rte-is-run/2012/0221/291618-the-license-fee' not found
Referenced file u'feed_0/article_1/images2010/bg_audio_list_item.jpg' not found
Referenced file u'/news/images2010/player-loader.gif' not found
Referenced file '/news/vote2011' not found
Referenced file '/newsnow' not found
Referenced file '/extra/apps.html' not found
Referenced file '/archives' not found
Referenced file u'feed_0/article_1/images2010/sprite_main.png' not found
Referenced file '/about' not found
Referenced file u'feed_0/article_0/images2010/bg_breaking_news.jpg' not found
Referenced file '/news/2012/0901/american-football-tourists.html' not found
Referenced file '/tv' not found
Referenced file '/news/search_results.html%3fquery%3d%22clare%20daly%22' not found
Referenced file u'feed_0/article_1/images2010/september_11th_bg.png' not found
Referenced file u'feed_0/article_0/images2010/sprite_subscribe.png' not found
Referenced file '/jobs' not found
Referenced file '/about/en/working-with-rte/2012/0727/330809-advertisers' not found
Referenced file '/news/world.html' not found
Referenced file u'feed_0/article_0/images2010/article_video_thumbnail_overlay.png' not found
Referenced file '/about/en/policies-and-reports/annual-reports' not found
Referenced file u'feed_0/article_0/images2010/bg_nav_current.jpg' not found
Referenced file '/news/2012/0901/man-appears-before-special-sitting-of-dundalk-ct.html' not found
Referenced file u'feed_0/article_0/images2010/bg_photo_count.png' not found
Referenced file '/news/2012/0901/croke-park-agreement.html' not found
Referenced file '/news/money/index.html' not found
Referenced file u'/news/election2011/images/masthead1.png' not found
Referenced file u'feed_0/article_1/images2010/bg_photo_count.png' not found
Referenced file u'feed_0/article_1/images2010/bg_nav_current.jpg' not found
Referenced file '/lifestyle/travel' not found
Referenced file u'feed_0/article_1/images2010/bg_subhead_vote_2011.png' not found
Referenced file '/weather' not found
Referenced file '/news/business' not found
Referenced file u'feed_0/article_0/images2010/bg_header.jpg' not found
Referenced file u'/images/ajax-loader.gif' not found
Referenced file u'feed_0/article_0/images2010/september_11th_bg.png' not found
Referenced file '/emails' not found
Referenced file '/news/search_results.html%3fquery%3d%22paralympics%22' not found
Referenced file '/radio' not found
Referenced file u'feed_0/article_1/images2010/bg_header.jpg' not found
Referenced file u'feed_0/article_1/images2010/article_audio_thumbnail_overlay.png' not found
Referenced file '/news/special-reports.html' not found
Referenced file u'feed_0/article_1/images2010/sprite_subscribe.png' not found
Referenced file '/about/en/policies-and-reports/policies-guidelines/2012/0815/333711-user-comments-terms-of-use' not found
Referenced file '/trte' not found
Referenced file '/news/2012/0902/former-leaders-should-face-charges-over-iraq-tutu.html' not found
Referenced file u'feed_0/article_1/images2010/bg_article_tab_group.jpg' not found
Referenced file u'feed_0/article_1/images2010/bg_breaking_news.jpg' not found
Referenced file '/performinggroups' not found
Referenced file u'feed_0/article_0/stylesheets/bg_filter_icon_l.gif' not found
Referenced file '/news/search_results.html%3fquery%3d%22enda%20kenny%22' not found
34% Running transforms on ebook...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 8.70480pt
Removing fake margins...
Removing level div_1 left margin of: auto
Removing level div_1 right margin of: auto
Removing level div_2 left margin of: 7px
Removing level div_2 right margin of: 7px
Cleaning up manifest...
Trimming unused files from manifest...
Trimming u'feed_0/article_0/images/img22.jpg' from manifest
Trimming u'feed_0/article_0/images/img1.jpg' from manifest
Trimming u'feed_0/article_0/images/img21.jpg' from manifest
Trimming u'feed_0/article_1/images/img11.jpg' from manifest
Trimming u'feed_0/article_1/images/img1.jpg' from manifest
Trimming u'feed_0/article_1/images/img31.jpg' from manifest
Trimming u'feed_0/article_0/images/img2.jpg' from manifest
Trimming u'feed_0/article_1/images/img32.jpg' from manifest
Trimming u'feed_0/article_1/images/img2.jpg' from manifest
Trimming u'feed_0/article_0/images/img4.jpg' from manifest
Creating EPUB Output...
67% Creating EPUB Output
Found non-unique filenames, renaming to support broken EPUB readers like FBReader, Aldiko and Stanza...
Splitting markup on page breaks and flow limits, if any...
        Looking for large trees in feed_0/article_0/index.html...
        No large trees found
        Looking for large trees in index_u2.html...
        No large trees found
        Looking for large trees in feed_0/index_u3.html...
        No large trees found
        Looking for large trees in feed_0/article_1/index_u1.html...
        No large trees found
The cover image has an id != "cover". Renaming to work around bug in Nook Color
EPUB output written to D:\My Python Sample Code\Calibre Recipes\rte.epub
Output saved to   D:\My Python Sample Code\Calibre Recipes\rte.epub