Help to debug and download pages from the web

eroche · 09-02-2012, 06:24 AM

Hi, I am new to Calibre and I am trying to set up a recipe to download the headlines from a news site, but the recipe is not working and I am at a loss as to how to debug it.
The logic in my code is to open a web page, get the headlines and their links using BeautifulSoup, and then use calibre to fetch the content from those headline pages.
It seems to be nearly working, but the output is a bit odd. The text on some pages of the generated ebook is cut off, and when I transfer it to my ebook reader all the articles are blank; it just shows the table of contents.
I don't know how to debug the output. If I use the --test option I get output, but I still don't understand what is wrong. Any help would be appreciated. My code is below, followed by the output from the command prompt:

Code:

from BeautifulSoup import BeautifulSoup
import urllib2
from calibre.web.feeds.news import BasicNewsRecipe

global recipeUrl
recipeUrl = []
global startUrl
startUrl = []

class RTE(BasicNewsRecipe):
	title = 'RTE links in Ebook Format'
	description = 'Headlines from Ireland and International'
	__author__  = 'Edward Roche'
	language = 'en'
	
	startUrl = [('http://www.rte.ie/news')]
			
	for link in startUrl:
		soup = BeautifulSoup(urllib2.urlopen(link).read())
		address= soup.find('div', {'id' : "more-headlines"})
		while address.findNext('dd'):
			address = address.findNext('dd')
			if not address.text == 'Full News Index':
				recipeUrl.append( (address.text , 'http://www.rte.ie' + address.a['href']))
	
	def parse_index(self):
		feeds = []
		articles = self.RTE_parse_section('')
		feeds.append(('News headlines', articles))
		return feeds
		
	def RTE_parse_section(self, link):
		current_articles = []
		for file in recipeUrl:
			current_articles.append({'title': file[0], 'url': file[1], 'description':'', 'date':''})
		return current_articles

Code:

D:\My Python Sample Code\Calibre Recipes>ebook-convert rte2.recipe rte.epub --te
st
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
1% Fetching feeds...
1% Got feeds from index page
1% Trying to download cover...
1% Generating masthead...
Synthesizing mastheadImage
1% Starting download [4 thread(s)]...
Traceback (most recent call last):
Traceback (most recent call last):
  File "site-packages\calibre\web\fetch\simple.py", line 371, in process_images
  File "site-packages\calibre\web\fetch\simple.py", line 371, in process_images
  File "site-packages\PIL\Image.py", line 1982, in open
  File "site-packages\PIL\Image.py", line 1982, in open
IOError: cannot identify image file
IOError: cannot identify image file
17% Article downloaded: Taoiseach congratulates Paralympic winners
34% Article downloaded: Daly leaves Socialist Party after Wallace row
34% Feeds downloaded to C:\Users\Edward\AppData\Local\Temp\calibre_0.8.50_tmp_rk
z6ug\dkga2y_plumber\index.html
34% Download finished
Parsing all content...
Forcing index.html into XHTML namespace
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:small;'. [6:2:
 *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font:x-small;'. [8:2: *]

CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:108%;'. [34:2:
 *]
Forcing feed_0/index.html into XHTML namespace
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:small;'. [6:2:
 *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font:x-small;'. [8:2: *]

CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:108%;'. [34:2:
 *]
Referenced file u'feed_0/article_0/images2010/bg_article_tab_group.jpg' not foun
d
Referenced file '/rtejr' not found
Referenced file u'feed_0/article_0/images2010/sprite_main.png' not found
Referenced file u'feed_0/article_1/images2010/bg_subhead_presidential_election_2
011.png' not found
Referenced file '/lifestyle/fashion' not found
Referenced file u'feed_0/article_1/recommend' not found
Referenced file '/news/ireland.html' not found
Referenced file '/about/en/policies-and-reports/policies-guidelines/2012/0417/31
7412-rte-data-protection-' not found
Referenced file u'feed_1/index.html' not found
Referenced file '/live' not found
Referenced file '/news/2012/0901/paralympics-gold.html' not found
Referenced file '/lotto' not found
Referenced file '/news/politics.html' not found
Referenced file u'feed_0/article_0/images2010/bg_audio_list_item.jpg' not found
Referenced file u'feed_0/article_0/images2010/bg_subhead_vote_2011.png' not foun
d
Referenced file '/news/business.html' not found
Referenced file u'feed_0/article_0/images2010/header_frontline.jpg' not found
Referenced file u'/news/images2010/news-top-referendum.png' not found
Referenced file '/lifestyle/motors' not found
Referenced file '/news/2012/0901/missing-sisters.html' not found
Referenced file '/about/en/information-and-feedback/contact-rte' not found
Referenced file '/about/en/information-and-feedback/complaints' not found
Referenced file '/news/election2011' not found
Referenced file '/ten/guide.html' not found
Referenced file '/news/player_live.html' not found
Referenced file '/player' not found
Referenced file '/aertel' not found
Referenced file '/lifestyle/food' not found
Referenced file '/shop' not found
Referenced file '/news/2012/0901/clare-daly-socialist-party.html' not found
Referenced file '/lifestyle' not found
Referenced file '/lifestyle/homes' not found
Referenced file '/news' not found
Referenced file '/news/index.html' not found
Referenced file '/dating' not found
Referenced file '/about/en/serving-our-audience/2012/0815/333706-terms-and-condi
tions-for-rte-ie' not found
Referenced file '/news/player.html' not found
Referenced file '/news/fiscal-treaty.html' not found
Referenced file u'feed_0/article_1/images2010/article_video_thumbnail_overlay.pn
g' not found
Referenced file '/news/2012/0901/mary-coughlan-husband-death.html' not found
Referenced file '/' not found
Referenced file u'feed_0/article_1/images2010/header_frontline.jpg' not found
Referenced file u'/news/images2010/sprite_main.png' not found
Referenced file '/business' not found
Referenced file u'feed_0/article_0/recommend' not found
Referenced file '/ten' not found
Referenced file u'/news/images2010/bg_video_container.gif' not found
Referenced file u'feed_0/article_0/images2010/article_audio_thumbnail_overlay.pn
g' not found
Referenced file '/sport' not found
Referenced file '/news/911' not found
Referenced file u'feed_0/article_0/images2010/bg_subhead_presidential_election_2
011.png' not found
Referenced file '/about/en/working-with-rte/vacancies' not found
Referenced file '/news/presidentialelection.html' not found
Referenced file '/about/en/policies-and-reports/policies-guidelines/2012/0417/31
7440-rte-privacy-statement' not found
Referenced file '/news/galleries.html' not found
Referenced file u'feed_0/article_1/stylesheets/bg_filter_icon_l.gif' not found
Referenced file '/about/en/how-rte-is-run/2012/0221/291618-the-license-fee' not
found
Referenced file u'feed_0/article_1/images2010/bg_audio_list_item.jpg' not found
Referenced file u'/news/images2010/player-loader.gif' not found
Referenced file '/news/vote2011' not found
Referenced file '/newsnow' not found
Referenced file '/extra/apps.html' not found
Referenced file '/archives' not found
Referenced file u'feed_0/article_1/images2010/sprite_main.png' not found
Referenced file '/about' not found
Referenced file u'feed_0/article_0/images2010/bg_breaking_news.jpg' not found
Referenced file '/news/2012/0901/american-football-tourists.html' not found
Referenced file '/tv' not found
Referenced file '/news/search_results.html%3fquery%3d%22clare%20daly%22' not fou
nd
Referenced file u'feed_0/article_1/images2010/september_11th_bg.png' not found
Referenced file u'feed_0/article_0/images2010/sprite_subscribe.png' not found
Referenced file '/jobs' not found
Referenced file '/about/en/working-with-rte/2012/0727/330809-advertisers' not fo
und
Referenced file '/news/world.html' not found
Referenced file u'feed_0/article_0/images2010/article_video_thumbnail_overlay.pn
g' not found
Referenced file '/about/en/policies-and-reports/annual-reports' not found
Referenced file u'feed_0/article_0/images2010/bg_nav_current.jpg' not found
Referenced file '/news/2012/0901/man-appears-before-special-sitting-of-dundalk-c
t.html' not found
Referenced file u'feed_0/article_0/images2010/bg_photo_count.png' not found
Referenced file '/news/2012/0901/croke-park-agreement.html' not found
Referenced file '/news/money/index.html' not found
Referenced file u'/news/election2011/images/masthead1.png' not found
Referenced file u'feed_0/article_1/images2010/bg_photo_count.png' not found
Referenced file u'feed_0/article_1/images2010/bg_nav_current.jpg' not found
Referenced file '/lifestyle/travel' not found
Referenced file u'feed_0/article_1/images2010/bg_subhead_vote_2011.png' not foun
d
Referenced file '/weather' not found
Referenced file '/news/business' not found
Referenced file u'feed_0/article_0/images2010/bg_header.jpg' not found
Referenced file u'/images/ajax-loader.gif' not found
Referenced file u'feed_0/article_0/images2010/september_11th_bg.png' not found
Referenced file '/emails' not found
Referenced file '/news/search_results.html%3fquery%3d%22paralympics%22' not foun
d
Referenced file '/radio' not found
Referenced file u'feed_0/article_1/images2010/bg_header.jpg' not found
Referenced file u'feed_0/article_1/images2010/article_audio_thumbnail_overlay.pn
g' not found
Referenced file '/news/special-reports.html' not found
Referenced file u'feed_0/article_1/images2010/sprite_subscribe.png' not found
Referenced file '/about/en/policies-and-reports/policies-guidelines/2012/0815/33
3711-user-comments-terms-of-use' not found
Referenced file '/trte' not found
Referenced file '/news/2012/0902/former-leaders-should-face-charges-over-iraq-tu
tu.html' not found
Referenced file u'feed_0/article_1/images2010/bg_article_tab_group.jpg' not foun
d
Referenced file u'feed_0/article_1/images2010/bg_breaking_news.jpg' not found
Referenced file '/performinggroups' not found
Referenced file u'feed_0/article_0/stylesheets/bg_filter_icon_l.gif' not found
Referenced file '/news/search_results.html%3fquery%3d%22enda%20kenny%22' not fou
nd
34% Running transforms on ebook...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 8.70480pt
Removing fake margins...
Removing level div_1 left margin of: auto
Removing level div_1 right margin of: auto
Removing level div_2 left margin of: 7px
Removing level div_2 right margin of: 7px
Cleaning up manifest...
Trimming unused files from manifest...
Trimming u'feed_0/article_0/images/img22.jpg' from manifest
Trimming u'feed_0/article_0/images/img1.jpg' from manifest
Trimming u'feed_0/article_0/images/img21.jpg' from manifest
Trimming u'feed_0/article_1/images/img11.jpg' from manifest
Trimming u'feed_0/article_1/images/img1.jpg' from manifest
Trimming u'feed_0/article_1/images/img31.jpg' from manifest
Trimming u'feed_0/article_0/images/img2.jpg' from manifest
Trimming u'feed_0/article_1/images/img32.jpg' from manifest
Trimming u'feed_0/article_1/images/img2.jpg' from manifest
Trimming u'feed_0/article_0/images/img4.jpg' from manifest
Creating EPUB Output...
67% Creating EPUB Output
Found non-unique filenames, renaming to support broken EPUB readers like FBReade
r, Aldiko and Stanza...
Splitting markup on page breaks and flow limits, if any...
        Looking for large trees in feed_0/article_0/index.html...
        No large trees found
        Looking for large trees in index_u2.html...
        No large trees found
        Looking for large trees in feed_0/index_u3.html...
        No large trees found
        Looking for large trees in feed_0/article_1/index_u1.html...
        No large trees found
The cover image has an id != "cover". Renaming to work around bug in Nook Color
EPUB output written to D:\My Python Sample Code\Calibre Recipes\rte.epub
Output saved to   D:\My Python Sample Code\Calibre Recipes\rte.epub
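
For reference, the structure that parse_index returns in the recipe above can be sketched without calibre itself. The helper below is hypothetical (build_index is not a calibre function); it only mirrors what RTE_parse_section and parse_index assemble: a list of (section title, article list) tuples, where each article is a dictionary with the same keys the recipe uses. The sample headline and URL are illustrative values borrowed from the log, and pairing them is an assumption.

```python
def build_index(section_title, links):
    """Build the list that the recipe's parse_index() returns.

    `links` is an iterable of (headline, url) pairs, like recipeUrl above.
    build_index is a hypothetical helper, not part of calibre.
    """
    articles = []
    for title, url in links:
        articles.append({
            'title': title,
            'url': url,
            'description': '',
            'date': '',
        })
    # One (section title, articles) tuple per section of the ebook.
    return [(section_title, articles)]


if __name__ == '__main__':
    # Illustrative values only; the headline/URL pairing is assumed.
    sample = [('Taoiseach congratulates Paralympic winners',
               'http://www.rte.ie/news/2012/0901/paralympics-gold.html')]
    print(build_index('News headlines', sample))
```

Printing the result before running a full conversion is one cheap way to check that the scraped titles and URLs look sane.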

eroche · 09-03-2012, 10:30 AM

I figured out what was wrong. The links were downloading the whole page, and this caused problems when converting from HTML to EPUB. It looked fine in the HTML debugging session, but all the extra junk was causing a problem on my reader. I added some keep_only_tags and remove_tags to my recipe to get it working. I will continue to work on the recipe, and when it's in good shape I will post the working version here.
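
A minimal sketch of what such filters might look like, assuming the element ids that appear in the later version of the recipe in this thread ('news-article-container', 'related', 'tab-group'). The class name RTECleaned is made up for illustration, and the import fallback exists only so the snippet can be loaded outside calibre; it is not something a real recipe needs.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Assumption: outside calibre, fall back to a plain base class
    # so this sketch can still be loaded and inspected.
    BasicNewsRecipe = object


class RTECleaned(BasicNewsRecipe):
    # Hypothetical recipe name, for illustration only.
    title = 'RTE (cleaned sketch)'

    # Ignore the site's own stylesheets.
    no_stylesheets = True

    # Keep only the element that holds the story itself.
    keep_only_tags = [
        dict(name='div', attrs={'id': 'news-article-container'}),
    ]

    # Then strip leftovers that survive inside the kept element.
    remove_tags = [
        dict(name='script'),
        dict(name='div', attrs={'id': ['related', 'tab-group']}),
    ]
```

The idea is that keep_only_tags throws away everything outside the matching elements, while remove_tags deletes specific elements wherever they still occur.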

scissors · 09-05-2012, 01:55 PM

That seems like a lot of hard work when calibre produces a pretty good output with

Spoiler:

eroche · 09-07-2012, 03:13 AM

Thanks Scissors, I know it does. I guess it's my way of learning how calibre works. I wanted to parse the links myself from the web page so that I can test for duplicates (when I combine multiple RSS feeds) and manually pick which ones to include. My latest version for the RTE website is:

Spoiler:

#The following recipe extracts the text from all the RSS articles that are linked. The photos on the RTE website do not lend themselves to being included in a recipe
from BeautifulSoup import BeautifulSoup
import urllib2
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.feeds.feedparser import parse
#import sys

global newsUrl
newsUrl = []
global sportUrl
sportUrl = []
global businessUrl
businessUrl = []
global masterUrl
masterUrl = []

class RTE(BasicNewsRecipe):
    title = 'RTE (Ireland)'
    description = 'Morning Newspaper from Ireland'
    __author__ = 'Edward Roche'
    language = 'en'
    oldest_article = 1.0

    #Start by getting the rss feeds and saving them in lists by section
    #News Headlines
    entries = parse('http://www.rte.ie/rss/news.xml').entries
    for i, item in enumerate(entries):
        feedtitle = item.get('title')
        link = item.get('link')
        description = item.get('description')
        author = item.get('author')
        date = item.get('date')
        newsUrl.append( ( feedtitle , link, date))
        masterUrl.append(link)

    #Business Headlines
    entries = parse('http://www.rte.ie/rss/business.xml').entries
    for i, item in enumerate(entries):
        feedtitle = item.get('title')
        link = item.get('link')
        description = item.get('description')
        author = item.get('author')
        date = item.get('date')
        duplicateInd = False
        for i in masterUrl:
            if link == i:
                duplicateInd = True
                print "duplicate found =, ", link
        if duplicateInd == False:
            businessUrl.append( ( feedtitle , link, date))
            masterUrl.append(link)

    #Sports Headlines
    entries = parse('http://www.rte.ie/rss/sport.xml').entries
    for i, item in enumerate(entries):
        feedtitle = item.get('title')
        link = item.get('link')
        description = item.get('description')
        author = item.get('author')
        date = item.get('date')
        duplicateInd = False
        for i in masterUrl:
            if link == i:
                duplicateInd = True
                print "duplicate found =, ", link
        if duplicateInd == False:
            sportUrl.append( ( feedtitle , link, date))
            masterUrl.append(link)

    #The saved lists will each make up an article group in the ebook. For each article group add the headings to the TOC
    def parse_index(self):
        feeds = []
        articles = self.RTE_parse_section(newsUrl)
        feeds.append(('News Headlines', articles))
        articles = self.RTE_parse_section(businessUrl)
        feeds.append(('Business Headlines', articles))
        articles = self.RTE_parse_section(sportUrl)
        feeds.append(('Sport Headlines', articles))
        return feeds

    #Each article group will be made up of articles, set up the articles based on the URLS that we have already gotten
    def RTE_parse_section(self, link):
        current_articles = []
        for file in link:
            current_articles.append({'title': file[0], 'url': file[1], 'description':'', 'date':file[2]})
        return current_articles

    #Clean up the output
    keep_only_tags = [
        dict(name='div',attrs={'id': ['news-article-container']})
        #,dict(name='article',attrs={'class': ['rte-sport-article']})
        ,dict(name='div',attrs={'class': ['rte_gr_8']})
    ]

    remove_tags_after = [
        dict(name='ul',attrs={'class': 'keywords'})
        ,dict(name='p',attrs={'class': 'sticky-footer-leadin'})
        ,dict(name='div',attrs={'id': 'storyBody'})
    ]

    remove_tags = [
        dict(name='ul',attrs={'class': 'keywords'})
        ,dict(name='div',attrs={'id': ['user-options-top','tab-group','related','photography','user-options-bottom']})
        ,dict(name='div',attrs={'class': ['clear','photo-count','thumbnails','news-gallery-regular','side-content multimedia video','side-content multimedia audio']})
        ,dict(name='a',attrs={'class': ['photo-prev','photo-next']})
        ,dict(name='meta')
        ,dict(name='link')
        ,dict(name='script')
        ,dict(name='figure')
        ,dict(name='p',attrs={'class': 'sticky-footer-leadin'})
        ,dict(name='section',attrs={'id': 'article-media-box'})
        ,dict(name='footer',attrs={'class': 'clearfix'})
        ,dict(name='nav',attrs={'id': 'breadcrumb'})
    ]

    no_stylesheets = True

    extra_css = '''
        body {
            #color: rgb(0,0,0);
            #background-color:rgb(174,174,174);
            text-align:justify;
            line-spacing:1.8;
            #margin-top:0px;
            #margin-bottom:4px;
            #margin-right:50px;
            #margin-left:50px;
            #text-indent:2em;
        }
        h1, h2, h3, h4, h5, h6 {
            #color:white;
            text-align:center;
            font-style:italic;
            font-weight:bold;
        }
        p {
            text-align:left;
        }
        ul{
            list-style: none
        }
        li {
            list-style: none
            padding-top:5px;
        }
        img {
        }
    '''

    def preprocess_html(self, soup):
        #outputFile = 'D:\My Python Sample Code\Calibre Recipes\RTE\RawSoup\output'+soup.title.string+'.html'
        #print "out " +outputFile
        #if 'Final Countdown' in soup.title.string:
        #    sys.exit()
        #f = open(outputFile,"w")
        #f.write(soup.prettify())
        #f.close()
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    def get_cover_url(self):
        url = 'http://dramafestival.ie/index.php_files/images/RTE%20logo.gif'
        return url

This recipe extracts all the text from the news, business and sport RSS feeds. It ignores the pictures as they are difficult to handle from this site.

Last edited by eroche; 09-07-2012 at 06:03 AM.
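
The duplicate check across feeds can also be tried on its own, outside calibre. The sketch below is not the recipe's code: drop_duplicate_links is a hypothetical helper, and it swaps the linear scan of the masterUrl list for a set of already-seen links. Unlike the recipe, it also drops a link repeated within the same section. Entries are assumed to be (title, link, date) tuples, as in the recipe.

```python
def drop_duplicate_links(sections):
    """Remove entries whose link was already seen in an earlier section.

    `sections` is a list of (section_title, entries) pairs in priority
    order; each entry is a (title, link, date) tuple.
    drop_duplicate_links is a hypothetical helper, not part of calibre.
    """
    seen = set()  # plays the role of masterUrl, with constant-time lookups
    result = []
    for section_title, entries in sections:
        kept = []
        for title, link, date in entries:
            if link in seen:
                continue  # already listed earlier, skip it
            seen.add(link)
            kept.append((title, link, date))
        result.append((section_title, kept))
    return result
```

Given the news, business and sport entries in that order, a story that appears in both the news and business feeds would stay under news only, which matches the order of preference in the recipe.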
