Old 09-02-2012, 06:24 AM   #1
eroche
Junior Member
eroche began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Sep 2012
Device: sony ereader
Help to debug and download pages from the web

Hi, I am new to Calibre and I am trying to set up a recipe to download the headlines from a news site, but my recipe is not working and I am at a loss as to how to debug it.
The logic in my code is to open a webpage and get the headlines and their links using BeautifulSoup, then use calibre to get the content from these headline pages.
It seems to be nearly working, but the output is a bit funny. The text on some of the pages of the generated ebook is cut off, and when I transfer it to my e-reader all the articles are blank; it just shows the table of contents.
In terms of debugging, I don't know how to debug the output. If I use the --test option I get output, but I still don't understand what's wrong. Any help would be appreciated. My code is below, followed by the output from the command prompt:

Code:
from BeautifulSoup import BeautifulSoup
import urllib2
from calibre.web.feeds.news import BasicNewsRecipe

global recipeUrl
recipeUrl = []
global startUrl
startUrl = []

class RTE(BasicNewsRecipe):
	title = 'RTE links in Ebook Format'
	description = 'Headlines from Ireland and International'
	__author__  = 'Edward Roche'
	language = 'en'
	
	startUrl = [('http://www.rte.ie/news')]
			
	for link in startUrl:
		soup = BeautifulSoup(urllib2.urlopen(link).read())
		address= soup.find('div', {'id' : "more-headlines"})
		while address.findNext('dd'):
			address = address.findNext('dd')
			if not address.text == 'Full News Index':
				recipeUrl.append( (address.text , 'http://www.rte.ie' + address.a['href']))
	
	def parse_index(self):
		feeds = []
		articles = self.RTE_parse_section('')
		feeds.append(('News headlines', articles))
		return feeds
		
	def RTE_parse_section(self, link):
		current_articles = []
		for file in recipeUrl:
			current_articles.append({'title': file[0], 'url': file[1], 'description':'', 'date':''})
		return current_articles



Code:
D:\My Python Sample Code\Calibre Recipes>ebook-convert rte2.recipe rte.epub --test
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
1% Fetching feeds...
1% Got feeds from index page
1% Trying to download cover...
1% Generating masthead...
Synthesizing mastheadImage
1% Starting download [4 thread(s)]...
Traceback (most recent call last):
Traceback (most recent call last):
  File "site-packages\calibre\web\fetch\simple.py", line 371, in process_images
  File "site-packages\calibre\web\fetch\simple.py", line 371, in process_images
  File "site-packages\PIL\Image.py", line 1982, in open
  File "site-packages\PIL\Image.py", line 1982, in open
IOError: cannot identify image file
IOError: cannot identify image file
17% Article downloaded: Taoiseach congratulates Paralympic winners
34% Article downloaded: Daly leaves Socialist Party after Wallace row
34% Feeds downloaded to C:\Users\Edward\AppData\Local\Temp\calibre_0.8.50_tmp_rkz6ug\dkga2y_plumber\index.html
34% Download finished
Parsing all content...
Forcing index.html into XHTML namespace
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:small;'. [6:2: *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font:x-small;'. [8:2: *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:108%;'. [34:2: *]
Forcing feed_0/index.html into XHTML namespace
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:small;'. [6:2: *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font:x-small;'. [8:2: *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:108%;'. [34:2: *]
Referenced file u'feed_0/article_0/images2010/bg_article_tab_group.jpg' not foun
d
Referenced file '/rtejr' not found
Referenced file u'feed_0/article_0/images2010/sprite_main.png' not found
Referenced file u'feed_0/article_1/images2010/bg_subhead_presidential_election_2
011.png' not found
Referenced file '/lifestyle/fashion' not found
Referenced file u'feed_0/article_1/recommend' not found
Referenced file '/news/ireland.html' not found
Referenced file '/about/en/policies-and-reports/policies-guidelines/2012/0417/31
7412-rte-data-protection-' not found
Referenced file u'feed_1/index.html' not found
Referenced file '/live' not found
Referenced file '/news/2012/0901/paralympics-gold.html' not found
Referenced file '/lotto' not found
Referenced file '/news/politics.html' not found
Referenced file u'feed_0/article_0/images2010/bg_audio_list_item.jpg' not found
Referenced file u'feed_0/article_0/images2010/bg_subhead_vote_2011.png' not foun
d
Referenced file '/news/business.html' not found
Referenced file u'feed_0/article_0/images2010/header_frontline.jpg' not found
Referenced file u'/news/images2010/news-top-referendum.png' not found
Referenced file '/lifestyle/motors' not found
Referenced file '/news/2012/0901/missing-sisters.html' not found
Referenced file '/about/en/information-and-feedback/contact-rte' not found
Referenced file '/about/en/information-and-feedback/complaints' not found
Referenced file '/news/election2011' not found
Referenced file '/ten/guide.html' not found
Referenced file '/news/player_live.html' not found
Referenced file '/player' not found
Referenced file '/aertel' not found
Referenced file '/lifestyle/food' not found
Referenced file '/shop' not found
Referenced file '/news/2012/0901/clare-daly-socialist-party.html' not found
Referenced file '/lifestyle' not found
Referenced file '/lifestyle/homes' not found
Referenced file '/news' not found
Referenced file '/news/index.html' not found
Referenced file '/dating' not found
Referenced file '/about/en/serving-our-audience/2012/0815/333706-terms-and-condi
tions-for-rte-ie' not found
Referenced file '/news/player.html' not found
Referenced file '/news/fiscal-treaty.html' not found
Referenced file u'feed_0/article_1/images2010/article_video_thumbnail_overlay.pn
g' not found
Referenced file '/news/2012/0901/mary-coughlan-husband-death.html' not found
Referenced file '/' not found
Referenced file u'feed_0/article_1/images2010/header_frontline.jpg' not found
Referenced file u'/news/images2010/sprite_main.png' not found
Referenced file '/business' not found
Referenced file u'feed_0/article_0/recommend' not found
Referenced file '/ten' not found
Referenced file u'/news/images2010/bg_video_container.gif' not found
Referenced file u'feed_0/article_0/images2010/article_audio_thumbnail_overlay.pn
g' not found
Referenced file '/sport' not found
Referenced file '/news/911' not found
Referenced file u'feed_0/article_0/images2010/bg_subhead_presidential_election_2
011.png' not found
Referenced file '/about/en/working-with-rte/vacancies' not found
Referenced file '/news/presidentialelection.html' not found
Referenced file '/about/en/policies-and-reports/policies-guidelines/2012/0417/31
7440-rte-privacy-statement' not found
Referenced file '/news/galleries.html' not found
Referenced file u'feed_0/article_1/stylesheets/bg_filter_icon_l.gif' not found
Referenced file '/about/en/how-rte-is-run/2012/0221/291618-the-license-fee' not
found
Referenced file u'feed_0/article_1/images2010/bg_audio_list_item.jpg' not found
Referenced file u'/news/images2010/player-loader.gif' not found
Referenced file '/news/vote2011' not found
Referenced file '/newsnow' not found
Referenced file '/extra/apps.html' not found
Referenced file '/archives' not found
Referenced file u'feed_0/article_1/images2010/sprite_main.png' not found
Referenced file '/about' not found
Referenced file u'feed_0/article_0/images2010/bg_breaking_news.jpg' not found
Referenced file '/news/2012/0901/american-football-tourists.html' not found
Referenced file '/tv' not found
Referenced file '/news/search_results.html%3fquery%3d%22clare%20daly%22' not fou
nd
Referenced file u'feed_0/article_1/images2010/september_11th_bg.png' not found
Referenced file u'feed_0/article_0/images2010/sprite_subscribe.png' not found
Referenced file '/jobs' not found
Referenced file '/about/en/working-with-rte/2012/0727/330809-advertisers' not fo
und
Referenced file '/news/world.html' not found
Referenced file u'feed_0/article_0/images2010/article_video_thumbnail_overlay.pn
g' not found
Referenced file '/about/en/policies-and-reports/annual-reports' not found
Referenced file u'feed_0/article_0/images2010/bg_nav_current.jpg' not found
Referenced file '/news/2012/0901/man-appears-before-special-sitting-of-dundalk-c
t.html' not found
Referenced file u'feed_0/article_0/images2010/bg_photo_count.png' not found
Referenced file '/news/2012/0901/croke-park-agreement.html' not found
Referenced file '/news/money/index.html' not found
Referenced file u'/news/election2011/images/masthead1.png' not found
Referenced file u'feed_0/article_1/images2010/bg_photo_count.png' not found
Referenced file u'feed_0/article_1/images2010/bg_nav_current.jpg' not found
Referenced file '/lifestyle/travel' not found
Referenced file u'feed_0/article_1/images2010/bg_subhead_vote_2011.png' not foun
d
Referenced file '/weather' not found
Referenced file '/news/business' not found
Referenced file u'feed_0/article_0/images2010/bg_header.jpg' not found
Referenced file u'/images/ajax-loader.gif' not found
Referenced file u'feed_0/article_0/images2010/september_11th_bg.png' not found
Referenced file '/emails' not found
Referenced file '/news/search_results.html%3fquery%3d%22paralympics%22' not foun
d
Referenced file '/radio' not found
Referenced file u'feed_0/article_1/images2010/bg_header.jpg' not found
Referenced file u'feed_0/article_1/images2010/article_audio_thumbnail_overlay.pn
g' not found
Referenced file '/news/special-reports.html' not found
Referenced file u'feed_0/article_1/images2010/sprite_subscribe.png' not found
Referenced file '/about/en/policies-and-reports/policies-guidelines/2012/0815/33
3711-user-comments-terms-of-use' not found
Referenced file '/trte' not found
Referenced file '/news/2012/0902/former-leaders-should-face-charges-over-iraq-tu
tu.html' not found
Referenced file u'feed_0/article_1/images2010/bg_article_tab_group.jpg' not foun
d
Referenced file u'feed_0/article_1/images2010/bg_breaking_news.jpg' not found
Referenced file '/performinggroups' not found
Referenced file u'feed_0/article_0/stylesheets/bg_filter_icon_l.gif' not found
Referenced file '/news/search_results.html%3fquery%3d%22enda%20kenny%22' not fou
nd
34% Running transforms on ebook...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 8.70480pt
Removing fake margins...
Removing level div_1 left margin of: auto
Removing level div_1 right margin of: auto
Removing level div_2 left margin of: 7px
Removing level div_2 right margin of: 7px
Cleaning up manifest...
Trimming unused files from manifest...
Trimming u'feed_0/article_0/images/img22.jpg' from manifest
Trimming u'feed_0/article_0/images/img1.jpg' from manifest
Trimming u'feed_0/article_0/images/img21.jpg' from manifest
Trimming u'feed_0/article_1/images/img11.jpg' from manifest
Trimming u'feed_0/article_1/images/img1.jpg' from manifest
Trimming u'feed_0/article_1/images/img31.jpg' from manifest
Trimming u'feed_0/article_0/images/img2.jpg' from manifest
Trimming u'feed_0/article_1/images/img32.jpg' from manifest
Trimming u'feed_0/article_1/images/img2.jpg' from manifest
Trimming u'feed_0/article_0/images/img4.jpg' from manifest
Creating EPUB Output...
67% Creating EPUB Output
Found non-unique filenames, renaming to support broken EPUB readers like FBReader, Aldiko and Stanza...
Splitting markup on page breaks and flow limits, if any...
        Looking for large trees in feed_0/article_0/index.html...
        No large trees found
        Looking for large trees in index_u2.html...
        No large trees found
        Looking for large trees in feed_0/index_u3.html...
        No large trees found
        Looking for large trees in feed_0/article_1/index_u1.html...
        No large trees found
The cover image has an id != "cover". Renaming to work around bug in Nook Color
EPUB output written to D:\My Python Sample Code\Calibre Recipes\rte.epub
Output saved to   D:\My Python Sample Code\Calibre Recipes\rte.epub
Old 09-03-2012, 10:30 AM   #2
eroche
I figured out what was wrong. The links were actually downloading the whole page, and this was causing problems when converting from HTML to EPUB. It looked fine in the HTML debugging session, but all the extra junk was causing a problem on my reader. I added some keep_only_tags and remove_tags to my recipe to get it working. I will continue to work on the recipe, and when it's in good shape I will post the working version here.
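
For anyone finding this thread later, here is a minimal sketch of the kind of change described above, not the exact recipe. It assumes the article text sits in a div with the id news-article-container and that the related and photography divs are unwanted (these ids are taken from the later recipe in this thread). The class name RTECleaned is made up for illustration, and the try/except around the import is only there so the snippet can be run outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Assumption: fallback only so the snippet can be inspected outside calibre
    BasicNewsRecipe = object


class RTECleaned(BasicNewsRecipe):  # hypothetical name, for illustration only
    title = 'RTE cleaned example'

    # Keep only the element that holds the article text; everything else
    # on the downloaded page (menus, footers, sidebars) is discarded.
    keep_only_tags = [
        dict(name='div', attrs={'id': 'news-article-container'}),
    ]

    # Then drop unwanted pieces that are still inside the kept element.
    remove_tags = [
        dict(name='div', attrs={'id': ['related', 'photography']}),
        dict(name='script'),
    ]
```

The idea is to cut the page down first with keep_only_tags and then use remove_tags for whatever is left inside the kept part, which keeps the list of things to remove short.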
Old 09-05-2012, 01:55 PM   #3
scissors
Addict
scissors ought to be getting tired of karma fortunes by now.
 
Posts: 203
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
That seems like a lot of hard work when calibre produces pretty good output with:

Code:
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1346865225(BasicNewsRecipe):
    title = u'RTE News'

    max_articles_per_feed = 20
    oldest_article = 1

    remove_empty_feeds = True
    remove_javascript = True
    # auto_cleanup = True
    # auto_cleanup_keep only takes effect when auto_cleanup is enabled
    auto_cleanup_keep = '//div[@id="photography"]|//span[@class="side-content"]'

    remove_tags = [
        dict(attrs={'id': 'header-print'})
    ]
    # feeds is a list of (title, url) tuples
    feeds = [(u'News', u'http://www.rte.ie/rss/news.xml')]
Old 09-07-2012, 03:13 AM   #4
eroche
Thanks scissors. I know it does; I guess this is my way of learning how calibre works. I wanted to parse the links from the webpage myself so that I can test for duplicates (when I combine multiple RSS feeds) and manually choose which ones to include. My latest version for the RTE website is:

Code:
# The following recipe extracts the text from all the RSS articles that are linked.
# The photos on the RTE website do not lend themselves to being included in a recipe.

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.feeds.feedparser import parse
# import sys

# (title, link, date) tuples for each section, filled in while the class body runs
newsUrl = []
sportUrl = []
businessUrl = []
# Every link seen so far, used to skip duplicates across the feeds
masterUrl = []


class RTE(BasicNewsRecipe):
    title = 'RTE (Ireland)'
    description = 'Morning Newspaper from Ireland'
    __author__ = 'Edward Roche'
    language = 'en'
    oldest_article = 1.0

    # Start by getting the rss feeds and saving them in lists by section.
    # Note: nothing in these loops may be assigned to `description`, because a
    # name assigned in the class body would replace the recipe description above.

    # News Headlines
    entries = parse('http://www.rte.ie/rss/news.xml').entries
    for item in entries:
        feedtitle = item.get('title')
        link = item.get('link')
        date = item.get('date')
        newsUrl.append((feedtitle, link, date))
        masterUrl.append(link)

    # Business Headlines
    entries = parse('http://www.rte.ie/rss/business.xml').entries
    for item in entries:
        feedtitle = item.get('title')
        link = item.get('link')
        date = item.get('date')
        if link in masterUrl:
            print("duplicate found: %s" % link)
        else:
            businessUrl.append((feedtitle, link, date))
            masterUrl.append(link)

    # Sports Headlines
    entries = parse('http://www.rte.ie/rss/sport.xml').entries
    for item in entries:
        feedtitle = item.get('title')
        link = item.get('link')
        date = item.get('date')
        if link in masterUrl:
            print("duplicate found: %s" % link)
        else:
            sportUrl.append((feedtitle, link, date))
            masterUrl.append(link)

    # The saved lists will each make up an article group in the ebook.
    # For each article group add the headings to the TOC.
    def parse_index(self):
        feeds = []
        articles = self.RTE_parse_section(newsUrl)
        feeds.append(('News Headlines', articles))
        articles = self.RTE_parse_section(businessUrl)
        feeds.append(('Business Headlines', articles))
        articles = self.RTE_parse_section(sportUrl)
        feeds.append(('Sport Headlines', articles))
        return feeds

    # Each article group will be made up of articles; set up the articles
    # based on the URLs that we have already gotten.
    def RTE_parse_section(self, link):
        current_articles = []
        for file in link:
            current_articles.append({'title': file[0], 'url': file[1], 'description': '', 'date': file[2]})
        return current_articles

    # Clean up the output
    keep_only_tags = [
        dict(name='div', attrs={'id': ['news-article-container']}),
        # dict(name='article', attrs={'class': ['rte-sport-article']}),
        dict(name='div', attrs={'class': ['rte_gr_8']}),
    ]

    remove_tags_after = [
        dict(name='ul', attrs={'class': 'keywords'}),
        dict(name='p', attrs={'class': 'sticky-footer-leadin'}),
        dict(name='div', attrs={'id': 'storyBody'}),
    ]

    remove_tags = [
        dict(name='ul', attrs={'class': 'keywords'}),
        dict(name='div', attrs={'id': ['user-options-top', 'tab-group', 'related', 'photography', 'user-options-bottom']}),
        dict(name='div', attrs={'class': ['clear', 'photo-count', 'thumbnails', 'news-gallery-regular', 'side-content multimedia video', 'side-content multimedia audio']}),
        dict(name='a', attrs={'class': ['photo-prev', 'photo-next']}),
        dict(name='meta'),
        dict(name='link'),
        dict(name='script'),
        dict(name='figure'),
        dict(name='p', attrs={'class': 'sticky-footer-leadin'}),
        dict(name='section', attrs={'id': 'article-media-box'}),
        dict(name='footer', attrs={'class': 'clearfix'}),
        dict(name='nav', attrs={'id': 'breadcrumb'}),
    ]

    no_stylesheets = True

    extra_css = '''
    body {
        /* color: rgb(0,0,0); */
        /* background-color: rgb(174,174,174); */
        text-align: justify;
        line-height: 1.8;
        /* margin-top: 0px; */
        /* margin-bottom: 4px; */
        /* margin-right: 50px; */
        /* margin-left: 50px; */
        /* text-indent: 2em; */
    }
    h1, h2, h3, h4, h5, h6 {
        /* color: white; */
        text-align: center;
        font-style: italic;
        font-weight: bold;
    }
    p {
        text-align: left;
    }
    ul {
        list-style: none;
    }
    li {
        list-style: none;
        padding-top: 5px;
    }
    img {
    }
    '''

    def preprocess_html(self, soup):
        # Debugging aid: dump the raw soup of each article to a file
        # outputFile = 'D:\My Python Sample Code\Calibre Recipes\RTE\RawSoup\output'+soup.title.string+'.html'
        # print "out " +outputFile
        # if 'Final Countdown' in soup.title.string:
        #     sys.exit()
        # f = open(outputFile,"w")
        # f.write(soup.prettify())
        # f.close()

        # Replace each simple link with its plain text
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    def get_cover_url(self):
        url = 'http://dramafestival.ie/index.php_files/images/RTE%20logo.gif'
        return url


This recipe extracts all the text from the news, business and sport rss feeds. It ignores the pictures as they are difficult to handle from this site.

Last edited by eroche; 09-07-2012 at 06:03 AM.
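
As a rough illustration of the duplicate check described above, the sketch below builds the list that parse_index returns (a list of (section title, list of article dicts) pairs) while skipping any URL already used in an earlier section. The function name dedupe_sections and the example.com URLs are made up for the example, and it uses a set rather than a list for the links already seen. It is plain Python and does not need calibre to run.

```python
def dedupe_sections(sections):
    """Build a parse_index-style list from (section_title, items) pairs.

    Each item is a (title, url, date) tuple. A URL is kept only the first
    time it appears, so later sections never repeat an earlier article.
    """
    seen = set()  # URLs already placed in some section
    feeds = []
    for section_title, items in sections:
        articles = []
        for title, url, date in items:
            if url in seen:
                continue  # duplicate of an article in an earlier section
            seen.add(url)
            articles.append({'title': title, 'url': url,
                             'description': '', 'date': date})
        feeds.append((section_title, articles))
    return feeds


# Made-up sample data standing in for the parsed RSS entries
sample = [
    ('News Headlines', [('Story A', 'http://example.com/a', ''),
                        ('Story B', 'http://example.com/b', '')]),
    ('Business Headlines', [('Story B again', 'http://example.com/b', ''),
                            ('Story C', 'http://example.com/c', '')]),
]

feeds = dedupe_sections(sample)
for section_title, articles in feeds:
    print(section_title, [a['title'] for a in articles])
```

With the sample data, Story B stays under News Headlines and is dropped from Business Headlines, because the News section is processed first. The order of the sections therefore decides which section keeps a shared article.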