09-02-2012, 06:24 AM | #1 |
Junior Member
Posts: 4
Karma: 10
Join Date: Sep 2012
Device: sony ereader
|
Help to debug and download pages from the web
Hi, I am new to Calibre and I am trying to set up a recipe to download the headlines from a news site, but my recipe is not working and I am at a loss as to how to debug it.
The logic in my code is to open a web page, get the headlines and their links using BeautifulSoup, and then use calibre to get the content from these headline pages. It seems to be nearly working, but the output is a bit odd. The text on some pages of the generated ebook is cut off, and when I transfer it to my ereader all the articles are blank; it just shows the table of contents. I don't know how to debug the output. If I use the --test option I get output, but I still don't understand what's wrong. Any help would be appreciated. My code is below, followed by the output from the command prompt:
Code:
from BeautifulSoup import BeautifulSoup
import urllib2
from calibre.web.feeds.news import BasicNewsRecipe

global recipeUrl
recipeUrl = []
global startUrl
startUrl = []

class RTE(BasicNewsRecipe):
    title = 'RTE links in Ebook Format'
    description = 'Headlines from Ireland and International'
    __author__ = 'Edward Roche'
    language = 'en'

    startUrl = [('http://www.rte.ie/news')]
    for link in startUrl:
        soup = BeautifulSoup(urllib2.urlopen(link).read())
        address = soup.find('div', {'id': "more-headlines"})
        while address.findNext('dd'):
            address = address.findNext('dd')
            if not address.text == 'Full News Index':
                recipeUrl.append((address.text, 'http://www.rte.ie' + address.a['href']))

    def parse_index(self):
        feeds = []
        articles = self.RTE_parse_section('')
        feeds.append(('News headlines', articles))
        return feeds

    def RTE_parse_section(self, link):
        current_articles = []
        for file in recipeUrl:
            current_articles.append({'title': file[0], 'url': file[1], 'description': '', 'date': ''})
        return current_articles
Code:
D:\My Python Sample Code\Calibre Recipes>ebook-convert rte2.recipe rte.epub --test
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
1% Fetching feeds...
1% Got feeds from index page
1% Trying to download cover...
1% Generating masthead...
Synthesizing mastheadImage
1% Starting download [4 thread(s)]...
Traceback (most recent call last):
Traceback (most recent call last):
File "site-packages\calibre\web\fetch\simple.py", line 371, in process_images
File "site-packages\calibre\web\fetch\simple.py", line 371, in process_images
File "site-packages\PIL\Image.py", line 1982, in open
File "site-packages\PIL\Image.py", line 1982, in open
IOError: cannot identify image file
IOError: cannot identify image file
17% Article downloaded: Taoiseach congratulates Paralympic winners
34% Article downloaded: Daly leaves Socialist Party after Wallace row
34% Feeds downloaded to C:\Users\Edward\AppData\Local\Temp\calibre_0.8.50_tmp_rkz6ug\dkga2y_plumber\index.html
34% Download finished
Parsing all content...
Forcing index.html into XHTML namespace
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:small;'. [6:2: *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font:x-small;'. [8:2: *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:108%;'. [34:2: *]
Forcing feed_0/index.html into XHTML namespace
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:small;'. [6:2: *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font:x-small;'. [8:2: *]
CSSStyleDeclaration: Unexpected token, ignoring upto u'*font-size:108%;'. [34:2: *]
Referenced file u'feed_0/article_0/images2010/bg_article_tab_group.jpg' not found
Referenced file '/rtejr' not found
Referenced file u'feed_0/article_0/images2010/sprite_main.png' not found
Referenced file u'feed_0/article_1/images2010/bg_subhead_presidential_election_2011.png' not found
Referenced file '/lifestyle/fashion' not found
Referenced file u'feed_0/article_1/recommend' not found
Referenced file '/news/ireland.html' not found
Referenced file '/about/en/policies-and-reports/policies-guidelines/2012/0417/317412-rte-data-protection-' not found
Referenced file u'feed_1/index.html' not found
Referenced file '/live' not found
Referenced file '/news/2012/0901/paralympics-gold.html' not found
Referenced file '/lotto' not found
Referenced file '/news/politics.html' not found
Referenced file u'feed_0/article_0/images2010/bg_audio_list_item.jpg' not found
Referenced file u'feed_0/article_0/images2010/bg_subhead_vote_2011.png' not found
Referenced file '/news/business.html' not found
Referenced file u'feed_0/article_0/images2010/header_frontline.jpg' not found
Referenced file u'/news/images2010/news-top-referendum.png' not found
Referenced file '/lifestyle/motors' not found
Referenced file '/news/2012/0901/missing-sisters.html' not found
Referenced file '/about/en/information-and-feedback/contact-rte' not found
Referenced file '/about/en/information-and-feedback/complaints' not found
Referenced file '/news/election2011' not found
Referenced file '/ten/guide.html' not found
Referenced file '/news/player_live.html' not found
Referenced file '/player' not found
Referenced file '/aertel' not found
Referenced file '/lifestyle/food' not found
Referenced file '/shop' not found
Referenced file '/news/2012/0901/clare-daly-socialist-party.html' not found
Referenced file '/lifestyle' not found
Referenced file '/lifestyle/homes' not found
Referenced file '/news' not found
Referenced file '/news/index.html' not found
Referenced file '/dating' not found
Referenced file '/about/en/serving-our-audience/2012/0815/333706-terms-and-conditions-for-rte-ie' not found
Referenced file '/news/player.html' not found
Referenced file '/news/fiscal-treaty.html' not found
Referenced file u'feed_0/article_1/images2010/article_video_thumbnail_overlay.png' not found
Referenced file '/news/2012/0901/mary-coughlan-husband-death.html' not found
Referenced file '/' not found
Referenced file u'feed_0/article_1/images2010/header_frontline.jpg' not found
Referenced file u'/news/images2010/sprite_main.png' not found
Referenced file '/business' not found
Referenced file u'feed_0/article_0/recommend' not found
Referenced file '/ten' not found
Referenced file u'/news/images2010/bg_video_container.gif' not found
Referenced file u'feed_0/article_0/images2010/article_audio_thumbnail_overlay.png' not found
Referenced file '/sport' not found
Referenced file '/news/911' not found
Referenced file u'feed_0/article_0/images2010/bg_subhead_presidential_election_2011.png' not found
Referenced file '/about/en/working-with-rte/vacancies' not found
Referenced file '/news/presidentialelection.html' not found
Referenced file '/about/en/policies-and-reports/policies-guidelines/2012/0417/317440-rte-privacy-statement' not found
Referenced file '/news/galleries.html' not found
Referenced file u'feed_0/article_1/stylesheets/bg_filter_icon_l.gif' not found
Referenced file '/about/en/how-rte-is-run/2012/0221/291618-the-license-fee' not found
Referenced file u'feed_0/article_1/images2010/bg_audio_list_item.jpg' not found
Referenced file u'/news/images2010/player-loader.gif' not found
Referenced file '/news/vote2011' not found
Referenced file '/newsnow' not found
Referenced file '/extra/apps.html' not found
Referenced file '/archives' not found
Referenced file u'feed_0/article_1/images2010/sprite_main.png' not found
Referenced file '/about' not found
Referenced file u'feed_0/article_0/images2010/bg_breaking_news.jpg' not found
Referenced file '/news/2012/0901/american-football-tourists.html' not found
Referenced file '/tv' not found
Referenced file '/news/search_results.html%3fquery%3d%22clare%20daly%22' not found
Referenced file u'feed_0/article_1/images2010/september_11th_bg.png' not found
Referenced file u'feed_0/article_0/images2010/sprite_subscribe.png' not found
Referenced file '/jobs' not found
Referenced file '/about/en/working-with-rte/2012/0727/330809-advertisers' not found
Referenced file '/news/world.html' not found
Referenced file u'feed_0/article_0/images2010/article_video_thumbnail_overlay.png' not found
Referenced file '/about/en/policies-and-reports/annual-reports' not found
Referenced file u'feed_0/article_0/images2010/bg_nav_current.jpg' not found
Referenced file '/news/2012/0901/man-appears-before-special-sitting-of-dundalk-ct.html' not found
Referenced file u'feed_0/article_0/images2010/bg_photo_count.png' not found
Referenced file '/news/2012/0901/croke-park-agreement.html' not found
Referenced file '/news/money/index.html' not found
Referenced file u'/news/election2011/images/masthead1.png' not found
Referenced file u'feed_0/article_1/images2010/bg_photo_count.png' not found
Referenced file u'feed_0/article_1/images2010/bg_nav_current.jpg' not found
Referenced file '/lifestyle/travel' not found
Referenced file u'feed_0/article_1/images2010/bg_subhead_vote_2011.png' not found
Referenced file '/weather' not found
Referenced file '/news/business' not found
Referenced file u'feed_0/article_0/images2010/bg_header.jpg' not found
Referenced file u'/images/ajax-loader.gif' not found
Referenced file u'feed_0/article_0/images2010/september_11th_bg.png' not found
Referenced file '/emails' not found
Referenced file '/news/search_results.html%3fquery%3d%22paralympics%22' not found
Referenced file '/radio' not found
Referenced file u'feed_0/article_1/images2010/bg_header.jpg' not found
Referenced file u'feed_0/article_1/images2010/article_audio_thumbnail_overlay.png' not found
Referenced file '/news/special-reports.html' not found
Referenced file u'feed_0/article_1/images2010/sprite_subscribe.png' not found
Referenced file '/about/en/policies-and-reports/policies-guidelines/2012/0815/333711-user-comments-terms-of-use' not found
Referenced file '/trte' not found
Referenced file '/news/2012/0902/former-leaders-should-face-charges-over-iraq-tutu.html' not found
Referenced file u'feed_0/article_1/images2010/bg_article_tab_group.jpg' not found
Referenced file u'feed_0/article_1/images2010/bg_breaking_news.jpg' not found
Referenced file '/performinggroups' not found
Referenced file u'feed_0/article_0/stylesheets/bg_filter_icon_l.gif' not found
Referenced file '/news/search_results.html%3fquery%3d%22enda%20kenny%22' not found
34% Running transforms on ebook...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 8.70480pt
Removing fake margins...
Removing level div_1 left margin of: auto
Removing level div_1 right margin of: auto
Removing level div_2 left margin of: 7px
Removing level div_2 right margin of: 7px
Cleaning up manifest...
Trimming unused files from manifest...
Trimming u'feed_0/article_0/images/img22.jpg' from manifest
Trimming u'feed_0/article_0/images/img1.jpg' from manifest
Trimming u'feed_0/article_0/images/img21.jpg' from manifest
Trimming u'feed_0/article_1/images/img11.jpg' from manifest
Trimming u'feed_0/article_1/images/img1.jpg' from manifest
Trimming u'feed_0/article_1/images/img31.jpg' from manifest
Trimming u'feed_0/article_0/images/img2.jpg' from manifest
Trimming u'feed_0/article_1/images/img32.jpg' from manifest
Trimming u'feed_0/article_1/images/img2.jpg' from manifest
Trimming u'feed_0/article_0/images/img4.jpg' from manifest
Creating EPUB Output...
67% Creating EPUB Output
Found non-unique filenames, renaming to support broken EPUB readers like FBReader, Aldiko and Stanza...
Splitting markup on page breaks and flow limits, if any...
Looking for large trees in feed_0/article_0/index.html...
No large trees found
Looking for large trees in index_u2.html...
No large trees found
Looking for large trees in feed_0/index_u3.html...
No large trees found
Looking for large trees in feed_0/article_1/index_u1.html...
No large trees found
The cover image has an id != "cover". Renaming to work around bug in Nook Color
EPUB output written to D:\My Python Sample Code\Calibre Recipes\rte.epub
Output saved to D:\My Python Sample Code\Calibre Recipes\rte.epub
|
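The recipe in this post scrapes the headlines in the class body, so the page is fetched as soon as the recipe file is loaded. A minimal sketch of doing the same work inside parse_index is given here, written for Python 3, which current calibre versions use. The helper links_to_articles and the class name RTEHeadlines are hypothetical names chosen for illustration; index_to_soup and tag_to_string are methods of BasicNewsRecipe. The sketch only looks at dd elements inside the more-headlines div, which is an assumption about the page layout and is narrower than the findNext loop in the post.

```python
from urllib.parse import urljoin

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Outside calibre (e.g. a quick check in a plain interpreter) use a stub.
    BasicNewsRecipe = object

BASE = 'http://www.rte.ie'


def links_to_articles(links, base=BASE, skip=('Full News Index',)):
    """Turn (title, href) pairs into the article dicts parse_index returns.

    Hypothetical helper: relative hrefs are resolved against `base`,
    and titles listed in `skip` are dropped.
    """
    articles = []
    for title, href in links:
        if title in skip:
            continue
        articles.append({
            'title': title,
            'url': urljoin(base, href),
            'description': '',
            'date': '',
        })
    return articles


class RTEHeadlines(BasicNewsRecipe):
    title = 'RTE links in Ebook Format'
    language = 'en'

    def parse_index(self):
        # Fetch the index page only when calibre asks for the article list.
        soup = self.index_to_soup(BASE + '/news')
        box = soup.find('div', {'id': 'more-headlines'})
        links = []
        # ASSUMPTION: the headlines are <dd> elements inside this div.
        for dd in (box.findAll('dd') if box else []):
            if dd.a is not None and dd.a.get('href'):
                links.append((self.tag_to_string(dd), dd.a['href']))
        # parse_index returns a list of (feed title, list of article dicts).
        return [('News headlines', links_to_articles(links))]
```

Keeping the pair-to-dict step in a plain function means it can be checked in an ordinary Python session without running ebook-convert.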
09-03-2012, 10:30 AM | #2 |
Junior Member
Posts: 4
Karma: 10
Join Date: Sep 2012
Device: sony ereader
|
I figured out what was wrong. The links were downloading the whole page, and this was causing problems when converting from HTML to the EPUB format. It looked fine in the HTML debugging session, but all the extra junk was causing a problem on my reader. I added some keep_only_tags and remove_tags to my recipe to get it working. I will continue to work on the recipe, and when it's in good shape I will post the working version here.
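For readers unfamiliar with these two attributes, a rough sketch of how they might be set is given here. It is not the poster's actual recipe: the class name RTEFiltered and every id and class value in the selectors are hypothetical placeholders, and the real ones have to be read from the article page's HTML. keep_only_tags, remove_tags and no_stylesheets are BasicNewsRecipe attributes.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Outside calibre (e.g. a quick syntax check) fall back to a stub base class.
    BasicNewsRecipe = object


class RTEFiltered(BasicNewsRecipe):
    title = 'RTE links in Ebook Format'
    language = 'en'

    # Keep only the element that holds the article body; everything else
    # (navigation, footers, sidebars) is discarded.
    # ASSUMPTION: the id 'content' is a placeholder, not taken from the site.
    keep_only_tags = [dict(name='div', attrs={'id': 'content'})]

    # Then remove unwanted leftovers inside the kept element.
    # ASSUMPTION: the class name 'related' is a placeholder.
    remove_tags = [
        dict(name='div', attrs={'class': 'related'}),
        dict(name=['script', 'object', 'iframe']),
    ]

    # Skip the site's own stylesheets.
    no_stylesheets = True
```

Each entry is a dict of tag name and attributes that calibre matches against the downloaded page, so narrowing keep_only_tags first usually leaves less for remove_tags to clean up.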
|
09-05-2012, 01:55 PM | #3 |
Addict
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
|
That seems like a lot of hard work when calibre produces a pretty good output with
Spoiler:
|
09-07-2012, 03:13 AM | #4 |
Junior Member
Posts: 4
Karma: 10
Join Date: Sep 2012
Device: sony ereader
|
Thanks Scissors. I know it does; I guess it's my way of learning how calibre works. I wanted to parse the links myself from the web page so that I can test for duplicates etc. (when I combine multiple RSS feeds) and manually identify which ones I want to include. My latest version for the RTE website is:
Spoiler:
This recipe extracts all the text from the news, business and sport RSS feeds. It ignores the pictures, as they are difficult to handle from this site.
Last edited by eroche; 09-07-2012 at 06:03 AM. |
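The duplicate check mentioned in this post could be sketched roughly as follows. This is not the recipe from the spoiler; merge_unique is a hypothetical helper, the URLs in the usage lines are placeholders, and it assumes the feeds are already in the shape parse_index returns, a list of (feed title, list of article dicts). An article counts as a duplicate when its URL, ignoring any fragment and trailing slash, has already appeared in an earlier feed.

```python
def merge_unique(feeds):
    """Drop articles whose URL already appeared in an earlier feed.

    feeds: list of (feed_title, [article dicts with a 'url' key]).
    Returns a new list in the same shape; feeds left empty are omitted.
    """
    seen = set()
    merged = []
    for feed_title, articles in feeds:
        kept = []
        for article in articles:
            # Normalise lightly so '/x.html' and '/x.html#comments' compare equal.
            key = article['url'].split('#')[0].rstrip('/')
            if key in seen:
                continue
            seen.add(key)
            kept.append(article)
        if kept:
            merged.append((feed_title, kept))
    return merged


# Usage with placeholder data: the second feed repeats one article.
news = [{'title': 'A', 'url': 'http://example.com/a.html'},
        {'title': 'B', 'url': 'http://example.com/b.html'}]
sport = [{'title': 'A again', 'url': 'http://example.com/a.html#comments'},
         {'title': 'C', 'url': 'http://example.com/c.html'}]
result = merge_unique([('News', news), ('Sport', sport)])
```

The first feed that lists an article keeps it, so the order of the feeds decides which section a shared story ends up in.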