Old 11-05-2011, 03:46 AM   #1
NotTaken
Connoisseur
Posts: 65
Karma: 4640
Join Date: Aug 2011
Device: kindle
The Independent : Updated recipe for 2011 site redesign

As you probably know, The Independent has recently updated its website, and this has broken the old recipe.

Here is an initial basic recipe for the new site. I thought it would be good to make a thread for people to post improvements.
Attached Files
File Type: zip independent.recipe.zip (1,019 Bytes, 248 views)
Old 11-06-2011, 12:12 AM   #2
NotTaken
for those who like images

Updated the recipe to pull in the images. Do you reckon it's best to limit these to one per article?

Regarding the categories, is it possible to merge them into a parent category if the number of articles is below a certain threshold?
Attached Files
File Type: zip independent-new.recipe.zip (2.2 KB, 215 views)
Old 11-06-2011, 01:19 AM   #3
kovidgoyal
creator of calibre
Posts: 43,826
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
If you want to do some dynamic modifications to the categories, you will need to override parse_feeds in your recipe, like this:

Code:
def parse_feeds(self):
  feeds = BasicNewsRecipe.parse_feeds(self)
  # do something to the feeds
  return feeds
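
As a rough illustration of the "do something to the feeds" step, the sketch below folds small categories into a parent category, as asked about earlier in the thread. It is only a sketch under stated assumptions: the Feed class is a hypothetical stand-in that models just a title and a list of articles, merge_small_feeds and its threshold and separator parameters are invented for this example, and real calibre feed objects may need to be built or modified differently.

```python
from dataclasses import dataclass, field


@dataclass
class Feed:
    # Hypothetical stand-in for a feed object; only the two attributes
    # used below (title and articles) are modelled.
    title: str
    articles: list = field(default_factory=list)


def merge_small_feeds(feeds, threshold=3, separator=' - '):
    """Fold feeds with fewer than `threshold` articles into a parent feed.

    The parent is named after the part of the title before `separator`,
    so 'News - Media' would be merged into 'News'.
    """
    merged = []
    parents = {}
    for feed in feeds:
        # Large feeds, and feeds without a parent prefix, are kept as they are.
        if len(feed.articles) >= threshold or separator not in feed.title:
            merged.append(feed)
            continue
        parent_title = feed.title.split(separator, 1)[0]
        parent = parents.get(parent_title)
        if parent is None:
            # The parent takes the position of the first small feed it absorbs.
            parent = Feed(parent_title)
            parents[parent_title] = parent
            merged.append(parent)
        parent.articles.extend(feed.articles)
    return merged
```

Inside a recipe, the overridden parse_feeds would hand the list it gets from BasicNewsRecipe.parse_feeds(self) to a helper like this and return the result.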
Old 11-06-2011, 05:24 PM   #4
NotTaken
another update

Thanks kovid, I really appreciate all the work you do on calibre.

I've made a few changes to the recipe:
  • Added a few extra guards to make it more robust
  • Got rid of the "Caption:" label
  • Added a postprocess parse to remove the captions left behind when an image fetch failed
  • Stop adding empty captions
  • Changed "feedsportal" rss feeds to Independent.co.uk equivalents

I noticed that their web server frequently times out when processing a request (probably teething problems with the new site). This resulted in a lot of captions being added without an image. Hopefully the postprocess parse should take care of this issue.
Attached Files
File Type: zip independent-new.recipe.zip (2.4 KB, 230 views)

Last edited by NotTaken; 11-06-2011 at 05:37 PM.
Old 11-06-2011, 09:12 PM   #5
kovidgoyal
For the timeouts, try adding delay=1 to your recipe. That will greatly slow down the download but it might prevent the timeouts.
Old 11-07-2011, 06:20 PM   #6
NotTaken
Quote:
Originally Posted by kovidgoyal
For the timeouts, try adding delay=1 to your recipe. That will greatly slow down the download but it might prevent the timeouts.
I was getting the timeouts whilst browsing their site normally; it just seemed to refuse to serve some pictures.

Slight update to the recipe to add star images to the reviews.
Attached Files
File Type: zip independent-new.recipe.zip (2.9 KB, 239 views)
Old 11-08-2011, 10:35 PM   #7
mufc
Connoisseur
Posts: 99
Karma: 170
Join Date: Nov 2010
Location: Airdrie Alberta
Device: Sony 650
It is great to finally get a working recipe for The Independent (my fav read). What part of the recipe can I remove to drop the pictures and picture captions?
A way to get the print version would be even better.
This is the difference at the end of the URL:

html

html?printService=print

Last edited by mufc; 11-08-2011 at 11:04 PM.
Old 11-08-2011, 11:32 PM   #8
mufc
Had some success but

Spoiler:

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1289709253(BasicNewsRecipe):
    title = u'The Independent My Test'
    oldest_article = 7
    max_articles_per_feed = 200
    summary_length = 100
    use_embedded_content = False
    no_stylesheets = True
    auto_cleanup = True
    encoding = 'utf8'

    remove_javascript = True

    extra_css = '''
        h1{font-family:Arial,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,sans-serif; font-weight:normal;font-size:small;}
        body{font-family:Arial,sans-serif;font-size:small;}
        p{font-family:Arial,sans-serif;font-size:small;line-height: 1.2;margin-bottom: 0;margin-left: 2pt;
        margin-right: 2pt;margin-top: 0;padding-left: 0;padding-right: 0;text-align: left;text-indent: 1.5em}
        '''

    # Each entry is a (section title, RSS feed URL) pair.
    feeds = [
        (u'News - UK',
         u'http://www.independent.co.uk/news/uk/?service=rss'),
        (u'News - World',
         u'http://www.independent.co.uk/news/world/?service=rss'),
        (u'News - People Profiles',
         u'http://www.independent.co.uk/news/people/profiles/?service=rss'),
        (u'News - People',
         u'http://www.independent.co.uk/news/people/?service=rss'),
        (u'News - Media',
         u'http://www.independent.co.uk/news/media/?service=rss'),
        (u'Opinion',
         u'http://www.independent.co.uk/opinion/?service=rss'),
        (u'Sport - Football',
         u'http://www.independent.co.uk/sport/football/?service=rss'),
        (u'Sport - Football Comments',
         u'http://www.independent.co.uk/sport/football/news-and-comment/?service=rss'),
        (u'Life & Style - Health and Families',
         u'http://www.independent.co.uk/life-style/health-and-families/?service=rss'),
        (u'Life & Style - Gadgets & Tech',
         u'http://www.independent.co.uk/life-style/gadgets-and-tech/?service=rss'),
        (u'Arts & Ents - Music',
         u'http://www.independent.co.uk/arts-entertainment/music/?service=rss'),
        (u'Arts & Ents - Comedy',
         u'http://www.independent.co.uk/arts-entertainment/comedy/?service=rss'),
    ]

    def print_version(self, url):
        # Request the print page by adding the query string after 'html'.
        return url.replace('html', 'html?printService=print')


This works on the articles I get. The problem is that I am not getting many articles.
For example, before trying to get the print version I was getting 100 articles for UK News. With the recipe change I get 5. Some categories have none.
Before anyone states the obvious: yes, "oldest article 7" days is a bit much.
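
One detail of the URL rewrite worth noting: str.replace changes every occurrence of 'html' in the URL, not just the file extension. The sketch below only touches a trailing .html. The function name print_url and the example URLs are made up for illustration, and this does nothing about the missing articles themselves.

```python
def print_url(url):
    """Append the print query string only when the URL ends in .html."""
    if url.endswith('.html'):
        return url + '?printService=print'
    # Anything else (feeds, section pages, URLs with a query) is left alone.
    return url


# Hypothetical article URL, used only to show the rewrite.
example = 'http://www.independent.co.uk/news/uk/example-123.html'
print(print_url(example))
```

In a recipe, print_version(self, url) could simply return the result of a helper like this.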

Last edited by mufc; 11-08-2011 at 11:35 PM.
Old 11-09-2011, 01:05 PM   #9
NotTaken
Looks like they try to prevent direct linking to the print pages. To remove the images you can always change remove to True in the following piece of code:
Code:
            #images
            pattern = re.compile('slideshow')    
            if (pattern.search(item['class'])) is not None:
                remove = False
You'll get the added bonus of a subtitle and review ratings over the print version.
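
For anyone who would rather not edit that line by hand each time, here is a minimal sketch of driving the same decision from a single flag. The thread only shows a fragment of the recipe, so everything here is an assumption: the function name, the fetch_images flag, and the plain dict standing in for a tag's attributes are all hypothetical.

```python
import re

SLIDESHOW = re.compile('slideshow')


def should_drop_images(attrs, fetch_images=True):
    """Return True when an element is an image block and images are disabled.

    `attrs` is a plain dict standing in for a tag's attributes.
    """
    # .get() avoids a KeyError on elements that have no class attribute.
    css_class = attrs.get('class') or ''
    if isinstance(css_class, (list, tuple)):
        css_class = ' '.join(css_class)
    is_image_block = SLIDESHOW.search(css_class) is not None
    return is_image_block and not fetch_images
```

With fetch_images set to False, slideshow elements are flagged for removal; elements without a slideshow class are never flagged by this check.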
Old 11-09-2011, 09:28 PM   #10
mufc
That works great. Thanks a million! I thought doing that would leave the photo caption, but that is gone too.
Old 11-11-2011, 02:05 PM   #11
NotTaken
A few more changes

Some changes:
  • Filter feeds with title prefix 'Video:' - most only have one line of text
  • Prevented duplicated content by setting recursions to zero and checking url existence against a list of feeds already processed
  • Removed line breaks and empty paragraphs from the storyTop section as these cause unsightly white space (tries to sensibly replace line breaks between text with spaces)
  • Try to fetch extra images related to the content when labeled with ...Click here for graphic... (this may need improving if the pattern changes wildly) - see this page for an example
  • Added some flags at the top to disable image fetching
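
The first two changes in that list (skipping 'Video:' items and avoiding duplicates) can be sketched roughly as follows. This is not the code from the attached recipe: the function name and the (title, url) tuples are assumptions chosen to keep the example self-contained.

```python
def filter_articles(articles, seen_urls):
    """Drop 'Video:' items and any URL that was already processed.

    `articles` is a list of (title, url) pairs; `seen_urls` is a set that
    is shared across feeds and updated in place.
    """
    kept = []
    for title, url in articles:
        if title.startswith('Video:'):
            continue  # video pages mostly carry a single line of text
        if url in seen_urls:
            continue  # already included under an earlier feed
        seen_urls.add(url)
        kept.append((title, url))
    return kept
```

Passing the same set for every feed is what removes an article that appears in more than one category.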

I was thinking about removing the advertorial articles (see here) but could not see a clean way of doing this. As far as I am aware, they are only identifiable by the text 'Advertorial Feature ' in <div class=" ... strapLine"> so I was thinking of returning None in preprocess_soup if the text was found (this causes an AttributeError exception to be raised). Can anyone think of a nicer solution?
Attached Files
File Type: zip independent.recipe.zip (4.6 KB, 198 views)

Last edited by NotTaken; 11-11-2011 at 02:09 PM.
Old 11-11-2011, 09:45 PM   #12
kovidgoyal
Returning None in preprocess is fine. If you wish to be more explicit about it you could raise an Exception, though I don't recall if the download system discards exceptions in that method.
Old 11-13-2011, 07:45 AM   #13
NotTaken
update

Thanks. A few updates:
  • Fixed some flawed logic in detection of empty paragraphs in storyTop (didn't consider nesting)
  • Updated article graphics regex to be more generic
  • Filtered advertorial features
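
For reference, detecting an advertorial from the strapLine text discussed earlier could look something like the sketch below. It uses only the standard library's html.parser rather than the soup object a recipe receives, and the class and function names are hypothetical, so treat it as an illustration of the check and not as the code in the attachment.

```python
from html.parser import HTMLParser


class StrapLineFinder(HTMLParser):
    """Collect the text inside any div whose class contains 'strapLine'."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than zero while inside a strapLine div
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == 'div':
                self.depth += 1  # nested div inside the strapLine div
        elif tag == 'div' and 'strapLine' in (dict(attrs).get('class') or ''):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == 'div':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def is_advertorial(html):
    finder = StrapLineFinder()
    finder.feed(html)
    return 'Advertorial Feature' in ''.join(finder.text)
```

A recipe could then return None from its preprocessing step whenever a check like this is true, which is the approach agreed on above.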
Attached Files
File Type: zip independent.recipe.zip (4.7 KB, 184 views)
Old 11-25-2011, 06:44 PM   #14
NotTaken
Fixed an issue whereby a KeyError was raised on pages with embedded flash videos. These pages also had some other crud, which I have also removed.
Attached Files
File Type: zip independent.recipe.zip (4.8 KB, 193 views)
Old 12-16-2011, 08:31 AM   #15
dasym
Connoisseur
Posts: 50
Karma: 10
Join Date: Dec 2008
Location: Scotland
Device: Kindle DX, Kindle. iPad 3
I'm afraid neither the built-in recipe nor this one is working for me for The Independent any more. It downloads very few links and typically gives error messages in the log such as:

Could not fetch link http://www.independent.co.uk/news/uk...s-6277966.html
Traceback (most recent call last):
File "site-packages\calibre\web\fetch\simple.py", line 432, in process_links
File "site-packages\calibre\web\fetch\simple.py", line 193, in get_soup
File "c:\docume~1\dave\locals~1\temp\calibre_0.8.31_tmp_uf67cn\2zkxsh_recipes\recipe0.py", line 202, in preprocess_html
File "c:\docume~1\dave\locals~1\temp\calibre_0.8.31_tmp_uf67cn\2zkxsh_recipes\recipe0.py", line 275, in _insertRatingStars
IndexError: list index out of range

I wish I could help but modifying the recipes is a little beyond me.