11-07-2010, 04:49 PM | #1 |
Member
Posts: 11
Karma: 14
Join Date: Nov 2010
Device: none
|
European Voice (by The Economist) - Problem with skipping
Hi, I fiddled around with Python today, trying to understand the mechanisms of the calibre framework.
The problem with European Voice is that several articles linked from their feed can only be read by subscribers. I want to skip these articles once it becomes clear that they contain no content. I thought the preprocess_html method could be used: if the word "Subscribers" is in the soup, I return None instead of the soup object. Code:
from calibre.web.feeds.news import BasicNewsRecipe

class EuropeanVoice(BasicNewsRecipe):
    title = u'European Voice'
    oldest_article = 14
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.europeanvoice.com/Css/images/logo.gif'
    language = 'en'
    keep_only_tags = [dict(name='div', attrs={'id':'articleLeftColumn'})]
    remove_tags = [dict(name='div', attrs={'id':'BreadCrump'})]
    feeds = [
        (u'Whole site ',u'http://www.europeanvoice.com/Rss/2.xml'),
        (u'News and analysis',u'http://www.europeanvoice.com/Rss/6.xml'),
        (u'Comment',u'http://www.europeanvoice.com/Rss/7.xml'),
        (u'Special reports',u'http://www.europeanvoice.com/Rss/5.xml'),
        (u'People',u'http://www.europeanvoice.com/Rss/8.xml'),
        (u'Career',u'http://www.europeanvoice.com/Rss/11.xml'),
        (u'Policies',u'http://www.europeanvoice.com/Rss/4.xml'),
        (u'EVents',u'http://www.europeanvoice.com/Rss/10.xml'),
        (u'Policies - Economics',u'http://www.europeanvoice.com/Rss/31.xml'),
        (u'Policies - Business',u'http://www.europeanvoice.com/Rss/19.xml'),
        (u'Policies - Trade',u'http://www.europeanvoice.com/Rss/25.xml'),
        (u'Policies - Information society',u'http://www.europeanvoice.com/Rss/20.xml'),
        (u'Policies - Energy',u'http://www.europeanvoice.com/Rss/15.xml'),
        (u'Policies - Transport',u'http://www.europeanvoice.com/Rss/18.xml'),
        (u'Policies - Climate change',u'http://www.europeanvoice.com/Rss/16.xml'),
        (u'Policies - Environment',u'http://www.europeanvoice.com/Rss/17.xml'),
        (u'Policies - Farming & food',u'http://www.europeanvoice.com/Rss/23.xml'),
        (u'Policies - Health & society',u'http://www.europeanvoice.com/Rss/24.xml'),
        (u'Policies - Justice',u'http://www.europeanvoice.com/Rss/29.xml'),
        (u'Policies - Foreign affairs',u'http://www.europeanvoice.com/Rss/27.xml')
    ]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
    '''

    def print_version(self, url):
        # Use the printer-friendly page for each article.
        return url + '?bPrint=1'

    def preprocess_html(self, soup):
        self.log('\t checking for subscriber only content')
        denied = soup.findAll(True,text='Subscribers')
        if denied:
            self.log('\t skipped, because content can only be seen with subscription')
            # Attempt to skip the article by not returning the soup.
            return None
        return soup
|
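The check described above boils down to "does the downloaded page contain a paywall marker?". Here is a minimal, calibre-independent sketch of that test, assuming the marker is the word "Subscribers" as in the recipe. The function name `looks_subscriber_only`, the `PAYWALL_MARKER` pattern and the sample HTML strings are made up for illustration and are not part of calibre or of the European Voice site.

```python
import re

# Hypothetical marker; the recipe above searches for the text 'Subscribers'.
PAYWALL_MARKER = re.compile(r'\bsubscribers\b', re.IGNORECASE)

def looks_subscriber_only(html):
    """Return True if the visible page text contains the paywall marker word."""
    # Strip tags crudely so that words inside attributes do not trigger a match.
    text = re.sub(r'<[^>]+>', ' ', html)
    return bool(PAYWALL_MARKER.search(text))

# Made-up sample pages, only for demonstration.
free_page = '<div id="articleLeftColumn"><p>Full article text.</p></div>'
locked_page = '<div id="articleLeftColumn"><p>Subscribers only</p></div>'
```

Note that this regular expression matches the word anywhere in the visible text, which may be broader than what the findAll call in the recipe matches, so it should be tested against real pages before being relied on.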
11-07-2010, 09:46 PM | #2 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
https://www.mobileread.com/forums/sho...62&postcount=6 |
|
11-08-2010, 03:20 PM | #3 |
Member
Posts: 11
Karma: 14
Join Date: Nov 2010
Device: none
|
That is a different scenario, because there the articles can be excluded based on the information delivered in the feed. I need to download an article to see if it should be skipped. That information is in the content, not in a character string in the article title or in part of the URL, as is the case in the example you posted.
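For contrast, filtering on feed data alone can be sketched without downloading any article. The snippet below is a stand-alone illustration only: the feed and article objects are stand-ins built with `SimpleNamespace`, `drop_by_title` is a hypothetical helper, and the word 'premium' and the example.com URLs are invented. In a real recipe a loop like this would typically sit in an overridden parse_feeds, working on the feed objects calibre provides.

```python
from types import SimpleNamespace

def drop_by_title(feeds, banned_word):
    """Remove articles whose title contains banned_word (case-insensitive)."""
    needle = banned_word.lower()
    for feed in feeds:
        # Build a new list instead of mutating the list while iterating over it.
        feed.articles = [a for a in feed.articles
                         if needle not in a.title.lower()]
    return feeds

# Stand-in data; a real recipe would receive these objects from calibre.
feeds = [SimpleNamespace(title='Comment', articles=[
    SimpleNamespace(title='Free piece', url='http://example.com/1'),
    SimpleNamespace(title='Premium: budget talks', url='http://example.com/2'),
])]

remaining = [a.title for a in drop_by_title(feeds, 'premium')[0].articles]
```

This only works when the feed itself carries the deciding information; for European Voice the deciding information is in the article body, which is why the check has to happen after download.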
|
11-08-2010, 03:56 PM | #4 |
Zealot
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
|
this was actually a problem for me and a solution for you.
Code:
def preprocess_html(self, soup):
    self.log('\t checking for subscriber only content')
    denied = soup.findAll(True,text='Subscribers')
    print denied
    return soup
edit: This will not do the trick. It will keep only the articles that need to be read as a subscriber. You need to invert the find, in other words find a constant attribute in the other articles. Write back if you didn't get it (it's late here and I am not thinking straight).
Last edited by marbs; 11-08-2010 at 04:06 PM.
|
11-08-2010, 04:34 PM | #5 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Yes, it needs to be reversed. I believe he wants to return None on the articles he wants skipped. If an article is empty, Calibre will skip it, by default.
|
11-09-2010, 12:30 AM | #6 |
Zealot
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
|
good morning!
this is the answer: Code:
def preprocess_html(self, soup):
    self.log('\t checking for subscriber only content')
    denied = soup.findAll(True,text='Subscribers')
    if denied:
        self.log('\t skipped, because content can only be seen with subscription')
        print somthing
    return soup
|
11-09-2010, 12:45 PM | #7 |
Member
Posts: 11
Karma: 14
Join Date: Nov 2010
Device: none
|
That is exactly what I wrote, except that you have "print somthing" instead of return None.
Your example returns soup unconditionally, so I don't understand why that article would be skipped. What I wrote does work to some extent, but the skipped articles are still in the index and displayed... |
11-09-2010, 01:07 PM | #8 |
Member
Posts: 11
Karma: 14
Join Date: Nov 2010
Device: none
|
You're right, your solution works.
But I don't understand why "print undefined_variable" has the wanted result while "return None" does not. Can you explain that? What do I need to do to get this recipe included in the official calibre release? |
11-09-2010, 03:32 PM | #9 |
creator of calibre
Posts: 43,866
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Just post the final recipe here.
|
11-09-2010, 03:58 PM | #10 |
Zealot
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
|
You were returning None. That means that when calibre looked for the article, it found None, and that is still something: a None where a soup instance should be.
In my solution the recipe got to a stage where it said something like: article = preprocess_html(soup). It expected to get a soup instance back; instead preprocess_html exited in the middle with an error (there is no variable called "something"). When calibre went to collect all the articles from the places they were saved in, it came to the article we messed up and said "this does not exist at all, I must not have downloaded this file", so it was not indexed.
If you want a recipe built in, post the full code here with a "hey Kovid, this is ready to be built in" or something along those lines. edit: I see Kovid beat me to it, so I'll throw in a "hey Kovid" myself. hey Kovid |
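The difference described above can be illustrated with a toy model. This is not calibre's actual download code; `fetch_all`, `return_none` and `raise_error` are hypothetical names, and the sketch simply assumes a downloader that drops an article when its preprocessing step raises but stores whatever value is returned otherwise. It also raises an explicit exception instead of relying on an undefined variable, which has the same effect but states the intent.

```python
def fetch_all(pages, preprocess):
    """Toy downloader: keep a page only if preprocess finishes without error."""
    index = []
    for url, html in pages:
        try:
            result = preprocess(html)
        except Exception as err:
            # A failed article never reaches the index.
            print('skipping %s: %s' % (url, err))
            continue
        # None is stored like any other result, so the entry stays listed.
        index.append((url, result))
    return index

def return_none(html):
    # Mirrors the first attempt: hand back None for locked pages.
    return None if 'Subscribers' in html else html

def raise_error(html):
    # Mirrors the working attempt, but with a deliberate, explicit exception.
    if 'Subscribers' in html:
        raise ValueError('subscriber-only content')
    return html

pages = [('a', 'Full text'), ('b', 'Subscribers only')]
kept_with_none = [url for url, _ in fetch_all(pages, return_none)]
kept_with_raise = [url for url, _ in fetch_all(pages, raise_error)]
```

In this model the locked page 'b' stays in the index when None is returned and disappears when an exception is raised. Newer calibre releases document an abort_article method for skipping an article from inside the preprocess methods; whether it is available depends on the calibre version, so check the API documentation before using it.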
11-10-2010, 01:43 PM | #11 |
Member
Posts: 11
Karma: 14
Join Date: Nov 2010
Device: none
|
hey Kovid, this is ready to be built in ;-)
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class EuropeanVoice(BasicNewsRecipe):
    title = u'European Voice'
    oldest_article = 14
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.europeanvoice.com/Css/images/logo.gif'
    language = 'en'
    keep_only_tags = [dict(name='div', attrs={'id':'articleLeftColumn'})]
    remove_tags = [dict(name='div', attrs={'id':'BreadCrump'})]
    feeds = [
        (u'Whole site ',u'http://www.europeanvoice.com/Rss/2.xml'),
        (u'News and analysis',u'http://www.europeanvoice.com/Rss/6.xml'),
        (u'Comment',u'http://www.europeanvoice.com/Rss/7.xml'),
        (u'Special reports',u'http://www.europeanvoice.com/Rss/5.xml'),
        (u'People',u'http://www.europeanvoice.com/Rss/8.xml'),
        (u'Career',u'http://www.europeanvoice.com/Rss/11.xml'),
        (u'Policies',u'http://www.europeanvoice.com/Rss/4.xml'),
        (u'EVents',u'http://www.europeanvoice.com/Rss/10.xml'),
        (u'Policies - Economics',u'http://www.europeanvoice.com/Rss/31.xml'),
        (u'Policies - Business',u'http://www.europeanvoice.com/Rss/19.xml'),
        (u'Policies - Trade',u'http://www.europeanvoice.com/Rss/25.xml'),
        (u'Policies - Information society',u'http://www.europeanvoice.com/Rss/20.xml'),
        (u'Policies - Energy',u'http://www.europeanvoice.com/Rss/15.xml'),
        (u'Policies - Transport',u'http://www.europeanvoice.com/Rss/18.xml'),
        (u'Policies - Climate change',u'http://www.europeanvoice.com/Rss/16.xml'),
        (u'Policies - Environment',u'http://www.europeanvoice.com/Rss/17.xml'),
        (u'Policies - Farming & food',u'http://www.europeanvoice.com/Rss/23.xml'),
        (u'Policies - Health & society',u'http://www.europeanvoice.com/Rss/24.xml'),
        (u'Policies - Justice',u'http://www.europeanvoice.com/Rss/29.xml'),
        (u'Policies - Foreign affairs',u'http://www.europeanvoice.com/Rss/27.xml')
    ]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
    '''

    def print_version(self, url):
        # Use the printer-friendly page for each article.
        return url + '?bPrint=1'

    def preprocess_html(self, soup):
        self.log('\t checking for subscriber only content')
        denied = soup.findAll(True,text='Subscribers')
        if denied:
            self.log('\t skipped, because content can only be seen with subscription')
            # 'something' is deliberately undefined: the resulting NameError
            # makes the download of this article fail, so it is left out.
            print something
        return soup
|
11-10-2010, 01:56 PM | #12 |
creator of calibre
Posts: 43,866
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Will be in next release
|