European Voice (by The Economist) - Problem with skipping

malfi · 11-07-2010, 04:49 PM

Hi, I fiddled around with Python today, trying to understand the mechanisms of the calibre framework.

The problem with European Voice is that several articles linked from their feed can only be read by subscribers. I want to skip these articles once it becomes clear that they don't contain any content. I thought the preprocess_html method could be used: if the word "Subscribers" is in the soup, I don't return the soup object but None.

Code:

class EuropeanVoice(BasicNewsRecipe):
    title          = u'European Voice'
    oldest_article = 14
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.europeanvoice.com/Css/images/logo.gif'
    language = 'en'
    keep_only_tags    = [dict(name='div', attrs={'id':'articleLeftColumn'})]
    remove_tags    = [dict(name='div', attrs={'id':'BreadCrump'})]
    feeds          = [
        (u'Whole site ',u'http://www.europeanvoice.com/Rss/2.xml'),
        (u'News and analysis',u'http://www.europeanvoice.com/Rss/6.xml'),
        (u'Comment',u'http://www.europeanvoice.com/Rss/7.xml'),
        (u'Special reports',u'http://www.europeanvoice.com/Rss/5.xml'),
        (u'People',u'http://www.europeanvoice.com/Rss/8.xml'),
        (u'Career',u'http://www.europeanvoice.com/Rss/11.xml'),
        (u'Policies',u'http://www.europeanvoice.com/Rss/4.xml'),
        (u'EVents',u'http://www.europeanvoice.com/Rss/10.xml'),
        (u'Policies - Economics',u'http://www.europeanvoice.com/Rss/31.xml'),
        (u'Policies - Business',u'http://www.europeanvoice.com/Rss/19.xml'),
        (u'Policies - Trade',u'http://www.europeanvoice.com/Rss/25.xml'),
        (u'Policies - Information society',u'http://www.europeanvoice.com/Rss/20.xml'),
        (u'Policies - Energy',u'http://www.europeanvoice.com/Rss/15.xml'),
        (u'Policies - Transport',u'http://www.europeanvoice.com/Rss/18.xml'),
        (u'Policies - Climate change',u'http://www.europeanvoice.com/Rss/16.xml'),
        (u'Policies - Environment',u'http://www.europeanvoice.com/Rss/17.xml'),
        (u'Policies - Farming & food',u'http://www.europeanvoice.com/Rss/23.xml'),
        (u'Policies - Health & society',u'http://www.europeanvoice.com/Rss/24.xml'),
        (u'Policies - Justice',u'http://www.europeanvoice.com/Rss/29.xml'),
        (u'Policies - Foreign affairs',u'http://www.europeanvoice.com/Rss/27.xml')
    ]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''
    def print_version(self, url):
        return url + '?bPrint=1'

    def preprocess_html(self, soup):
        self.log('\t checking for subscriber only content')
        denied = soup.findAll(True,text='Subscribers')
        if denied:
            self.log('\t skipped, because content can only be seen with subscription')
            return None
        return soup

That doesn't really work: these articles are not downloaded, but they are still in the index. How can I skip articles completely?

Starson17 · 11-07-2010, 09:46 PM

Quote:

Originally Posted by malfi

That doesn't really work: these articles are not downloaded, but they are still in the index. How can I skip articles completely?

See here:
https://www.mobileread.com/forums/sho...62&postcount=6

malfi · 11-08-2010, 03:20 PM

That is a different scenario, because there the articles can be excluded based on the information delivered in the feed. I need to download an article to see whether it should be skipped. That information is in the content; it is not a character string in the article title or part of the URL, as is the case in your posted example.
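
To make the distinction concrete, here is a minimal sketch in plain Python. The article records, the `/premium/` URL marker and both function names are made up for illustration; nothing here is calibre API. The first check can run on feed data alone, the second needs the downloaded page:

```python
# Hypothetical article records; a real feed parser would supply these.
articles = [
    {"title": "EU budget talks", "url": "http://example.com/a1"},
    {"title": "Trade deal", "url": "http://example.com/premium/a2"},
]


def keep_by_feed_metadata(article):
    """Feed-level filter: decides from the URL alone, no download needed."""
    return "/premium/" not in article["url"]


def keep_by_content(html):
    """Content-level filter: only possible once the page has been fetched."""
    return "Subscribers" not in html


# Feed-level filtering can happen before any article is downloaded.
kept = [a for a in articles if keep_by_feed_metadata(a)]
```

European Voice falls into the second case, which is why a filter applied to the parsed feed cannot help here.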

marbs · 11-08-2010, 03:56 PM

This was actually a problem for me and a solution for you.

Code:

    def preprocess_html(self, soup):
           self.log('\t checking for subscriber only content')
           denied = soup.findAll(True,text='Subscribers')
           print denied
           return soup

I think that might do the trick.

Edit:
It will not do the trick. It will keep only the ones that need to be read as a subscriber. You need to invert the find; in other words, find a constant attribute in the other articles.
Write back if you didn't get it (it's late here and I am not thinking straight).

Starson17 · 11-08-2010, 04:34 PM

Quote:

Originally Posted by marbs

You need to invert the find; in other words, find a constant attribute in the other articles.
Write back if you didn't get it (it's late here and I am not thinking straight).

Yes, it needs to be reversed. I believe he wants to return None on the articles he wants skipped. If an article is empty, Calibre will skip it, by default.

marbs · 11-09-2010, 12:30 AM

good morning!

this is the answer:

Code:

    def preprocess_html(self, soup):
           self.log('\t checking for subscriber only content')
           denied = soup.findAll(True,text='Subscribers')
           if denied:
                self.log('\t skipped, because content can only be seen with subscription')
                print somthing
           return soup

copy this to the code and this should work.

malfi · 11-09-2010, 12:45 PM

That is exactly what I wrote, except that you have "print somthing" instead of "return None".

Your example returns soup unconditionally, so I don't understand why that article would be skipped.

What I wrote does work to some extent, but the skipped articles are still in the index and displayed...

malfi · 11-09-2010, 01:07 PM

You're right, your solution works.

But I don't understand why "print undefined_variable" has the desired result while "return None" does not. Can you explain that?

What do I need to do to get this recipe included in the official calibre release?

kovidgoyal · 11-09-2010, 03:32 PM

Just post the final recipe here.

marbs · 11-09-2010, 03:58 PM

You were returning None. That means that when calibre went looking for the article, it found None. That is still something: a None in place of a soup.
In my solution the recipe got to a stage where it said something like:
article=preprocess_html(soup)
article was expecting to get a soup instance back; instead preprocess_html exited in the middle (there is no variable called "something").

When calibre went to collect all the articles from the places they were saved in, it got to the article we messed up and said "this does not exist at all, I must not have downloaded this file". So it was not indexed...
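
The explanation above can be illustrated with a toy model. This is not calibre's actual download code; `fetch_all`, `returns_none` and `raises_error` are hypothetical names, and the loop only assumes what is described here: an article whose processing raises an error is treated as not downloaded, while a None result is still recorded.

```python
def fetch_all(urls, preprocess):
    """Toy download loop: an article is indexed unless processing raises."""
    index = []
    for url in urls:
        try:
            soup = preprocess(url)
        except Exception as err:
            # Failed download: the article never reaches the index.
            print("skipping %s: %r" % (url, err))
            continue
        # A None result is still a result, so the article gets indexed.
        index.append((url, soup))
    return index


def returns_none(url):
    """Mimics 'return None' for subscriber-only pages."""
    return None if "paywalled" in url else "<html>...</html>"


def raises_error(url):
    """Mimics 'print something' with an undefined name."""
    if "paywalled" in url:
        raise NameError("name 'something' is not defined")
    return "<html>...</html>"


urls = ["free-article", "paywalled-article"]
indexed_with_none = fetch_all(urls, returns_none)
indexed_with_error = fetch_all(urls, raises_error)
```

With `returns_none` both articles end up in the index (one of them empty); with `raises_error` only the free article does.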

If you want a recipe built in, post the full code here with a "hey Kovid, this is ready to be built in" or something along those lines.

Edit:
I see Kovid beat me to it. So I'll throw in a "hey Kovid" myself.
hey Kovid

malfi · 11-10-2010, 01:43 PM

hey Kovid, this is ready to be built in ;-)

Code:

class EuropeanVoice(BasicNewsRecipe):
    title          = u'European Voice'
    oldest_article = 14
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.europeanvoice.com/Css/images/logo.gif'
    language = 'en'
    keep_only_tags    = [dict(name='div', attrs={'id':'articleLeftColumn'})]
    remove_tags    = [dict(name='div', attrs={'id':'BreadCrump'})]
    feeds          = [
        (u'Whole site ',u'http://www.europeanvoice.com/Rss/2.xml'),
        (u'News and analysis',u'http://www.europeanvoice.com/Rss/6.xml'),
        (u'Comment',u'http://www.europeanvoice.com/Rss/7.xml'),
        (u'Special reports',u'http://www.europeanvoice.com/Rss/5.xml'),
        (u'People',u'http://www.europeanvoice.com/Rss/8.xml'),
        (u'Career',u'http://www.europeanvoice.com/Rss/11.xml'),
        (u'Policies',u'http://www.europeanvoice.com/Rss/4.xml'),
        (u'EVents',u'http://www.europeanvoice.com/Rss/10.xml'),
        (u'Policies - Economics',u'http://www.europeanvoice.com/Rss/31.xml'),
        (u'Policies - Business',u'http://www.europeanvoice.com/Rss/19.xml'),
        (u'Policies - Trade',u'http://www.europeanvoice.com/Rss/25.xml'),
        (u'Policies - Information society',u'http://www.europeanvoice.com/Rss/20.xml'),
        (u'Policies - Energy',u'http://www.europeanvoice.com/Rss/15.xml'),
        (u'Policies - Transport',u'http://www.europeanvoice.com/Rss/18.xml'),
        (u'Policies - Climate change',u'http://www.europeanvoice.com/Rss/16.xml'),
        (u'Policies - Environment',u'http://www.europeanvoice.com/Rss/17.xml'),
        (u'Policies - Farming & food',u'http://www.europeanvoice.com/Rss/23.xml'),
        (u'Policies - Health & society',u'http://www.europeanvoice.com/Rss/24.xml'),
        (u'Policies - Justice',u'http://www.europeanvoice.com/Rss/29.xml'),
        (u'Policies - Foreign affairs',u'http://www.europeanvoice.com/Rss/27.xml')
    ]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''
    def print_version(self, url):
        return url + '?bPrint=1'

    def preprocess_html(self, soup):
        self.log('\t checking for subscriber only content')
        denied = soup.findAll(True,text='Subscribers')
        if denied:
            self.log('\t skipped, because content can only be seen with subscription')
            # 'something' is deliberately undefined: the NameError aborts this
            # article's download, so it never reaches the index.
            print something
        return soup
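
The `print something` line in the recipe above only works because it fails with a NameError. A more explicit variant would raise an exception on purpose. The sketch below is an assumption-laden illustration, not the recipe that was submitted: `SubscriberOnly` and `FakeSoup` are hypothetical names, `FakeSoup` merely stands in for the parsed page so the example runs without calibre, and it assumes calibre reacts to any exception from preprocess_html the way this thread describes for the NameError.

```python
class SubscriberOnly(Exception):
    """Hypothetical marker exception for subscriber-only articles."""


class FakeSoup(object):
    """Stand-in for the parsed page, only for this demo."""

    def __init__(self, strings):
        self.strings = strings

    def findAll(self, name=None, text=None):
        # Like the recipe's call: return the strings that equal `text`.
        return [s for s in self.strings if s == text]


def preprocess_html(soup, log=lambda msg: None):
    """Same check as in the recipe, written as a plain function."""
    log('\t checking for subscriber only content')
    if soup.findAll(True, text='Subscribers'):
        # Raise on purpose instead of relying on an undefined name.
        raise SubscriberOnly('content can only be seen with subscription')
    return soup


free_page = FakeSoup(['European Voice', 'Full article text'])
locked_page = FakeSoup(['European Voice', 'Subscribers'])

result = preprocess_html(free_page)
try:
    preprocess_html(locked_page)
    was_skipped = False
except SubscriberOnly:
    was_skipped = True
```

As far as I know, later calibre releases document a `BasicNewsRecipe.abort_article()` method meant for this situation; check the API documentation of your version before relying on it.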

kovidgoyal · 11-10-2010, 01:56 PM

Will be in next release
