Old 11-07-2010, 04:49 PM   #1
malfi
Member
malfi began at the beginning.
 
Posts: 11
Karma: 14
Join Date: Nov 2010
Device: none
European Voice (by The Economist) - Problem with skipping

Hi, I fiddled around with Python today, trying to understand the mechanisms of the calibre framework.

The problem with European Voice is that several articles linked from their feed can only be read by subscribers. I want to skip these articles once it becomes clear that they don't contain any content. I thought the preprocess_html method could be used for this: if the word "Subscribers" is in the soup, I return None instead of the soup object.

Code:
class EuropeanVoice(BasicNewsRecipe):
    title          = u'European Voice'
    oldest_article = 14
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.europeanvoice.com/Css/images/logo.gif'
    language = 'en'
    keep_only_tags    = [dict(name='div', attrs={'id':'articleLeftColumn'})]
    remove_tags    = [dict(name='div', attrs={'id':'BreadCrump'})]
    feeds          = [
                        (u'Whole site ',u'http://www.europeanvoice.com/Rss/2.xml'),
                          (u'News and analysis',u'http://www.europeanvoice.com/Rss/6.xml'),
                          (u'Comment',u'http://www.europeanvoice.com/Rss/7.xml'),
                          (u'Special reports',u'http://www.europeanvoice.com/Rss/5.xml'),
                          (u'People',u'http://www.europeanvoice.com/Rss/8.xml'),
                          (u'Career',u'http://www.europeanvoice.com/Rss/11.xml'),
                          (u'Policies',u'http://www.europeanvoice.com/Rss/4.xml'),
                          (u'EVents',u'http://www.europeanvoice.com/Rss/10.xml'),
                          (u'Policies - Economics',u'http://www.europeanvoice.com/Rss/31.xml'),
                          (u'Policies - Business',u'http://www.europeanvoice.com/Rss/19.xml'),
                          (u'Policies - Trade',u'http://www.europeanvoice.com/Rss/25.xml'),
                          (u'Policies - Information society',u'http://www.europeanvoice.com/Rss/20.xml'),
                          (u'Policies - Energy',u'http://www.europeanvoice.com/Rss/15.xml'),
                          (u'Policies - Transport',u'http://www.europeanvoice.com/Rss/18.xml'),
                          (u'Policies - Climate change',u'http://www.europeanvoice.com/Rss/16.xml'),
                          (u'Policies - Environment',u'http://www.europeanvoice.com/Rss/17.xml'),
                          (u'Policies - Farming & food',u'http://www.europeanvoice.com/Rss/23.xml'),
                          (u'Policies - Health & society',u'http://www.europeanvoice.com/Rss/24.xml'),
                          (u'Policies - Justice',u'http://www.europeanvoice.com/Rss/29.xml'),
                          (u'Policies - Foreign affairs',u'http://www.europeanvoice.com/Rss/27.xml')
                     ]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''
    def print_version(self, url):
          return url + '?bPrint=1'
    def preprocess_html(self, soup):
           self.log('\t checking for subscriber only content')
           denied = soup.findAll(True,text='Subscribers')
           if denied:
                self.log('\t skipped, because content can only be seen with subscription')
                return None
           return soup
That doesn't really work: these articles are not downloaded, but they still appear in the index. How can I skip articles completely?
Old 11-07-2010, 09:46 PM   #2
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by malfi View Post
That doesn't really work: these articles are not downloaded, but they still appear in the index. How can I skip articles completely?
See here:
https://www.mobileread.com/forums/sho...62&postcount=6
Old 11-08-2010, 03:20 PM   #3
malfi
Member
malfi began at the beginning.
 
Posts: 11
Karma: 14
Join Date: Nov 2010
Device: none
That is a different scenario, because there the articles can be excluded based on information delivered in the feed. I need to download an article to see whether it should be skipped. That information is in the content, not in a character string in the article title or part of the URL, as is the case in the example you posted.
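
A rough sketch of the content-based check described here, for anyone who wants to try the idea outside calibre. It uses only the Python 3 standard library instead of calibre's soup object, and the names TextCollector and is_subscriber_only are hypothetical. Unlike findAll(True, text='Subscribers'), which looks for a text node that is exactly 'Subscribers', this version does a looser substring match on the visible text.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Only text between tags arrives here; tag names and attributes do not.
        text = data.strip()
        if text:
            self.chunks.append(text)


def is_subscriber_only(html, marker='Subscribers'):
    """Return True if any text node of the page contains the marker word."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return any(marker in chunk for chunk in collector.chunks)


# Hypothetical sample pages, not taken from europeanvoice.com.
locked = '<div><h2>Subscribers</h2><p>Please log in.</p></div>'
free = '<div><p>Full article text.</p></div>'
```

The point is only that the decision needs the downloaded page, so it cannot be made from the feed entry alone.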
Old 11-08-2010, 03:56 PM   #4
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
This was actually a problem for me, and it is a solution for you.

Code:
    def preprocess_html(self, soup):
           self.log('\t checking for subscriber only content')
           denied = soup.findAll(True,text='Subscribers')
           print denied
           return soup
I think that might do the trick.

edit:
It will not do the trick. It will keep only the articles that need a subscription to be read. You need to invert the find; in other words, find a constant attribute in the other articles.
Write back if you didn't get it (it's late here and I am not thinking straight).

Last edited by marbs; 11-08-2010 at 04:06 PM.
Old 11-08-2010, 04:34 PM   #5
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by marbs View Post
You need to invert the find; in other words, find a constant attribute in the other articles.
Write back if you didn't get it (it's late here and I am not thinking straight).
Yes, it needs to be reversed. I believe he wants to return None for the articles he wants skipped. If an article is empty, calibre will skip it by default.
Old 11-09-2010, 12:30 AM   #6
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
good morning!

this is the answer:

Code:
    def preprocess_html(self, soup):
           self.log('\t checking for subscriber only content')
           denied = soup.findAll(True,text='Subscribers')
           if denied:
                self.log('\t skipped, because content can only be seen with subscription')
                print somthing
           return soup
Copy this into the recipe and it should work.
Old 11-09-2010, 12:45 PM   #7
malfi
Member
malfi began at the beginning.
 
Posts: 11
Karma: 14
Join Date: Nov 2010
Device: none
That is exactly what I wrote, except that you have "print somthing" instead of "return None".

Your example returns soup unconditionally, so I don't understand why that article would be skipped.

What I wrote does work to some extent, but the skipped articles are still in the index and displayed...
Old 11-09-2010, 01:07 PM   #8
malfi
Member
malfi began at the beginning.
 
Posts: 11
Karma: 14
Join Date: Nov 2010
Device: none
You're right, your solution works.

But I don't understand why "print undefined_variable" has the desired result while "return None" does not. Can you explain that?

What do I need to do to get this recipe included in the official calibre release?
Old 11-09-2010, 03:32 PM   #9
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 43,866
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Just post the final recipe here.
Old 11-09-2010, 03:58 PM   #10
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
You were returning None. That means that when calibre looked for the article, it found None, and that is still something: a None in place of a soup instance.
In my solution the recipe got to a stage where it ran something like:
article = preprocess_html(soup)
The caller was expecting to get a soup instance back; instead preprocess_html exited in the middle, because there is no variable called "something".

When calibre went to collect all the articles from the places they were saved in, it got to the article we messed up and said "this does not exist at all, I must not have downloaded this file", so it was not indexed.
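
The difference can be illustrated with a toy model. This is not calibre's fetch code, only a sketch under the assumption described above: an exception inside the preprocess hook makes the article count as a failed download, while a None result still counts as a finished one. The names build_index, returns_none and raises_error are hypothetical, and the sketch is written for Python 3, where the undefined-name trick is replaced by an explicit raise.

```python
def build_index(articles, preprocess):
    """Return the titles that end up in the index for a given hook."""
    index = []
    for title, html in articles:
        try:
            preprocess(html)
        except Exception as err:
            # The hook blew up: treat the article as never downloaded.
            print('skipping %s: %r' % (title, err))
            continue
        # Assumption of this model: any normal return, even None,
        # leaves the article in the index.
        index.append(title)
    return index


def returns_none(html):
    # Mirrors the first attempt in this thread.
    if 'Subscribers' in html:
        return None
    return html


def raises_error(html):
    # Mirrors the effect of "print something" with an undefined name,
    # but says what it means.
    if 'Subscribers' in html:
        raise ValueError('subscriber-only article')
    return html


# Hypothetical sample data.
ARTICLES = [
    ('Free article', '<p>Full text</p>'),
    ('Locked article', '<p>Subscribers</p>'),
]

index_with_none = build_index(ARTICLES, returns_none)
index_with_error = build_index(ARTICLES, raises_error)
```

In this model only the raising hook removes the locked article from the index, which matches what was observed with the recipe.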

If you want a recipe to be built in, post the full code here with a "hey Kovid, this is ready to be built in" or something along those lines.

edit:
I see Kovid beat me to it, so I'll throw in a "hey Kovid" myself.
hey Kovid
Old 11-10-2010, 01:43 PM   #11
malfi
Member
malfi began at the beginning.
 
Posts: 11
Karma: 14
Join Date: Nov 2010
Device: none
hey Kovid, this is ready to be built in ;-)

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class EuropeanVoice(BasicNewsRecipe):
    title          = u'European Voice'
    oldest_article = 14
    max_articles_per_feed = 100
    no_stylesheets = True
    cover_url = 'http://www.europeanvoice.com/Css/images/logo.gif'
    language = 'en'
    keep_only_tags    = [dict(name='div', attrs={'id':'articleLeftColumn'})]
    remove_tags    = [dict(name='div', attrs={'id':'BreadCrump'})]
    feeds          = [
                        (u'Whole site ',u'http://www.europeanvoice.com/Rss/2.xml'),
                          (u'News and analysis',u'http://www.europeanvoice.com/Rss/6.xml'),
                          (u'Comment',u'http://www.europeanvoice.com/Rss/7.xml'),
                          (u'Special reports',u'http://www.europeanvoice.com/Rss/5.xml'),
                          (u'People',u'http://www.europeanvoice.com/Rss/8.xml'),
                          (u'Career',u'http://www.europeanvoice.com/Rss/11.xml'),
                          (u'Policies',u'http://www.europeanvoice.com/Rss/4.xml'),
                          (u'EVents',u'http://www.europeanvoice.com/Rss/10.xml'),
                          (u'Policies - Economics',u'http://www.europeanvoice.com/Rss/31.xml'),
                          (u'Policies - Business',u'http://www.europeanvoice.com/Rss/19.xml'),
                          (u'Policies - Trade',u'http://www.europeanvoice.com/Rss/25.xml'),
                          (u'Policies - Information society',u'http://www.europeanvoice.com/Rss/20.xml'),
                          (u'Policies - Energy',u'http://www.europeanvoice.com/Rss/15.xml'),
                          (u'Policies - Transport',u'http://www.europeanvoice.com/Rss/18.xml'),
                          (u'Policies - Climate change',u'http://www.europeanvoice.com/Rss/16.xml'),
                          (u'Policies - Environment',u'http://www.europeanvoice.com/Rss/17.xml'),
                          (u'Policies - Farming & food',u'http://www.europeanvoice.com/Rss/23.xml'),
                          (u'Policies - Health & society',u'http://www.europeanvoice.com/Rss/24.xml'),
                          (u'Policies - Justice',u'http://www.europeanvoice.com/Rss/29.xml'),
                          (u'Policies - Foreign affairs',u'http://www.europeanvoice.com/Rss/27.xml')
                     ]
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''
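    def print_version(self, url):
        # Append ?bPrint=1 to the article URL to request the print version.
        return url + '?bPrint=1'

    def preprocess_html(self, soup):
        self.log('\t checking for subscriber only content')
        denied = soup.findAll(True, text='Subscribers')
        if denied:
            self.log('\t skipped, because content can only be seen with subscription')
            # Deliberate hack: "something" is undefined, so this line raises a
            # NameError, the article download fails and it is left out of the index.
            print something
        return soup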
Old 11-10-2010, 01:56 PM   #12
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 43,866
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Will be in next release