07-25-2012, 03:35 AM   #1
Steven630
Groupie
Steven630 began at the beginning.
 
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
Sorry that I've posted so many questions here recently. I'm trying to make a recipe using BeautifulSoup, but when I use findAll to find all the articles, only the first article is found. There's otherwise no problem with the recipe.

Update: The latest version of Calibre has solved the problem. Many thanks to the author, who fixed the bug, and to NotTaken, who provided a way to get around the issue in the old version.

Last edited by Steven630; 08-11-2012 at 07:46 AM.
07-27-2012, 07:35 AM   #2
cryzed
Evangelist
cryzed ought to be getting tired of karma fortunes by now.
Posts: 408
Karma: 1050547
Join Date: Mar 2011
Device: Kindle Oasis 2
Very strange, this works for me:
Code:
for section in soup.findAll('section'):
    for post in section.findAll('a', {'class': 'package-link'}):
        print(post)
Are you sure that you aren't somehow terminating your for-loop prematurely or have some specific control structures that prevent the post from getting printed? More code would be helpful.
07-27-2012, 08:23 AM   #3
Steven630
Quote:
Originally Posted by cryzed View Post
Are you sure that you aren't somehow terminating your for-loop prematurely or have some specific control structures that prevent the post from getting printed? More code would be helpful.
Thank you. I've been waiting for someone to reply. I don't think so. I've repeatedly checked the code and even tried findAll('a'), to no avail. Here's the recipe. Would you please take a look?

It now works fine even with the original recipe.

Last edited by Steven630; 08-11-2012 at 07:45 AM.
07-27-2012, 09:14 AM   #4
cryzed
Code:
for post in section.findAll(attrs={'class': 'package-link'}):
You are missing the tag argument when you call "section.findAll". It should look like this:
Code:
for post in section.findAll('a', {'class': 'package-link'}):
Also, using the "attrs" keyword parameter is not needed in this case; just make your calls without it. Additionally, the check "if post is None" will never be true, or at least should not be, so you should be able to remove that statement entirely.

In case this is just a bug that you introduced while trying to find the actual mistake, I suggest you check all the control structures that cause the current loop to continue prematurely. That is, print the variables you are checking against and see what their values are; this will hopefully show you where and why the unexpected behavior occurs.

Something else I noticed is that when you filter the tags by their class name you sometimes pass a list instead of a simple string, i.e. like you did in your original post -- this is not needed and I'm not sure if it even works. (I mean the square brackets around e.g. "package-link")

Last edited by cryzed; 07-27-2012 at 09:39 AM.
07-27-2012, 09:51 AM   #5
Steven630
You're right. But omitting the tag argument would, if anything, return more results, not fewer, because it finds all tags that match the class attribute. I did have the tag 'a' at first, but that didn't work. That is why I removed it, to see whether anything would be different. The square brackets are indeed unnecessary. Still, they don't affect the result; I have other recipes written this way without problems. Thank you for checking the code. Did you run the recipe with Calibre? It finds only the first article of each section however I modify the recipe, with or without the brackets or the other unnecessary parts. Since all the sections are found, I guess the problem is not that the recipe terminates early. I am really at a loss.
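
A minimal sketch of the point about omitting the tag argument, using xml.etree.ElementTree from the standard library as a stand-in for BeautifulSoup so that it runs without Calibre. The find_all helper and the sample markup are hypothetical and not part of the recipe; they only illustrate that a class-only filter matches a superset of what a tag-plus-class filter matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two elements share the class, only one is a link.
doc = ET.fromstring(
    '<section>'
    '<a class="package-link" href="/node/1">Linked title</a>'
    '<span class="package-link">Not a link</span>'
    '<a class="other" href="/node/2">Other link</a>'
    '</section>'
)


def find_all(root, tag=None, cls=None):
    """Return descendants matching an optional tag name and optional class."""
    return [
        el for el in root.iter()
        if el is not root
        and (tag is None or el.tag == tag)
        and (cls is None or cls in (el.get('class') or '').split())
    ]


by_class = find_all(doc, cls='package-link')                   # <a> and <span>
by_tag_and_class = find_all(doc, tag='a', cls='package-link')  # <a> only
```

Dropping the tag can therefore only widen the match, so it cannot be the reason that articles go missing.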

Last edited by Steven630; 07-27-2012 at 09:58 AM.
07-27-2012, 10:45 AM   #6
cryzed
Okay, I've never used the recipe feature of Calibre before; the only thing I know is that the code works outside of Calibre. I did what I suggested to you, wrote the section HTML to a file and checked its contents:
Code:
<h1 class="fly-title">Business</h1>
<article>
<h2><a href="/node/21537920" class="package-link">Virgin territory</a></h2>
</article>
I have no clue why there's only one article in the scraped section. It could be some Calibre-specific stripping of tags, or a parsing error caused by outdated modules (BeautifulSoup instead of bs4). I'm sorry; maybe ask someone more familiar with Calibre-specific scraping.
07-27-2012, 11:35 AM   #7
Steven630
Thanks. That's why I am so confused. It is supposed to work. Removing tags and so on deals with downloaded contents, not what is downloaded in the first place. Then there seems to be nothing left that can affect the scraping...
07-27-2012, 11:46 AM   #8
cryzed
As I said, it might very well be an outdated version of BeautifulSoup, which does not support html5lib as a backend parser. What you could try is to install the beautifulsoup4 module and then import and use it manually, or contact the Calibre author and request that he update his third-party modules.
07-27-2012, 11:59 AM   #9
Steven630
I guess so. Don't know if kovidgoyal would see this thread or look into the issue. He's quite busy. Thank you for all that. I really appreciate it.
07-27-2012, 09:44 PM   #10
NotTaken
Connoisseur
NotTaken is fluent in JavaScript as well as Klingon.
Posts: 65
Karma: 4640
Join Date: Aug 2011
Device: kindle
The unordered list is being moved from within the section tags to after them (possibly due to some non-compliant nesting). You could try a workaround like this:

Code:
    def find_articles(self, section):
        # Yield one article dict per element carrying the 'package-link' class.
        for post in section.findAll(attrs={'class': 'package-link'}):
            title = self.tag_to_string(post)
            url = post['href']
            if url.startswith('/'):
                # Relative link: make it absolute and ask for the print version.
                url = 'http://www.economist.com' + url + '/print'
            self.log('\tFound article:', title, 'at', url)
            yield {'title': title, 'url': url, 'description': '', 'date': ''}

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        feeds = []
        for section in soup.findAll('section'):
            h1 = section.find('h1')
            if h1 is None:
                continue
            section_title = self.tag_to_string(h1).strip()
            self.log('Found section: %s' % section_title)

            # Articles that are still inside the <section> tag.
            articles = []
            articles.extend(self.find_articles(section))

            # Articles in the <ul> that was moved out of the section.
            ul = section.findNextSibling('ul')
            if ul:
                articles.extend(self.find_articles(ul))

            if articles:
                feeds.append((section_title, articles))
        return feeds
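
To see why looking at the next sibling recovers the missing articles, here is a rough sketch that runs without Calibre. It uses xml.etree.ElementTree from the standard library as a stand-in for BeautifulSoup, and the tree is an assumption about what the parser produced: the first article is taken from the section dump earlier in the thread, while the two links inside the list, and the helpers package_links and next_sibling, are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical parsed tree: the first article stays inside <section>,
# while the <ul> holding the others ends up as its next sibling.
# The two /node/... links in the <ul> are invented for this example.
body = ET.fromstring(
    '<body>'
    '<section>'
    '<h1 class="fly-title">Business</h1>'
    '<article><h2><a href="/node/21537920" class="package-link">'
    'Virgin territory</a></h2></article>'
    '</section>'
    '<ul>'
    '<li><a href="/node/1" class="package-link">Second article</a></li>'
    '<li><a href="/node/2" class="package-link">Third article</a></li>'
    '</ul>'
    '</body>'
)


def package_links(node):
    """Return the text of every <a class="package-link"> under node."""
    return [a.text for a in node.iter('a') if a.get('class') == 'package-link']


def next_sibling(parent, child, tag):
    """Return the element right after child if it has the given tag, else None."""
    children = list(parent)
    index = children.index(child)
    if index + 1 < len(children) and children[index + 1].tag == tag:
        return children[index + 1]
    return None


section = body.find('section')
titles = package_links(section)       # only what is inside <section>
ul = next_sibling(body, section, 'ul')
if ul is not None:
    titles.extend(package_links(ul))  # plus the relocated list
```

Searching the section alone gives a single title; adding the sibling list gives all three, which matches the symptom described in this thread.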
07-28-2012, 12:08 AM   #11
Steven630
Quote:
Originally Posted by NotTaken View Post
The unordered list is being moved from within the section tags to after them (possibly due to some non-compliant nesting). You could try a workaround like this:
Nice to see you again! It worked, though I still can't figure out why the original recipe failed.
08-18-2012, 09:51 AM   #12
lrui
Enthusiast
lrui ought to be getting tired of karma fortunes by now.
Posts: 49
Karma: 475062
Join Date: Aug 2012
Device: nook simple touch
Quote:
Originally Posted by Steven630 View Post
Sorry that I've posted so many questions here recently. I'm trying to make a recipe using beautifulsoup. But when I use findAll to find all the articles, only the first article was found. There's otherwise no problem with the recipe

Update: The latest version of Calibre has solved the problem. Many thanks to the author, who fixed the bug and NotTaken, who provided a way to get around the issue in the old version.
I still have a similar problem with code like this:
Code:
for section in soup.findAll(attrs={'class': ['topnews', 'left', 'right']}):
    section_title = self.tag_to_string(section.findAll(['span', 'h2']))
and then I get this error, and I don't know why:
AttributeError: 'ResultSet' object has no attribute 'contents'

If I change section.findAll() to section.find(), it works, but only the first article found is downloaded.

Last edited by lrui; 08-18-2012 at 10:26 AM.
08-18-2012, 11:49 AM   #13
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 43,844
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
findAll returns a list of tags (a ResultSet); you need to pass a single tag to tag_to_string.
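
A small sketch of the difference, runnable without Calibre. It uses xml.etree.ElementTree from the standard library in place of BeautifulSoup, and the tag_to_string function here is a hypothetical stand-in for the recipe helper; the headlines are made up. The shape of the mistake is the same: the helper expects one tag and is handed a list.

```python
import xml.etree.ElementTree as ET


def tag_to_string(tag):
    """Hypothetical stand-in for the recipe helper: text of ONE element."""
    return ''.join(tag.itertext()).strip()


section = ET.fromstring(
    '<div class="topnews">'
    '<h2>First headline</h2>'
    '<span>Second headline</span>'
    '</div>'
)

# Like findAll, this yields a list of matches, not a single tag.
matches = [el for el in section.iter() if el.tag in ('span', 'h2')]

try:
    tag_to_string(matches)            # wrong: a list is not a tag
    error = None
except AttributeError as exc:
    error = str(exc)

# Right: convert each tag on its own.
titles = [tag_to_string(el) for el in matches]
```

In a recipe the equivalent would be to loop over the result of findAll and call tag_to_string on each item, instead of switching to find, which stops at the first match.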
08-19-2012, 02:44 AM   #14
lrui
Quote:
Originally Posted by kovidgoyal View Post
findAll returns a set of tags you need to pass a single tag to tag_to_string
Got it, thanks.
I made some changes and it works now.

All times are GMT -4.