Beautiful soup findAll doesn't seem to work

Steven630 · 07-25-2012, 03:35 AM

Sorry that I've posted so many questions here recently. I'm trying to make a recipe using BeautifulSoup, but when I use findAll to find all the articles, only the first article is found. There's otherwise no problem with the recipe.

Update: The latest version of Calibre has solved the problem. Many thanks to the author, who fixed the bug, and to NotTaken, who provided a way to get around the issue in the old version.

cryzed · 07-27-2012, 07:35 AM

Very strange, this works for me:

Code:

    for section in soup.findAll('section'):
        for post in section.findAll('a', {'class': 'package-link'}):
            print(post)

Are you sure that you aren't somehow terminating your for-loop prematurely or have some specific control structures that prevent the post from getting printed? More code would be helpful.

Steven630 · 07-27-2012, 08:23 AM

Quote:

Originally Posted by cryzed

Are you sure that you aren't somehow terminating your for-loop prematurely or have some specific control structures that prevent the post from getting printed? More code would be helpful.

Thank you. I've been waiting for someone to reply. I don't think so. I've repeatedly checked the code and even tried findAll('a'), to no avail. Here's the recipe. Would you please take a look?

It now works fine, even with the original recipe.

cryzed · 07-27-2012, 09:14 AM

Code:

    for post in section.findAll(attrs={'class':'package-link'}):

You are missing the tag argument when you call "section.findAll". It should look like this:

Code:

    for post in section.findAll('a', {'class': 'package-link'}):

Also, the "attrs" keyword parameter is not needed in this case; just pass the dictionary as the second positional argument. Additionally, the check "if post is None" will never be true, or at least should not be, so you should be able to remove that statement entirely.

In case this is just a bug that you introduced while trying to find the actual mistake, I suggest you check every control structure that could cause the current loop to continue prematurely. That is, print the variables you are checking against and see what their values are; this will hopefully show you where and why the unexpected behavior occurs.

Something else I noticed: when you filter tags by class name you sometimes pass a list instead of a simple string, as you did in your original post (I mean the square brackets around e.g. "package-link"). This is not needed, and I'm not sure it even works.
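
On the question of whether a list works as a class filter, here is a minimal sketch. It assumes the standalone bs4 package with the built-in "html.parser" backend rather than the BeautifulSoup copy bundled with Calibre in 2012, so behavior there may differ. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; not the page from the thread.
html = """
<a href="/node/1" class="package-link">One</a>
<a href="/node/2" class="other">Two</a>
<a href="/node/3" class="package-link">Three</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Class given as a plain string.
by_string = soup.find_all("a", {"class": "package-link"})

# Class given as a one-element list: a tag matches if its class
# equals any item in the list.
by_list = soup.find_all("a", {"class": ["package-link"]})

string_hrefs = [a["href"] for a in by_string]
list_hrefs = [a["href"] for a in by_list]
print(string_hrefs == list_hrefs)
```

In bs4, find_all is the current name for findAll. Under these assumptions both calls select the same links, so the brackets look redundant rather than harmful.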

Steven630 · 07-27-2012, 09:51 AM

You're right. But omitting the tag argument would, if anything, return more results, not fewer, because it finds every tag that matches the class attribute. I did have the tag 'a' at first, but that didn't work, which is why I removed it to see whether anything would change. The square brackets are indeed unnecessary. Still, they don't affect the result; I have other recipes written this way that work without problems. Thank you for checking the code. Did you run the recipe with Calibre? It finds only the first article of each section however I modify the recipe, with or without the brackets or the other unnecessary parts. Since all the sections are found, I guess the problem isn't that the recipe terminates early. I am really at a loss.
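
The point about the tag argument can be checked with a small sketch. It assumes the standalone bs4 package with the "html.parser" backend (not Calibre's bundled module), and the markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; not the real page from the thread.
html = """
<section>
  <h2><a href="/node/1" class="package-link">First</a></h2>
  <span class="package-link">Not a link</span>
  <a href="/node/2" class="other">Second</a>
</section>
"""
soup = BeautifulSoup(html, "html.parser")
section = soup.find("section")

# Tag name plus class: only <a> elements carrying the class.
links_only = section.find_all("a", {"class": "package-link"})

# Class alone: every element carrying the class, whatever its tag name.
any_tag = section.find_all(attrs={"class": "package-link"})

link_names = [tag.name for tag in links_only]
any_names = [tag.name for tag in any_tag]
print(link_names)
print(any_names)
```

Dropping the tag name can only widen the match, so it would not explain articles going missing.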

cryzed · 07-27-2012, 10:45 AM

Okay, I've never used the recipe feature of Calibre before; the only thing I know is that it works outside of Calibre. I did what I suggested to you and wrote the section HTML to a file and checked its contents:

Code:

    <h1 class="fly-title">Business</h1>
    <article>
    <h2><a href="/node/21537920" class="package-link">Virgin territory</a></h2>
    </article>

I have no clue why there's only one article in the scraped section. Possibly it's some Calibre-specific stripping of tags, or a parsing error due to outdated modules (BeautifulSoup instead of bs4). I'm sorry; maybe ask someone more familiar with Calibre-specific scraping.
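
For anyone who wants to repeat this check, here is a rough sketch of dumping one parsed section to a file. It assumes the standalone bs4 package with the "html.parser" backend; the markup and the output path are hypothetical. Inside a recipe the soup would come from the recipe itself.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Made-up markup standing in for the downloaded index page.
html = """
<section>
<h1 class="fly-title">Business</h1>
<article><h2><a href="/node/1" class="package-link">One</a></h2></article>
</section>
"""
soup = BeautifulSoup(html, "html.parser")
section = soup.find("section")

# Hypothetical output path; any writable location will do.
dump_path = os.path.join(tempfile.gettempdir(), "section_dump.html")
with open(dump_path, "w", encoding="utf-8") as fh:
    # prettify() serialises the tree as the parser built it, which is
    # what findAll actually searches.
    fh.write(section.prettify())

# Read the dump back and count the links that survived parsing.
with open(dump_path, encoding="utf-8") as fh:
    dumped = fh.read()
link_count = dumped.count('class="package-link"')
print(link_count)
```

Comparing the dump with the page source shows whether the parser dropped or moved elements before any search ran.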

Steven630 · 07-27-2012, 11:35 AM

Thanks. That's why I am so confused. It is supposed to work. Removing tags and so on deals with downloaded contents, not what is downloaded in the first place. Then there seems to be nothing left that can affect the scraping...

cryzed · 07-27-2012, 11:46 AM

As I said, it might very well be an outdated version of BeautifulSoup which does not use html5lib as its default backend parser. You could try installing the beautifulsoup4 module and then importing and using it manually, or contact the Calibre author and request that he update his third-party modules.

Steven630 · 07-27-2012, 11:59 AM

I guess so. Don't know if kovidgoyal would see this thread or look into the issue. He's quite busy. Thank you for all that. I really appreciate it.

NotTaken · 07-27-2012, 09:44 PM

The unordered list is being moved from within the section tags to after them (possibly due to some non-compliant nesting). You could try a workaround like this:

Code:

    def find_articles(self, section):
        # Yield one article dict per element carrying the 'package-link' class.
        for post in section.findAll(attrs={'class': 'package-link'}):
            title = self.tag_to_string(post)
            url = post['href']
            if url.startswith('/'):
                # Make site-relative links absolute and append '/print'.
                url = 'http://www.economist.com' + url + '/print'
            self.log('\tFound article:', title, 'at', url)
            yield {'title': title, 'url': url, 'description': '', 'date': ''}

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        feeds = []
        for section in soup.findAll('section'):
            h1 = section.find('h1')
            if h1 is None:
                continue
            section_title = self.tag_to_string(h1).strip()
            self.log('Found section: %s' % section_title)

            articles = []
            articles.extend(self.find_articles(section))

            # The <ul> ends up after the <section> in the parsed tree,
            # so also scan the sibling list that follows it.
            ul = section.findNextSibling('ul')
            if ul:
                articles.extend(self.find_articles(ul))

            if articles:
                feeds.append((section_title, articles))
        return feeds
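
The sibling lookup at the heart of this workaround can be tried outside Calibre. The sketch below assumes the standalone bs4 package with the "html.parser" backend, where findNextSibling is spelled find_next_sibling. The markup is hypothetical and only mimics the parsed tree described above, and links_in is a helper invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the parsed tree: the <ul> sits after
# </section> as a sibling instead of inside it.
html = """
<section>
  <h1>Business</h1>
  <article><h2><a href="/node/1" class="package-link">One</a></h2></article>
</section>
<ul>
  <li><a href="/node/2" class="package-link">Two</a></li>
  <li><a href="/node/3" class="package-link">Three</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
section = soup.find("section")


def links_in(node):
    # Illustrative helper: collect the hrefs of matching links under node.
    return [a["href"] for a in node.find_all("a", {"class": "package-link"})]


inside = links_in(section)

# Searching the section alone misses the list, so look at the next
# <ul> sibling as well.
ul = section.find_next_sibling("ul")
outside = links_in(ul) if ul is not None else []

found = inside + outside
print(found)
```

Searching only the section returns a single link here, which matches the symptom reported in the thread.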

Steven630 · 07-28-2012, 12:08 AM

Quote:

Originally Posted by NotTaken

The unordered list is being moved from within the section tags to after them (possibly due to some non-compliant nesting). You could try a workaround like this:

Nice to see you again!

It worked, though I still can't figure out why the original recipe failed.

lrui · 08-18-2012, 09:51 AM

Quote:

Originally Posted by Steven630

Sorry that I've posted so many questions here recently. I'm trying to make a recipe using BeautifulSoup, but when I use findAll to find all the articles, only the first article is found. There's otherwise no problem with the recipe.

Update: The latest version of Calibre has solved the problem. Many thanks to the author, who fixed the bug, and to NotTaken, who provided a way to get around the issue in the old version.

I still have a problem like this:

Code:

    for section in soup.findAll(attrs={'class':['topnews','left','right']}):
        section_title = self.tag_to_string(section.findAll(['span','h2']))

Then I get this error, and I don't know why:

    AttributeError: 'ResultSet' object has no attribute 'contents'

If I change section.findAll() to section.find(), it works, but only the first article found is downloaded.

kovidgoyal · 08-18-2012, 11:49 AM

findAll returns a set of tags; you need to pass a single tag to tag_to_string.
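
A minimal sketch of that fix, assuming the standalone bs4 package with the "html.parser" backend. The tag_to_string function here is a hypothetical stand-in for Calibre's method of the same name, and the markup is invented.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<div class="topnews">
  <h2>Headline</h2>
  <span>Teaser</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
section = soup.find(attrs={"class": "topnews"})


def tag_to_string(tag):
    # Hypothetical stand-in for calibre's tag_to_string: takes ONE tag.
    return tag.get_text().strip()


# find_all (findAll in the old module) returns a list-like ResultSet,
# so convert each tag separately instead of passing the whole result.
tags = section.find_all(["span", "h2"])
titles = [tag_to_string(tag) for tag in tags]

# find returns only the first match, which is why switching to it
# silences the error but yields a single item.
first_title = tag_to_string(section.find(["span", "h2"]))

print(titles)
print(first_title)
```

Looping over the result keeps every match, whereas find() keeps only the first.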

lrui · 08-19-2012, 02:44 AM

Quote:

Originally Posted by kovidgoyal

findAll returns a set of tags; you need to pass a single tag to tag_to_string.

Got it, thanks.
I made some changes and it works now.

