07-25-2012, 03:35 AM | #1 |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
Sorry for posting so many questions here recently. I'm trying to write a recipe using BeautifulSoup, but when I use findAll to find all the articles, only the first article is found. Otherwise there's no problem with the recipe.
Update: The latest version of Calibre has solved the problem. Many thanks to the author, who fixed the bug, and to NotTaken, who provided a way to get around the issue in the old version.
Last edited by Steven630; 08-11-2012 at 07:46 AM. |
07-27-2012, 07:35 AM | #2 |
Evangelist
Posts: 408
Karma: 1050547
Join Date: Mar 2011
Device: Kindle Oasis 2
|
Very strange, this works for me:
PHP Code:
|
07-27-2012, 08:23 AM | #3 | |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
Quote:
It now works fine even with the original recipe.
Last edited by Steven630; 08-11-2012 at 07:45 AM. |
|
07-27-2012, 09:14 AM | #4 |
Evangelist
Posts: 408
Karma: 1050547
Join Date: Mar 2011
Device: Kindle Oasis 2
|
PHP Code:
PHP Code:
In case this is just a bug that you introduced while trying to find the actual mistake, I suggest you check all the control structures that cause the current loop to continue prematurely. That is, print the variables you are checking against and look at their values; this will hopefully show you where and why the unexpected behavior occurs. Something else I noticed: when you filter the tags by class name you sometimes pass a list instead of a plain string (the square brackets around e.g. "package-link"), as you did in your original post. This is not needed, and I'm not sure it even works.
Last edited by cryzed; 07-27-2012 at 09:39 AM. |
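The debugging advice above can be sketched roughly as follows. This is a minimal illustration, not code from the recipe: collect_articles, the dictionary keys and the sample data are all made up for the example, and plain dicts stand in for BeautifulSoup tags so that it runs without Calibre.

```python
def collect_articles(links, log=print):
    """Keep links that have a title and an href, logging why the rest are skipped."""
    kept = []
    for index, link in enumerate(links):
        title = link.get('title')
        href = link.get('href')
        if not title:
            # Log the values being tested before leaving this iteration early.
            log('skip #%d: empty title, href=%r' % (index, href))
            continue
        if not href:
            log('skip #%d: missing href, title=%r' % (index, title))
            continue
        kept.append((title, href))
    return kept


# Hypothetical sample data; in a real recipe these would be tags from the soup.
sample = [
    {'title': 'First', 'href': '/a'},
    {'title': '', 'href': '/b'},
    {'title': 'Third'},
]
messages = []
result = collect_articles(sample, log=messages.append)
```

In a recipe, self.log could be passed as the log argument so the messages end up in the download log.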
07-27-2012, 09:51 AM | #5 |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
You're right. But omitting the tag argument would, if anything, return more results, not fewer, because it finds all tags that match the class attribute. I did have the tag 'a' at first, but that didn't work. That is why I removed it, to see whether anything would change. The square brackets are indeed unnecessary. Still, they don't affect the result; I have other recipes written this way without problems. Thank you for checking the code. Did you run the recipe with Calibre? It finds only the first article of each section however I modify the recipe, with or without the brackets or the other unnecessary parts. Since all the sections are found, I guess the problem is not that the recipe terminates suddenly. I am really at a loss.
Last edited by Steven630; 07-27-2012 at 09:58 AM. |
07-27-2012, 10:45 AM | #6 |
Evangelist
Posts: 408
Karma: 1050547
Join Date: Mar 2011
Device: Kindle Oasis 2
|
Okay, I've never used the recipe feature of Calibre before; the only thing I know is that the code works outside of Calibre. I did what I suggested to you: I wrote the section HTML to a file and checked its contents:
PHP Code:
|
07-27-2012, 11:35 AM | #7 |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
Thanks. That's why I am so confused. It is supposed to work. Removing tags and so on deals with downloaded contents, not what is downloaded in the first place. Then there seems to be nothing left that can affect the scraping...
|
07-27-2012, 11:46 AM | #8 |
Evangelist
Posts: 408
Karma: 1050547
Join Date: Mar 2011
Device: Kindle Oasis 2
|
As I said, it might very well be an outdated version of BeautifulSoup that does not use html5lib as its default backend parser. You could try installing the beautifulsoup4 module and importing it manually, or contact the Calibre author and ask him to update his third-party modules.
|
07-27-2012, 11:59 AM | #9 |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
I guess so. Don't know if kovidgoyal would see this thread or look into the issue. He's quite busy. Thank you for all that. I really appreciate it.
|
07-27-2012, 09:44 PM | #10 |
Connoisseur
Posts: 65
Karma: 4640
Join Date: Aug 2011
Device: kindle
|
The unordered list is being moved from inside the section tags to after them (possibly due to some non-compliant nesting). You could try a workaround like this:
Code:
def find_articles(self, section):
    for post in section.findAll(attrs={'class':'package-link'}):
        title = self.tag_to_string(post)
        url = post['href']
        if url.startswith('/'):
            url = 'http://www.economist.com'+url+'/print'
        self.log('\tFound article:', title, 'at', url)
        yield {'title':title, 'url':url, 'description':'', 'date':''}

def parse_index(self):
    soup = self.index_to_soup(self.INDEX)
    feeds = []
    for section in soup.findAll('section'):
        h1 = section.find('h1')
        if h1 is None:
            continue
        section_title = self.tag_to_string(h1).strip()
        self.log('Found section: %s'%section_title)
        articles = []
        articles.extend(self.find_articles(section))
        ul = section.findNextSibling('ul')
        if ul:
            articles.extend(self.find_articles(ul))
        if articles:
            feeds.append((section_title, articles))
    return feeds
|
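The idea behind the workaround in post #10 can be sketched outside Calibre roughly like this. It is only an illustration under stated assumptions: xml.etree.ElementTree stands in for BeautifulSoup so the snippet runs with the standard library alone, the markup in PAGE is invented to mimic a list that ended up after its section, and links_in and next_sibling are hypothetical helpers (next_sibling loosely imitates what findNextSibling does).

```python
import xml.etree.ElementTree as ET

# Invented markup: the <ul> has ended up after the <section>, not inside it.
PAGE = """
<div>
  <section>
    <h1>Britain</h1>
    <a class="package-link" href="/node/1">Lead story</a>
  </section>
  <ul>
    <li><a class="package-link" href="/node/2">Second story</a></li>
    <li><a class="package-link" href="/node/3">Third story</a></li>
  </ul>
</div>
""".strip()


def links_in(element):
    """Return (title, href) for every package-link anchor below element."""
    return [(a.text, a.get('href'))
            for a in element.iter('a')
            if a.get('class') == 'package-link']


def next_sibling(parent, child, tag):
    """Return the first sibling after child with the given tag, or None."""
    children = list(parent)
    for candidate in children[children.index(child) + 1:]:
        if candidate.tag == tag:
            return candidate
    return None


root = ET.fromstring(PAGE)
section = root.find('section')

# Searching only inside the section finds just the first article.
inside_only = links_in(section)

# Also searching the following <ul> picks up the remaining ones.
ul = next_sibling(root, section, 'ul')
with_sibling = inside_only + (links_in(ul) if ul is not None else [])
```

Searching the section alone yields one link here, which matches the symptom described in the first post; adding the sibling list yields all three.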
07-28-2012, 12:08 AM | #11 |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
|
08-18-2012, 09:51 AM | #12 | |
Enthusiast
Posts: 49
Karma: 475062
Join Date: Aug 2012
Device: nook simple touch
|
Quote:
Code:
for section in soup.findAll(attrs={'class':['topnews','left','right']}):
    section_title = self.tag_to_string(section.findAll(['span','h2']))
AttributeError: 'ResultSet' object has no attribute 'contents'
If I change section.findAll() to section.find(), it works, but only the first article found is downloaded.
Last edited by lrui; 08-18-2012 at 10:26 AM. |
|
08-18-2012, 11:49 AM | #13 |
creator of calibre
Posts: 43,844
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
findAll returns a set of tags; you need to pass a single tag to tag_to_string.
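A rough sketch of the difference, using simplified stand-ins rather than the real classes: Tag, find_all, find and tag_to_string below are hypothetical imitations written for this example (the real tag_to_string is assumed to read .contents only because the traceback above mentions that attribute).

```python
class Tag:
    """Tiny stand-in for a parsed tag: a name plus its text content."""
    def __init__(self, name, text):
        self.name = name
        self.contents = [text]


def tag_to_string(tag):
    # Stand-in helper: expects ONE tag and reads its .contents.
    return ''.join(tag.contents)


def find_all(tags, names):
    # Like findAll: always returns a list, even for a single match.
    return [t for t in tags if t.name in names]


def find(tags, names):
    # Like find: returns the first match only, or None.
    matches = find_all(tags, names)
    return matches[0] if matches else None


tags = [Tag('h2', 'World'), Tag('p', 'ignored'), Tag('span', 'Business')]
headings = find_all(tags, ['span', 'h2'])

# Passing the whole list fails, because a list has no .contents attribute.
try:
    tag_to_string(headings)
    error = None
except AttributeError as exc:
    error = str(exc)

# Looping over the result handles every match, one tag at a time.
titles = [tag_to_string(t) for t in headings]

# Using find() works too, but only ever sees the first match.
first_title = tag_to_string(find(tags, ['span', 'h2']))
```

So the choice is between find (one tag, first match only) and a loop over findAll (one tag_to_string call per match).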
|
08-19-2012, 02:44 AM | #14 |
Enthusiast
Posts: 49
Karma: 475062
Join Date: Aug 2012
Device: nook simple touch
|
|
|