MobileRead Forums - View Single Post

ireadtheinternet · 11-15-2014, 01:16 PM

Ever heard of self.index_to_soup returning a soup with a HEAD but not a BODY?

I had my recipe almost working, and then either I broke it during one of my fast and furious edits (very likely) or the site was redesigned slightly, because I can't get any older versions of the recipe to work either (a few were on my Windows 7 Previous Versions tab). I am happy to take my lumps for not bothering with version control and rebuild this little recipe piece by piece, but I noticed something that seemed odd and wondered if it sounds familiar at all: the soup I get back from index_to_soup has only the HEAD and not the BODY. It was working fine until either I broke it (probably) or the IMDB site broke it.

Code:

    def build_section(self, url):
        # this method is called from parse_index(self)
        articles = []
        section_toc = self.index_to_soup(url)

        # confirms this gets the page, but the output ends with </HEAD> </HTML> and no "BODY"
        print(url)
        print(section_toc.prettify())

        movies_on_page = section_toc.find(name='div', attrs={'id': 'main'})

        if movies_on_page is None:
            print('movies_on_page is None')

        # there's more to it, but it is not going to work, if the above won't
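
One way to narrow this down is to inspect the raw markup before any parsing happens, which separates "the download is missing the BODY" from "the parser dropped it". The sketch below is only that: describe_markup is a hypothetical helper name, the sample strings are made up, and the commented call assumes your calibre version's index_to_soup accepts raw=True to hand back the unparsed page.

```python
import re


def describe_markup(raw_html):
    """Report which landmarks appear in raw, unparsed markup."""
    if isinstance(raw_html, bytes):
        raw_html = raw_html.decode('utf-8', 'replace')
    return {
        'length': len(raw_html),
        'has_head': re.search(r'<head\b', raw_html, re.I) is not None,
        'has_body': re.search(r'<body\b', raw_html, re.I) is not None,
        'has_main_div': re.search(
            r'''<div\b[^>]*\bid=["']main["']''', raw_html, re.I) is not None,
    }


# Inside the recipe this might look like (assumption: raw=True is supported):
#     raw = self.index_to_soup(url, raw=True)
#     print(describe_markup(raw))

# Made-up sample markup, just to show the shape of the result.
complete = ('<html><head><title>t</title></head>'
            '<body><div id="main"><a href="/x">x</a></div></body></html>')
truncated = b'<html><head><title>t</title></head></html>'

print(describe_markup(complete))
print(describe_markup(truncated))
```

If has_body is True for the raw page but the soup still ends at </HEAD>, the parsing step is the likelier culprit. If it is False, the recipe is probably being served a different page than the standalone script gets.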

I noticed there is some funky JavaScript that seems to be outputting the body, including !DOCTYPE directives. I don't know if it was always that way, but regardless, the page still parses fine using BeautifulSoup3 outside of a calibre recipe. The following code has the expected output, but the same loop in my recipe can't find the #main div or even any <a> links:

Code:

import requests
from BeautifulSoup import BeautifulSoup  # bs3

# Parentheses are needed so the two string literals join into one URL.
URL = ('http://www.imdb.com/search/title?sort=year,desc&'
       'production_status=released&title_type=feature')

r = requests.get(URL)
section_toc = BeautifulSoup(r.text)
print(section_toc.prettify())

# Every link on the page; .get() avoids a KeyError on anchors without an href.
for tag in section_toc.findAll('a'):
    print(tag.get('href'))

main = section_toc.find(name='div', attrs={'id': 'main'})

# Then only the links inside the #main div.
for link in main.findAll('a'):
    print(link.get('href'))
    print(link.string)
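
If the inline JavaScript mentioned above turns out to be what confuses the parser inside the recipe (an assumption, not something confirmed here), one experiment is to remove the script blocks from the raw markup before building the soup. A minimal sketch, with a hypothetical helper name strip_scripts and made-up sample markup:

```python
import re

# Matches a whole <script ...>...</script> block, across newlines, any case.
SCRIPT_RE = re.compile(r'<script\b.*?</script\s*>', re.I | re.S)


def strip_scripts(raw_html):
    """Return raw_html with every <script> block removed."""
    return SCRIPT_RE.sub('', raw_html)


# Made-up sample: a script that writes markup, followed by the real body.
sample = (
    '<html><head>'
    '<script>document.write("<!DOCTYPE html><body>");</script>'
    '</head>'
    '<body><div id="main"><a href="/example/">Example</a></div></body></html>'
)

cleaned = strip_scripts(sample)
print(cleaned)
```

Inside a recipe, the cleaned string could then be handed to index_to_soup in place of the URL, assuming your calibre version accepts raw markup there. If the #main div shows up after stripping, the script blocks are worth a closer look; if not, the cause is somewhere else.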

By the way, I commented out the keep_only_tags and remove_tags sections in my recipe to rule them out as well. No luck: my code still can't find the #main DIV, or any links, in the original soup.

I didn't post the whole recipe because it's about 130 lines now, and it's my responsibility to find what is probably a logic error I introduced somewhere. I just wanted to see if this sounds like something anyone has heard of before.

Last edited by ireadtheinternet; 11-15-2014 at 01:19 PM.