11-15-2014, 12:16 PM   #1
ireadtheinternet
soup with a HEAD but no BODY?

Ever heard of self.index_to_soup returning a soup with a HEAD but not a BODY?

I had my recipe almost working, and then it stopped. Very likely I broke it during one of my fast and furious edits, but it seems the site may have been redesigned slightly too, because I can't get any older versions of the recipe to work either (a few were on my Windows 7 Previous Versions tab). I am happy to take my lumps for not bothering with version control and rebuild this little recipe piece by piece, but I noticed something that seemed odd and wondered if it sounds familiar at all: the soup I get back from index_to_soup has only the HEAD and not the BODY. It was working fine until either I broke it (probably) or the IMDB site changed.


Code:
    def build_section(self, url):
        
        # this method is called from parse_index(self)

        articles = []
        section_toc = self.index_to_soup(url) 

        # confirms the page is fetched, but the output ends with </HEAD> </HTML> and has no BODY
        print(url)
        print(section_toc.prettify())
        
        movies_on_page = section_toc.find(name='div', attrs={'id': 'main'})

        if movies_on_page is None:
            print ('movies_on_page is None')

        # there's more to it, but it is not going to work, if the above won't
I noticed there is some funky JavaScript that seems to be outputting the body, including !DOCTYPE directives. I don't know if it was always that way, but regardless, the page still parses fine using BeautifulSoup 3 outside of a calibre recipe. The following code has the expected output, but the same loop in my recipe can't find the #main div or even any <a> links:

Code:
import requests
from BeautifulSoup import BeautifulSoup #bs3

# parentheses join the two string literals into one URL
URL = ('http://www.imdb.com/search/title?sort=year,desc&'
       'production_status=released&title_type=feature')

r = requests.get(URL)
section_toc = BeautifulSoup(r.text)
print (section_toc.prettify())

# href=True skips anchors that have no href, which would otherwise raise KeyError
for tag in section_toc.findAll('a', href=True):
    print(tag['href'])

main = section_toc.find(name='div', attrs={'id': 'main'})

for link in main.findAll('a', href=True):
    print(link['href'])
    print(link.string)
By the way, I commented out the keep_only_tags and remove_tags sections in my recipe to rule those out as well. No luck: my code still can't find the #main DIV, or any links, in the original soup.
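
One way to narrow this down might be to look at the raw markup before any parser touches it, to see whether the BODY is missing from the download itself or is being dropped during parsing. Below is a minimal sketch of such a check. The helper name summarize_markup and the sample string are made up for illustration and are not part of calibre or BeautifulSoup. It assumes the markup is already a text string, so bytes would need to be decoded first.

```python
import re


def summarize_markup(raw):
    """Report which top-level pieces appear in raw HTML text.

    Hypothetical diagnostic helper: `raw` is assumed to be a str,
    not a parsed soup and not bytes.
    """
    return {
        'has_head': re.search(r'<head\b', raw, re.I) is not None,
        'has_body': re.search(r'<body\b', raw, re.I) is not None,
        # more than one DOCTYPE may hint at markup that confuses a parser
        'doctype_count': len(re.findall(r'<!doctype\b', raw, re.I)),
        'length': len(raw),
    }


# Made-up sample with a second DOCTYPE before the body
sample = (
    '<!DOCTYPE html><html><head><title>t</title></head>'
    '<!DOCTYPE html><body><div id="main"><a href="/x">x</a></div></body></html>'
)

report = summarize_markup(sample)
print(report)
```

If the raw text contains a BODY but the soup does not, the parser is the likely suspect; if the raw text has no BODY either, the fetch is returning something different from what requests gets. Inside a recipe, I believe the raw markup can be obtained with self.index_to_soup(url, raw=True), but that is worth checking against the calibre API documentation for the installed version.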


I didn't post the whole recipe because it's about 130 lines now, and it's my responsibility to find what is probably a logic error I introduced somewhere. I just wanted to see if this sounds like something anyone has heard of before.

Last edited by ireadtheinternet; 11-15-2014 at 12:19 PM.