Ever heard of self.index_to_soup returning a soup with a HEAD but not a BODY?
I had my recipe almost working, and then it stopped. Either I broke it during one of my fast and furious edits (very likely), or the site was redesigned slightly, because I can't get any older versions of the recipe to work either (a few were on my Windows 7 Previous Versions tab). I am happy to take my lumps for not bothering with version control and rebuild this little recipe piece by piece, but I noticed something that seemed odd and wondered if it sounds familiar at all: the soup I get back from index_to_soup has only the HEAD and not the BODY. It was working fine until either I broke it (probably) or the IMDB site broke it.
Code:
def build_section(self, url):
    # this method is called from parse_index(self)
    articles = []
    section_toc = self.index_to_soup(url)
    # confirms this gets the page but ends with </HEAD> </HTML> no "BODY"
    print(url)
    print(section_toc.prettify())
    movies_on_page = section_toc.find(name='div', attrs={'id': 'main'})
    if movies_on_page is None:
        print('movies_on_page is None')
    # there's more to it, but it is not going to work, if the above won't
I noticed there is some funky JavaScript that seems to be outputting the body, including !DOCTYPE directives. I don't know if it was always that way, but regardless, the page still parses fine using BeautifulSoup3 outside of a calibre recipe. The following code has the expected output, but the same loop in my recipe can't find the #main div or even any <a> links:
Code:
import requests
from BeautifulSoup import BeautifulSoup  # bs3

# parentheses join the two string literals into a single URL
URL = ('http://www.imdb.com/search/title?sort=year,desc&'
       'production_status=released&title_type=feature')
r = requests.get(URL)
section_toc = BeautifulSoup(r.text)
print(section_toc.prettify())
# every link on the page
for tag in section_toc.findAll('a'):
    print(tag['href'])
# only the links inside the #main div
main = section_toc.find(name='div', attrs={'id': 'main'})
for link in main.findAll('a'):
    print(link['href'])
    print(link.string)
By the way, I commented out the keep_only_tags and remove_tags sections in my recipe to rule those out as well. No luck: my code still can't find the #main div, or any links, in the original soup.
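One check that might help narrow this down is to look at the raw HTML before it gets parsed, to see whether the BODY is missing from the download itself or is being lost by the parser. Below is a rough sketch of that idea. The helper name describe_raw_html is my own invention (it is not part of calibre), and I am assuming index_to_soup accepts raw=True to hand back the unparsed page, which I believe it does but have not confirmed against this calibre version.

```python
import re


def describe_raw_html(raw):
    """Summarise a raw HTML page (hypothetical helper, not a calibre API)."""
    # Accept bytes as well as text, since a download may come back as either.
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8', 'replace')
    return {
        'length': len(raw),
        # More than one DOCTYPE would suggest a second document embedded in the page.
        'doctype_count': len(re.findall(r'<!DOCTYPE', raw, re.IGNORECASE)),
        'has_body_open': re.search(r'<body[\s>]', raw, re.IGNORECASE) is not None,
        'has_body_close': re.search(r'</body\s*>', raw, re.IGNORECASE) is not None,
        'has_main_div': re.search(r'''id\s*=\s*["']main["']''', raw, re.IGNORECASE) is not None,
    }


# Inside the recipe this could be called roughly like this
# (assumption: index_to_soup supports raw=True):
#     print(describe_raw_html(self.index_to_soup(url, raw=True)))

# Stand-alone demonstration on a made-up page:
sample = ('<!DOCTYPE html><html><head></head><body>'
          '<div id="main"><a href="/x">x</a></div></body></html>')
print(describe_raw_html(sample))
```

If the raw page reports a body and a #main div but the soup does not, the parser is the likely suspect. If the raw page itself has no body, the download (redirect, user agent, or a site change) is where to look.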
I didn't post the whole recipe because it's about 130 lines now, and it's my responsibility to find what is probably a logic error I introduced somewhere. I just wanted to see if this sounds like something anyone has heard of before.