soup with a HEAD but no BODY?

ireadtheinternet · 11-15-2014, 12:16 PM

Ever heard of self.index_to_soup returning a soup with a HEAD but not a BODY?

I had my recipe almost working, and then either I broke it during one of my fast and furious edits (very likely) or the site was redesigned slightly, because I can't get any older versions of the recipe to work either (a few were on my Windows 7 Previous Versions tab). I am happy to take my lumps for not bothering with version control and rebuild this little recipe piece by piece, but I noticed something that seemed odd and wondered whether it sounds familiar. I seem to be getting a soup back from index_to_soup that has only the HEAD and not the BODY. It was working fine until either I broke it (probably) or the IMDB site broke it.

Code:

    def build_section(self, url):
        
        # this method is called from parse_index(self)

        articles = []
        section_toc = self.index_to_soup(url) 

        # confirms this gets the page but ends with </HEAD> </HTML> no "BODY"
        print(url)
        print (section_toc.prettify())
        
        movies_on_page = section_toc.find(name='div', attrs={'id': 'main'})

        if movies_on_page is None:
            print ('movies_on_page is None')

        # there's more to it, but it is not going to work, if the above won't

I noticed there is some funky JavaScript that seems to be outputting the body, including !DOCTYPE directives. I don't know if it was always that way, but regardless, the page still parses fine using BeautifulSoup3 outside of a calibre recipe. The following code has the expected output, but the same loop in my recipe can't find the #main div or even any <a> links:

Code:

import requests
from BeautifulSoup import BeautifulSoup #bs3

# Parentheses are needed so the two string literals are joined into one URL
URL = ('http://www.imdb.com/search/title?sort=year,desc&'
       'production_status=released&title_type=feature')

r = requests.get(URL)
section_toc = BeautifulSoup(r.text)
print (section_toc.prettify())

for i, tag in enumerate(section_toc.findAll('a')):
    print(tag['href'])

main = section_toc.find(name='div', attrs={'id': 'main'})

for link in main.findAll('a'):
    print(link['href'])
    print(link.string)

By the way, I commented out the keep_only_tags and remove_tags sections in my recipe to rule those out as well. No luck: my code still can't find the #main DIV or any links in the original soup.

I didn't post the whole recipe because it's about 130 lines now, and it's my responsibility to find what is probably a logic error I introduced somewhere. I just wanted to see if this sounds like something anyone has heard of before.

ireadtheinternet · 11-15-2014, 01:47 PM

It's better now. It is definitely a logic error introduced somewhere in my code as I was almost finished with it. I proved this by doing the same thing I did to start with: taking "The Friday Times" recipe and modifying it (a perfect simple recipe for starting a non-RSS recipe, btw) and plugging in a sample section URL, and it pulls links just fine. I will start adding small pieces back into the code when I get around to it, so I will know where I broke it.

ireadtheinternet · 11-26-2014, 06:52 AM

After posting this, I broke it again almost immediately. Why isn't this recipe finding the main div? This is the whole recipe. I had the line

Code:

 print toc_page.prettify()

in it before and the soup only seems to have a HEAD but no BODY.

Code:

from calibre.web.feeds.news import BasicNewsRecipe
import re

keep_only_tags = [
    dict(name='div', attrs={'id': ['main']})
]

class IMDBAdvancedTitleSearch(BasicNewsRecipe):
    title          = u'IMDB Advanced Title Search'
    __author__         = 'ireadtheinternet'
    no_stylesheets = True
    no_javascript = True

    def parse_index(self):
        toc_page = self.index_to_soup('http://www.imdb.com/search/title?sort=year,desc&production_status=released&title_type=feature')
        toc = toc_page.find(name='div', attrs={'id':'main'})
        if toc is None:
            print '***toc is None***'
        # ***toc is None*** prints
        articles = []
        for movie in toc.findAll('a', attrs={'href':re.compile(r'/title/tt.*')}):
            print(movie)
            title = self.tag_to_string(movie)
            url = 'http://www.imdb.com' + movie['href']
            self.log('Found article:', movie)
            self.log('\t', url)
            articles.append({'title':title, 'url':url, 'date':'',
            'description':''})

        return [('Movies', articles)]

kovidgoyal · 11-26-2014, 08:32 AM

Most likely something in the page's markup is preventing it from being parsed properly. Check the raw markup without parsing, which you can get with

self.index_to_soup(url, raw=True)
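
A minimal sketch of what that check might look like once the raw markup is in hand. The helper name describe_raw_markup and the sample page are made up for illustration and are not part of calibre; the idea is only to see whether a BODY tag is present in the raw text and how many script blocks and DOCTYPE declarations it contains.

```python
import re


def describe_raw_markup(raw):
    """Return a few quick facts about raw HTML before it is parsed.

    Hypothetical helper, not a calibre API.
    """
    # Raw markup may arrive as bytes; decode leniently for inspection only.
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8', 'replace')
    return {
        'length': len(raw),
        'has_body_tag': re.search(r'<body\b', raw, re.IGNORECASE) is not None,
        'script_blocks': len(re.findall(r'<script\b', raw, re.IGNORECASE)),
        'doctype_count': len(re.findall(r'<!DOCTYPE\b', raw, re.IGNORECASE)),
    }


# Made-up page: a script in the HEAD writes out a second DOCTYPE.
sample = (
    '<!DOCTYPE html><html><head><title>t</title>'
    '<script>document.write("<!DOCTYPE html><p>x</p>");</script>'
    '</head><body><div id="main">'
    '<a href="/title/tt0000001/">A</a></div></body></html>'
)
info = describe_raw_markup(sample)
```

Inside a recipe, the raw argument would presumably come from self.index_to_soup(url, raw=True) and the result could be passed to self.log. If the BODY tag is in the raw text but missing from the soup, the parser is the likely culprit rather than the download.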

ireadtheinternet · 11-28-2014, 12:44 AM

Thanks as always, Kovid! This helped.

It worked when I changed the first lines of parse_index to

Code:

    def parse_index(self):
        toc_page_raw = self.index_to_soup('http://www.imdb.com/search/title?sort=year,desc&production_status=released&title_type=feature', raw=True)
        toc_page_raw = re.sub(r'<script\b.+?</script>', '', toc_page_raw, flags=re.DOTALL|re.IGNORECASE)
        toc_page = self.index_to_soup(toc_page_raw)
        toc = toc_page.find(name='div', attrs={'id':'main'})  
        ...
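
For anyone who wants to try the regex outside a recipe, here is a standalone sketch on a made-up page. The strip_scripts name and the sample markup are assumptions for illustration only; the pattern and flags are the same ones used above.

```python
import re

# Same pattern as in the recipe: non-greedy, spans newlines, any letter case.
SCRIPT_RE = re.compile(r'<script\b.+?</script>', flags=re.DOTALL | re.IGNORECASE)


def strip_scripts(raw):
    """Remove every <script>...</script> block from a text string."""
    return SCRIPT_RE.sub('', raw)


# Made-up page with a multi-line, upper-case script that writes a DOCTYPE.
sample = (
    '<!DOCTYPE html>\n<html><head><title>Search</title>\n'
    '<SCRIPT type="text/javascript">\n'
    'document.write("<!DOCTYPE html><html><body>ad</body></html>");\n'
    '</SCRIPT>\n</head>\n'
    '<body><div id="main"><a href="/title/tt0000001/">Example</a></div>\n'
    '<script>var x = 1;</script></body></html>'
)
cleaned = strip_scripts(sample)
```

This sketch assumes the markup is a text string. If the raw markup comes back as bytes, it would need to be decoded first (or the pattern written as a bytes pattern) before calling re.sub.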

Now to merge this with my original...


