Old 11-15-2014, 12:16 PM   #1
ireadtheinternet
Member
ireadtheinternet began at the beginning.
 
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
soup with a HEAD but no BODY?

Ever heard of self.index_to_soup returning a soup with a HEAD but not a BODY?

I had my recipe almost working, and then it stopped. Either I broke it during one of my fast and furious edits (very likely), or the site was redesigned slightly, because I can't get any older versions of the recipe to work either (a few were on my Windows 7 Previous Versions tab). I am happy to take my lumps for not bothering with version control and rebuild this little recipe piece by piece, but I noticed something that seemed odd and wondered if it sounds familiar at all: the soup I get back from index_to_soup has only the HEAD and not the BODY. It was working fine until either I broke it (probably) or the IMDB site broke it.


Code:
    def build_section(self, url):
        # Called from parse_index(self)
        articles = []
        section_toc = self.index_to_soup(url)

        # Confirms the page is fetched, but the output ends with
        # </HEAD> </HTML> and contains no BODY
        print(url)
        print(section_toc.prettify())

        movies_on_page = section_toc.find(name='div', attrs={'id': 'main'})

        if movies_on_page is None:
            print('movies_on_page is None')

        # there's more to it, but it is not going to work if the above won't
I noticed there is some funky JavaScript that seems to be outputting the body, including !DOCTYPE directives. I don't know if it was always that way, but regardless, the page still parses fine using BeautifulSoup 3 outside of a calibre recipe. The following code produces the expected output, but the same loop in my recipe can't find the #main div or even any <a> links:

Code:
import requests
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

# Parentheses are needed so the two string literals are joined into one URL
URL = ('http://www.imdb.com/search/title?sort=year,desc&'
       'production_status=released&title_type=feature')

r = requests.get(URL)
section_toc = BeautifulSoup(r.text)
print(section_toc.prettify())

# Every link on the page; href=True skips anchors without an href
for tag in section_toc.findAll('a', href=True):
    print(tag['href'])

main = section_toc.find(name='div', attrs={'id': 'main'})

# Only the links inside the #main div
for link in main.findAll('a', href=True):
    print(link['href'])
    print(link.string)
By the way, I commented out the keep_only_tags and remove_tags sections in my recipe to rule those out as well. No luck: my code still can't find the #main DIV, or any links, in the original soup.


I didn't post the whole recipe because it's about 130 lines now, and it's my responsibility to find what is probably a logic error I introduced somewhere. I just wanted to see if this sounds like something anyone has heard of before.

Last edited by ireadtheinternet; 11-15-2014 at 12:19 PM.
Old 11-15-2014, 01:47 PM   #2
ireadtheinternet
It's better now. It is definitely a logic error I introduced somewhere in my code when I was almost finished with it. I proved this by doing the same thing I did to start with: taking "The Friday Times" recipe (a perfect simple starting point for a non-RSS recipe, btw), modifying it, and plugging in a sample section URL. It pulls links just fine. I will add small pieces back into the code when I get around to it, so I will know where I broke it.
Old 11-26-2014, 06:52 AM   #3
ireadtheinternet

After posting this, I broke it again almost immediately. Why isn't this recipe finding the main div? This is the whole recipe. I had the line
Code:
 print toc_page.prettify()
in it before and the soup only seems to have a HEAD but no BODY.

Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re


class IMDBAdvancedTitleSearch(BasicNewsRecipe):
    title = u'IMDB Advanced Title Search'
    __author__ = 'ireadtheinternet'
    no_stylesheets = True
    no_javascript = True

    # Recipe options only take effect as class attributes
    keep_only_tags = [
        dict(name='div', attrs={'id': ['main']})
    ]

    def parse_index(self):
        toc_page = self.index_to_soup('http://www.imdb.com/search/title?sort=year,desc&production_status=released&title_type=feature')
        toc = toc_page.find(name='div', attrs={'id': 'main'})
        if toc is None:
            # This is the branch that runs: '***toc is None***' prints
            print('***toc is None***')
            return []
        articles = []
        for movie in toc.findAll('a', attrs={'href': re.compile(r'/title/tt.*')}):
            print(movie)
            title = self.tag_to_string(movie)
            url = 'http://www.imdb.com' + movie['href']
            self.log('Found article:', movie)
            self.log('\t', url)
            articles.append({'title': title, 'url': url, 'date': '',
                             'description': ''})

        return [('Movies', articles)]
Old 11-26-2014, 08:32 AM   #4
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Most likely something in the page's markup is preventing it from being parsed properly. Check the raw markup without parsing, which you can get with

self.index_to_soup(url, raw=True)
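
A rough sketch of what checking the raw markup could look like, under some assumptions: `summarize_markup` and the sample string are made up for illustration and are not part of calibre. In a recipe the input would be whatever `self.index_to_soup(url, raw=True)` returns, decoded to text if it arrives as bytes.

```python
import re


def summarize_markup(raw):
    """Report a few facts about raw HTML text before it reaches a parser.

    Hypothetical helper for illustration only; not a calibre API.
    Assumes `raw` is a text string (decode bytes first).
    """
    scripts = re.findall(r'<script\b.*?</script>', raw,
                         flags=re.DOTALL | re.IGNORECASE)
    return {
        # Is a <body> tag present anywhere in the download?
        'has_body': re.search(r'<body\b', raw, flags=re.IGNORECASE) is not None,
        'script_count': len(scripts),
        # Scripts that embed a DOCTYPE of their own are worth a closer look
        'scripts_with_doctype': sum(
            1 for s in scripts if '<!doctype' in s.lower()),
    }


# Made-up sample standing in for a downloaded page
sample = (
    '<html><head><title>t</title>'
    '<script>document.write("<!DOCTYPE html><p>x</p>");</script>'
    '</head><body><div id="main">'
    '<a href="/title/tt0000001/">A</a></div></body></html>'
)

print(summarize_markup(sample))
```

If `has_body` is true for the raw download while the parsed soup stops at the HEAD, the problem is more likely in how the markup is parsed than in what the server sent.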
Old 11-28-2014, 12:44 AM   #5
ireadtheinternet
Thanks as always, Kovid! This helped.

It worked when I changed the first lines of parse_index to
Code:
    def parse_index(self):
        toc_page_raw = self.index_to_soup('http://www.imdb.com/search/title?sort=year,desc&production_status=released&title_type=feature', raw=True)
        toc_page_raw = re.sub(r'<script\b.+?</script>', '', toc_page_raw, flags=re.DOTALL|re.IGNORECASE)
        toc_page = self.index_to_soup(toc_page_raw)
        toc = toc_page.find(name='div', attrs={'id':'main'})  
        ...
Now to merge this with my original..
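
For anyone who wants to try the script-stripping step outside calibre, here is a minimal standalone sketch using the same regular expression as above. The `strip_scripts` name and the sample markup are made up for illustration; in the recipe the input is the raw page from `index_to_soup(..., raw=True)`.

```python
import re

# Same pattern as in the recipe: everything from an opening <script ...>
# tag through the nearest closing </script>, across line breaks
SCRIPT_RE = re.compile(r'<script\b.+?</script>', re.DOTALL | re.IGNORECASE)


def strip_scripts(raw):
    """Return the markup with all <script> blocks removed.

    Hypothetical helper for illustration; assumes `raw` is a text string.
    """
    return SCRIPT_RE.sub('', raw)


# Made-up sample: a script that writes a DOCTYPE, similar in spirit
# to what was described for the real page
sample = (
    '<html><head><SCRIPT type="text/javascript">\n'
    'document.write("<!DOCTYPE html>");\n'
    '</SCRIPT></head><body><div id="main">ok</div></body></html>'
)

cleaned = strip_scripts(sample)
print(cleaned)
```

A regular expression is a blunt tool for HTML, but it fits here because the cleanup has to happen before the markup reaches the parser that is being confused by it.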
All times are GMT -4.