#1
Member
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
Ever heard of self.index_to_soup returning a soup with a HEAD but not a BODY?
I had my recipe almost working, and then either I broke it during one of my fast and furious edits (very likely) or the site was redesigned slightly, because I can't get any older versions of the recipe to work either (a few were on my Windows 7 Previous Versions tab). I am happy to take my lumps for not bothering with version control and rebuild this little recipe piece by piece, but I noticed something that seemed odd and wondered if it sounds familiar at all: I seem to be getting a soup back from index_to_soup that has only the HEAD and not the BODY. It was working fine until either I broke it (probably) or the IMDB site broke it.
Code:
def build_section(self, url):
    # this method is called from parse_index(self)
    articles = []
    section_toc = self.index_to_soup(url)
    # confirms this gets the page but ends with </HEAD> </HTML> no "BODY"
    print(url)
    print(section_toc.prettify())
    movies_on_page = section_toc.find(name='div', attrs={'id': 'main'})
    if movies_on_page is None:
        print('movies_on_page is None')
    # there's more to it, but it is not going to work, if the above won't
Code:
import requests
from BeautifulSoup import BeautifulSoup  # bs3
# parentheses join the two string literals into a single URL
URL = ('http://www.imdb.com/search/title?sort=year,desc&'
       'production_status=released&title_type=feature')
r = requests.get(URL)
section_toc = BeautifulSoup(r.text)
print(section_toc.prettify())
# every link on the page (.get avoids a KeyError on anchors without href)
for tag in section_toc.findAll('a'):
    print(tag.get('href'))
# only the links inside the main div
main = section_toc.find(name='div', attrs={'id': 'main'})
for link in main.findAll('a'):
    print(link.get('href'))
    print(link.string)
I didn't post the whole recipe because it's about 130 lines now and it's my responsibility to find what is probably a logic error I introduced somewhere. I just wanted to see if this sounds like something anyone has heard of before.
Last edited by ireadtheinternet; 11-15-2014 at 01:19 PM.
#2
It's better now. It is definitely a logic error I introduced somewhere in my code as I was almost finished with it. I proved this by doing the same thing I did to start with: taking "The Friday Times" recipe (a perfect simple starting point for a non-RSS recipe, btw), modifying it, and plugging in a sample section URL. It pulls links just fine. I will start adding small pieces back into the code when I get around to it, so I will know where I broke it.
#3
After posting this, I broke it again almost immediately. Why isn't this recipe finding the main div? This is the whole recipe. I had the line
Code:
print toc_page.prettify()
Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re
# as posted, this list sits outside the class; recipe options are normally class attributes
keep_only_tags = [
    dict(name='div', attrs={'id': ['main']})
]
class IMDBAdvancedTitleSearch(BasicNewsRecipe):
    title = u'IMDB Advanced Title Search'
    __author__ = 'ireadtheinternet'
    no_stylesheets = True
    no_javascript = True
    def parse_index(self):
        toc_page = self.index_to_soup('http://www.imdb.com/search/title?sort=year,desc&production_status=released&title_type=feature')
        toc = toc_page.find(name='div', attrs={'id':'main'})
        if toc is None:
            print('***toc is None***')
            # ***toc is None*** prints
        articles = []
        for movie in toc.findAll('a', attrs={'href':re.compile(r'/title/tt.*')}):
            print(movie)
            title = self.tag_to_string(movie)
            url = 'http://www.imdb.com' + movie['href']
            self.log('Found article:', movie)
            self.log('\t', url)
            articles.append({'title':title, 'url':url, 'date':'',
                             'description':''})
        return [('Movies', articles)]
#4
creator of calibre
Posts: 45,634
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Most likely something in the page's markup is preventing it from being parsed properly. Check the raw markup without parsing, which you can get with
self.index_to_soup(url, raw=True)
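For anyone following along, here is a rough sketch of what that check might look like once the raw markup is in hand. It assumes the markup is a string or bytes object (inside a recipe it would come from the raw=True call above); the helper name summarize_markup and the sample page are made up for illustration and are not part of calibre.

```python
import re

def summarize_markup(raw):
    """Report whether key tags appear in the unparsed markup (hypothetical helper)."""
    if isinstance(raw, bytes):
        # raw markup may arrive as bytes; decode leniently for inspection only
        raw = raw.decode('utf-8', 'replace')
    return {
        'has_body': re.search(r'<body\b', raw, re.IGNORECASE) is not None,
        'has_main_div': re.search(r'<div\b[^>]*\bid=["\']main["\']', raw,
                                  re.IGNORECASE) is not None,
        'script_count': len(re.findall(r'<script\b', raw, re.IGNORECASE)),
    }

# stand-in for a downloaded page; a real recipe would pass the raw download instead
sample = ('<html><head><script>var a = 1;</script></head>'
          '<body><div id="main">x</div></body></html>')
report = summarize_markup(sample)
print(report)
```

If has_body and has_main_div are both true in the raw text but the parsed soup lacks them, the download is fine and the parser is the part that is choking.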
#5
Thanks as always, Kovid! This helped.
It worked when I changed the first lines of parse_index to:
Code:
def parse_index(self):
    toc_page_raw = self.index_to_soup('http://www.imdb.com/search/title?sort=year,desc&production_status=released&title_type=feature', raw=True)
    # strip every <script>...</script> block before handing the markup to the parser
    toc_page_raw = re.sub(r'<script\b.+?</script>', '', toc_page_raw, flags=re.DOTALL|re.IGNORECASE)
    toc_page = self.index_to_soup(toc_page_raw)
    toc = toc_page.find(name='div', attrs={'id':'main'})
    ...
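The script-stripping step can be tried outside calibre. This is a minimal sketch using the same regular expression on an invented page; the sample markup and the strip_scripts name are assumptions for illustration. Presumably the real page had a script whose contents looked like markup to the parser, which is the situation the sample imitates.

```python
import re

# same pattern as in the recipe: non-greedy, spans newlines, ignores tag case
SCRIPT_RE = re.compile(r'<script\b.+?</script>', flags=re.DOTALL | re.IGNORECASE)

def strip_scripts(markup):
    """Remove every <script>...</script> block from a markup string."""
    return SCRIPT_RE.sub('', markup)

# invented sample: a script that writes tag-like text, followed by the real body
sample = (
    '<html><head><SCRIPT type="text/javascript">\n'
    'document.write("<div>");\n'
    '</SCRIPT></head>'
    '<body><div id="main"><a href="/title/tt0000001/">A</a></div></body></html>'
)
cleaned = strip_scripts(sample)
print(cleaned)
```

The non-greedy .+? matters here: a greedy match would swallow everything between the first opening script tag and the last closing one, including any real content in between.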