#1 |
Member
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
|
Ever heard of self.index_to_soup returning a soup with a HEAD but not a BODY?
I had my recipe almost working, and then either I broke it during one of my fast and furious edits (very likely) or the site was redesigned slightly. I suspect the site changed too, because I can't get any older versions of the recipe to work either (a few were on my Windows 7 Previous Versions tab). I am happy to take my lumps for not bothering with version control and rebuild this little recipe piece by piece, but I noticed something that seemed odd and wondered if it sounds familiar: the soup I get back from index_to_soup has only the HEAD and not the BODY. It was working fine until either I broke it (probably) or the IMDB site broke it.
Code:
    def build_section(self, url):  # this method is called from parse_index(self)
        articles = []
        section_toc = self.index_to_soup(url)
        # confirms this gets the page but ends with </HEAD> </HTML> no "BODY"
        print(url)
        print(section_toc.prettify())
        movies_on_page = section_toc.find(name='div', attrs={'id': 'main'})
        if movies_on_page is None:
            print('movies_on_page is None')
        # there's more to it, but it is not going to work, if the above won't
Code:
    import requests
    from BeautifulSoup import BeautifulSoup  # bs3

    URL = ('http://www.imdb.com/search/title?sort=year,desc&'
           'production_status=released&title_type=feature')
    r = requests.get(URL)
    section_toc = BeautifulSoup(r.text)
    print(section_toc.prettify())
    for i, tag in enumerate(section_toc.findAll('a')):
        print(tag['href'])
    main = section_toc.find(name='div', attrs={'id': 'main'})
    for link in main.findAll('a'):
        print(link['href'])
        print(link.string)
I didn't post the whole recipe because it's about 130 lines now, and it's my responsibility to find what is probably a logic error I introduced somewhere. I just wanted to see if this sounds like something anyone has heard of before.
Last edited by ireadtheinternet; 11-15-2014 at 12:19 PM.
#2 |
Member
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
|
It's better now. It is definitely a logic error introduced somewhere in my code as I was almost finished with it. I proved this by doing the same thing I did to start with: taking "The Friday Times" recipe (a perfect simple starting point for a non-RSS recipe, by the way), modifying it, and plugging in a sample section URL. It pulls links just fine. I will start adding small pieces back into the code when I get around to it, so I will know where I broke it.
#3 |
Member
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
|
After posting this, I broke it again almost immediately. Why isn't this recipe finding the main div? This is the whole recipe. I had the line
Code:
    print toc_page.prettify()
Code:
    from calibre.web.feeds.news import BasicNewsRecipe
    import re

    keep_only_tags = [ dict(name='div', attrs={'id': ['main']}) ]

    class IMDBAdvancedTitleSearch(BasicNewsRecipe):
        title = u'IMDB Advanced Title Search'
        __author__ = 'ireadtheinternet'
        no_stylesheets = True
        no_javascript = True

        def parse_index(self):
            toc_page = self.index_to_soup('http://www.imdb.com/search/title?sort=year,desc&production_status=released&title_type=feature')
            toc = toc_page.find(name='div', attrs={'id':'main'})
            if toc is None:
                print '***toc is None***'  # ***toc is None*** prints
            articles = []
            for movie in toc.findAll('a', attrs={'href':re.compile(r'/title/tt.*')}):
                print(movie)
                title = self.tag_to_string(movie)
                url = 'http://www.imdb.com' + movie['href']
                self.log('Found article:', movie)
                self.log('\t', url)
                articles.append({'title':title, 'url':url, 'date':'', 'description':''})
            return [('Movies', articles)]
#4 |
creator of calibre
Posts: 45,337
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Most likely something in the page's markup is preventing it from being parsed properly. Check the raw markup without parsing, which you can get with
    self.index_to_soup(url, raw=True)
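For anyone following along, here is a rough sketch of the kind of check the raw markup makes possible. It assumes the markup is already in hand as a string or bytes (inside a recipe it would come from the raw=True call above). The helper name summarize_raw_markup and the sample page are hypothetical and not part of calibre.

```python
import re


def summarize_raw_markup(raw):
    """Return a few quick facts about raw HTML before any parser touches it.

    Hypothetical helper for illustration only; it is not a calibre API.
    """
    # Assumption: the raw page may arrive as bytes, so decode defensively.
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8', 'replace')
    return {
        'length': len(raw),
        # Is there a <body> tag in the source at all?
        'has_body_tag': re.search(r'<body\b', raw, re.IGNORECASE) is not None,
        # How many <script> blocks could be confusing the parser?
        'script_blocks': len(re.findall(r'<script\b', raw, re.IGNORECASE)),
        # Does the div the recipe looks for exist in the source?
        'has_main_div': re.search(
            r'''<div\b[^>]*\sid\s*=\s*["']?main\b''', raw, re.IGNORECASE
        ) is not None,
    }


# Made-up sample page: the script contains closing tags inside a string.
sample = (
    '<html><head><script>var s = "</head></html>";</script></head>'
    '<body><div id="main"><a href="/title/tt0000001/">Example</a></div>'
    '</body></html>'
)
for key, value in sorted(summarize_raw_markup(sample).items()):
    print(key, value)
```

If has_body_tag and has_main_div are both true for the raw page while the parsed soup lacks a BODY, the page itself is intact and the parser is the place to look.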
#5 |
Member
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
|
Thanks as always, Kovid! This helped.
It worked when I changed the first lines of parse_index to
Code:
    def parse_index(self):
        toc_page_raw = self.index_to_soup('http://www.imdb.com/search/title?sort=year,desc&production_status=released&title_type=feature', raw=True)
        toc_page_raw = re.sub(r'<script\b.+?</script>', '', toc_page_raw, flags=re.DOTALL|re.IGNORECASE)
        toc_page = self.index_to_soup(toc_page_raw)
        toc = toc_page.find(name='div', attrs={'id':'main'})
        ...
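The script-stripping step can also be tried outside calibre. This is a minimal sketch, assuming the markup is available as text or bytes. The regex is the same one used in the change above, while strip_scripts and the sample page are made up for illustration. Why stripping helps here is a guess on my part: closing tags inside a script string may lead a lenient parser to end the document early.

```python
import re

# Same pattern as in the recipe change: DOTALL lets "." span newlines,
# IGNORECASE also catches <SCRIPT> written in upper case.
SCRIPT_RE = re.compile(r'<script\b.+?</script>', flags=re.DOTALL | re.IGNORECASE)


def strip_scripts(raw_html):
    """Remove every <script>...</script> block from a markup string.

    Hypothetical helper for illustration only; it is not a calibre API.
    """
    # Assumption: raw markup may arrive as bytes, so decode before matching.
    if isinstance(raw_html, bytes):
        raw_html = raw_html.decode('utf-8', 'replace')
    return SCRIPT_RE.sub('', raw_html)


# Made-up sample page with a script that writes closing tags.
sample = (
    '<html><head><title>t</title>'
    '<SCRIPT type="text/javascript">\n'
    'document.write("</head></html>");\n'
    '</SCRIPT>'
    '</head><body><div id="main">'
    '<a href="/title/tt0000001/">Example</a></div></body></html>'
)
cleaned = strip_scripts(sample)
print(cleaned)
```

The non-greedy `.+?` matters: with a greedy match, everything between the first opening script tag and the last closing one would be removed, including any page content in between.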