Requesting a different page in a recipe

ireadtheinternet · 10-30-2014, 07:59 AM

Some of the information in my articles is truncated, and I need to pull plot summary information from a kind of "See more.." link, so I need to request a page beyond the article so I can extract that information. What would be the recommended way to do this?

I thought about setting a recursion level in the recipe, then detecting which page I am on: extract the article if it is an article page, and extract the plot summary if it is a plot summary page. It seems I would have to create one dictionary for the articles and one for the plot summaries. That seems like a lot of work, and besides, I would like something more general purpose (what if the other URL were not linked from the original article, preventing me from using recursion?).

I thought about putting in "import requests" and using that module and then putting that into my own BeautifulSoup instance. That would be a lot more straightforward than what I just suggested, but I found the requests module is not built-in.
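
For anyone weighing the same option: a rough sketch of the "fetch it myself" idea using only the Python 3 standard library, so no `requests` import is needed. This is not what the thread ended up using, and the names `ParagraphByClass`, `extract_paragraph`, and `fetch_paragraph` are made up for illustration. It assumes the page is UTF-8 and that the wanted text sits in a single `<p>` with a known class.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ParagraphByClass(HTMLParser):
    """Collect the text of the first <p> whose class list contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.capturing = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Start capturing at the first matching paragraph only.
        if tag == 'p' and not self.done and not self.capturing:
            classes = (dict(attrs).get('class') or '').split()
            if self.wanted in classes:
                self.capturing = True

    def handle_endtag(self, tag):
        if tag == 'p' and self.capturing:
            self.capturing = False
            self.done = True

    def handle_data(self, data):
        if self.capturing:
            self.chunks.append(data)


def extract_paragraph(html, wanted_class):
    """Return the text of the first matching paragraph, or None if absent."""
    parser = ParagraphByClass(wanted_class)
    parser.feed(html)
    parser.close()
    text = ''.join(parser.chunks).strip()
    return text or None


def fetch_paragraph(url, wanted_class, fetch=None):
    """Download `url` and return the matching paragraph text, or None.

    `fetch` is a hypothetical hook (url -> html string) so the download
    step can be swapped out, e.g. for testing without network access.
    """
    if fetch is None:
        def fetch(u):
            with urlopen(u) as response:
                return response.read().decode('utf-8', errors='replace')
    return extract_paragraph(fetch(url), wanted_class)
```

Keeping the download behind a replaceable `fetch` argument means the parsing can be tried against a saved HTML string before pointing it at a live site.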

ireadtheinternet · 10-30-2014, 11:09 PM

It looks like I could do this in the preprocess_raw_html function, using the index_to_soup method there.

ireadtheinternet · 10-31-2014, 03:12 PM

Did this successfully in preprocess_html. Was shocked it worked on the first try. Will update with code when I am at my machine. For some reason, I had been under the impression that I could only use index_to_soup for my table of contents (in parse_index), but that was wrong.

ireadtheinternet · 10-31-2014, 05:37 PM

Here is what I ended up with so far, just to give an idea.

Code:

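    def preprocess_html(self, soup):
        IMDB_BASE = 'http://www.imdb.com'

        # The truncated summary holds the "See more" style link to the full plot summary.
        truncated_summary = soup.find('p', attrs={'itemprop': ['description']})
        if truncated_summary is None:
            return soup

        link_to_full_summary = truncated_summary.find('a')
        if link_to_full_summary is not None:
            # Fetch the linked page and swap its full summary in for the truncated one.
            full_summary_soup = self.index_to_soup(IMDB_BASE + link_to_full_summary['href'])
            full_plot_summary = full_summary_soup.find('p', attrs={'class': ['plotSummary']})
            if full_plot_summary is not None:
                truncated_summary.replaceWith(full_plot_summary)

        return soup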

I will post the whole recipe when done.
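
A side note on the `IMDB_BASE + link_to_full_summary['href']` concatenation above: it only works when the href is site-relative (starts with `/`). A small sketch, assuming Python 3, of resolving the link with the standard library's `urljoin` instead, which also copes with an href that is already absolute. The helper name `resolve_link` and the example path are hypothetical.

```python
from urllib.parse import urljoin

IMDB_BASE = 'http://www.imdb.com'


def resolve_link(href, base=IMDB_BASE):
    """Turn a site-relative or absolute href into an absolute URL."""
    return urljoin(base, href)


# Hypothetical hrefs, for illustration only.
relative = resolve_link('/title/tt0000000/plotsummary')
absolute = resolve_link('http://www.imdb.com/title/tt0000000/plotsummary')
```

Both calls yield the same absolute URL, so the result can be handed to index_to_soup without caring which form the page used.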

EDITED 11/4: Today I learned that the full summary also appears on the article page itself, below the truncated summary and the credits, so I took this bit of code out of the recipe since it is not needed.


