#1
Member
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
Some of the information in my articles is truncated, and I need to pull the plot summary from a kind of "See more.." link. That means requesting a page beyond the article itself so I can extract that information. What would be the recommended way to do this?

I thought about setting a recursion level in the recipe and then detecting which page I am on: extract the article if it is an article page, and extract the plot summary if it is a plot summary page. It seems I would have to create one dictionary for the articles and another for the plot summaries. That seems like a lot of work, and besides, I would like something more general purpose (what if the other URL was not linked from the original article, preventing me from using recursion?).

I also thought about adding "import requests", using that module to fetch the page, and feeding the result into my own BeautifulSoup instance. That would be a lot more straightforward than the recursion approach, but I found that the requests module is not built in.

Last edited by ireadtheinternet; 10-31-2014 at 07:21 AM. Reason: correction, clarity
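For the "requests is not built in" problem, one general-purpose option is Python's standard library. The following is only a minimal sketch, assuming a Python 3 based calibre; `fetch_html`, its `opener` parameter, and the User-Agent value are hypothetical and not from this thread. It also bypasses the recipe's own browser, so any cookies or logins the recipe sets up would not apply.

```python
from urllib.request import Request, urlopen


def fetch_html(url, opener=urlopen, timeout=30):
    """Download `url` and return its body decoded as text.

    `opener` is a hypothetical injection point (defaults to urlopen) so the
    function can be exercised without network access.
    """
    # Some sites reject requests without a browser-like User-Agent (assumed value).
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with opener(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')
```

The returned text could then be handed to a BeautifulSoup instance, much like the result of a requests call would be.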

#2
It looks like I could do this in the preprocess_raw_html function, using the index_to_soup method there.
Last edited by ireadtheinternet; 10-31-2014 at 07:26 AM.

#3
Did this successfully in preprocess_html. Was shocked it worked on the first try. Will update with code when I am at my machine. For some reason, I had been under the impression that I could only use index_to_soup for my table of contents (in parse_index), but that was wrong.

#4
Here is what I ended up with so far, just to give an idea.
Code:

    def preprocess_html(self, soup):
        IMDB_BASE = 'http://www.imdb.com'
        # The truncated summary ends with a link to the full plot summary page.
        truncated_summary = soup.find('p', attrs={'itemprop': ['description']})
        if truncated_summary is None:
            return soup
        link_to_full_summary = truncated_summary.find('a')
        if link_to_full_summary is not None:
            # Fetch the linked page and swap in the full summary if it is found.
            full_summary_soup = self.index_to_soup(IMDB_BASE + link_to_full_summary['href'])
            full_plot_summary = full_summary_soup.find('p', attrs={'class': ['plotSummary']})
            if full_plot_summary is not None:
                truncated_summary.replaceWith(full_plot_summary)
        return soup

EDITED: 11/4 - Today I learned that this same full summary is on the same page, below the truncated summary and the credits, so I took this bit of code out of the recipe since it is not needed.

Last edited by ireadtheinternet; 11-04-2014 at 01:03 AM.
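One fragile spot in the code above is the `IMDB_BASE + href` concatenation, which only works while the link is site-relative. A small sketch of a more tolerant alternative, assuming Python 3 (on the Python 2 builds of calibre from that era, `urljoin` lived in the `urlparse` module instead); the helper name `absolute_url` is hypothetical:

```python
from urllib.parse import urljoin

IMDB_BASE = 'http://www.imdb.com'


def absolute_url(href, base=IMDB_BASE):
    """Resolve `href` against `base`.

    Site-relative links get the base prepended; links that are already
    absolute are returned unchanged.
    """
    return urljoin(base, href)
```

With a helper like this, the fetch would become `self.index_to_soup(absolute_url(link_to_full_summary['href']))`, and the same code would keep working if the site switched to absolute links.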