
Old 10-30-2014, 07:59 AM   #1
ireadtheinternet
Member
ireadtheinternet began at the beginning.
 
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
Some of the information in my articles is truncated, and the full plot summary sits behind a "See more..." style link. To extract it, I need to request a page beyond the article itself. What is the recommended way to do this?

I thought about setting a recursion level in the recipe, detecting which kind of page I am on, and then extracting the article if it is an article page or the plot summary if it is a plot summary page. It seems I would have to keep a dictionary for the articles and another for the plot summaries. That is a lot of work, and I would prefer something more general purpose: what if the other URL were not linked from the original article, so that recursion could not reach it?
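
To give an idea of the bookkeeping that approach would need, here is a rough sketch in plain Python, outside of calibre. The URL shapes, the regexes, and the names classify and collect are all assumptions made up for illustration, and it merges both kinds of page into one dictionary keyed by article ID rather than keeping two.

```python
import re

# Assumed URL shapes (not confirmed by the site): an article page looks like
# /title/<id>/ and its plot summary page like /title/<id>/plotsummary.
ARTICLE_RE = re.compile(r'/title/(?P<id>tt\d+)/?$')
SUMMARY_RE = re.compile(r'/title/(?P<id>tt\d+)/plotsummary/?$')


def classify(url):
    """Return (kind, article_id); kind is 'article', 'summary', or None."""
    path = url.split('?', 1)[0]  # ignore any query string
    m = SUMMARY_RE.search(path)
    if m:
        return 'summary', m.group('id')
    m = ARTICLE_RE.search(path)
    if m:
        return 'article', m.group('id')
    return None, None


def collect(pages):
    """Group (url, text) pairs into {article_id: {'article': ..., 'summary': ...}}."""
    merged = {}
    for url, text in pages:
        kind, key = classify(url)
        if kind is None:
            continue  # neither an article page nor a summary page
        merged.setdefault(key, {})[kind] = text
    return merged
```

Even in this small form it still depends on the summary page being reachable by following links, which is the limitation mentioned above.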

I also thought about adding "import requests", fetching the page with that module, and feeding the result into my own BeautifulSoup instance. That would be much more straightforward than the recursion idea, but I found that the requests module is not built in.

Last edited by ireadtheinternet; 10-31-2014 at 07:21 AM. Reason: correction, clarity
Old 10-30-2014, 11:09 PM   #2
ireadtheinternet
It looks like I could do this by overriding the preprocess_raw_html function and calling the index_to_soup method from there.
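
A rough sketch of that idea, written so it runs without calibre installed. preprocess_raw_html(raw_html, url) is the real hook name, but everything else here is an assumption for illustration: the class SummaryRecipeSketch stands in for a BasicNewsRecipe subclass, fetch_raw is a hypothetical helper marking the spot where index_to_soup would fetch the second page, and the regexes are invented markup patterns used only to keep the example dependency-free.

```python
import re

# Hypothetical patterns for illustration; a real recipe would match the
# actual markup of the site being fetched.
MORE_LINK = re.compile(r'<a[^>]+href="([^"]+)"[^>]*>\s*See more')
FULL_TEXT = re.compile(r'<p class="plotSummary">(.*?)</p>', re.S)
TRUNCATED = re.compile(r'<p itemprop="description">.*?</p>', re.S)


class SummaryRecipeSketch(object):
    """Stand-in for a BasicNewsRecipe subclass, so the logic runs without calibre."""

    BASE = 'http://www.imdb.com'

    def fetch_raw(self, url):
        # Hypothetical helper. Inside calibre this is where index_to_soup
        # (or the recipe's browser) would fetch the second page.
        raise NotImplementedError

    def preprocess_raw_html(self, raw_html, url):
        link = MORE_LINK.search(raw_html)
        if link is None:
            return raw_html  # nothing truncated on this page
        full = FULL_TEXT.search(self.fetch_raw(self.BASE + link.group(1)))
        if full is None:
            return raw_html  # second page did not contain the summary
        replacement = '<p itemprop="description">%s</p>' % full.group(1)
        # Use a function so backslashes in the summary are not treated as escapes.
        return TRUNCATED.sub(lambda m: replacement, raw_html, count=1)
```

Both early returns hand back the page unchanged, so an article without a "See more" link passes through untouched.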

Last edited by ireadtheinternet; 10-31-2014 at 07:26 AM.
Old 10-31-2014, 03:12 PM   #3
ireadtheinternet
Did this successfully in preprocess_html, and was shocked that it worked on the first try. I will update with code when I am at my machine. For some reason I had been under the impression that index_to_soup could only be used for the table of contents (in parse_index), but that was wrong.
Old 10-31-2014, 05:37 PM   #4
ireadtheinternet
Here is what I ended up with so far, just to give an idea.

Code:
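    def preprocess_html(self, soup):
        IMDB_BASE = 'http://www.imdb.com'

        # The truncated summary links to the page that holds the full text.
        truncated_summary = soup.find('p', attrs={'itemprop': ['description']})
        if truncated_summary is None:
            return soup

        link_to_full_summary = truncated_summary.find('a')
        if link_to_full_summary is not None:
            # index_to_soup fetches the linked page and returns it parsed.
            full_summary_soup = self.index_to_soup(IMDB_BASE + link_to_full_summary['href'])
            full_plot_summary = full_summary_soup.find('p', attrs={'class': ['plotSummary']})
            # Only swap in the full text if it was actually found.
            if full_plot_summary is not None:
                truncated_summary.replaceWith(full_plot_summary)

        return soup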
I will post the whole recipe when done.

EDITED: 11/4 - Today I learned that the full summary also appears on the article page itself, below the truncated summary and the credits, so I took this bit of code out of the recipe since it is not needed.

Last edited by ireadtheinternet; 11-04-2014 at 01:03 AM.