#2551
Junior Member
Posts: 6
Karma: 10
Join Date: Aug 2010
Device: Kindle 3
Ok, so it appears that fixing the duplication issue that I received help with in this thread has also resolved the issue of some items not showing up. All the relevant items now appear, so if anyone is having issues, please see this thread:
https://www.mobileread.com/forums/showthread.php?t=96351
Last edited by gk_jam; 08-29-2010 at 02:32 PM.
#2552
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: Kindle 2, Kindle 3, Kindle Fire
Find URL where text =
If there is documentation on this, I wasn't able to find it, so could someone help me out, please? I want to parse a website that doesn't have an RSS feed, but it has a link under each article to read the full article.
Code:
<a href="/blogs/hunting/2010/08/guest-blog-5-reasons-plant-food-plots-now">Read Full Post</a>
My thoughts were something along the lines of: Spoiler:
So if I have links:
Code:
1. <a href="/blogs/test1">Read Full Post </a>
2. <a href="/blogs/test2">Read Full Post </a>
Thanks for the help.
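One way to do this with calibre's bundled BeautifulSoup is to match on the visible link text. A minimal sketch (the base URL prefix is an assumption about how the relative links should be resolved):
Code:
from calibre.ebooks.BeautifulSoup import BeautifulSoup

html = '''<a href="/blogs/test1">Read Full Post </a>
<a href="/blogs/test2">Read Full Post </a>'''

soup = BeautifulSoup(html)
links = []
for a in soup.findAll('a', href=True):
    # match on the link text, ignoring the stray trailing space
    if a.string and a.string.strip() == 'Read Full Post':
        links.append('http://www.fieldandstream.com' + a['href'])

print(links)  # ['http://www.fieldandstream.com/blogs/test1', 'http://www.fieldandstream.com/blogs/test2']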
#2553
US Navy, Retired
Posts: 9,897
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
Quote:
I think I see what you're looking at now. You might want to start with the RSS feed.
Last edited by DoctorOhh; 08-29-2010 at 02:48 AM.
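For comparison, if a working RSS feed does exist, the recipe only needs a feeds list and no parse_index at all. A minimal sketch (the recipe name and the feedburner address below are placeholders, not confirmed values):
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class FieldAndStreamFeeds(BasicNewsRecipe):  # hypothetical recipe name
    title = u'Field and Stream (RSS)'
    oldest_article = 7
    max_articles_per_feed = 25

    # (feed title, feed URL) pairs; the URL here is a placeholder
    feeds = [
        (u'The Wild Chef', u'http://feeds.feedburner.com/placeholder-wild-chef'),
    ]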
#2554
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: Kindle 2, Kindle 3, Kindle Fire
Quote:
The reason I'd like to know is that even on this page not all of the blogs have feeds. More specifically, have a look at http://www.fieldandstream.com/blogs and notice that "The Wild Chef" takes you to feeds.feedburner.com and nothing else. And the recipe blog was one of the main ones I wanted, haha, because a man's gotta eat.
Last edited by TonytheBookworm; 08-29-2010 at 01:10 PM. Reason: added more info
#2555
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
Quote:
If all links were to RSS feeds, you would use this page to manually get the feed links for your recipe, then the recipe would do all the work thereafter.

Let's assume there are no RSS feeds. Then you would normally manually get all the other links from that page (and the title of the feed), and store them in a manually created dictionary of feed title and URL in your recipe. Each URL would be fed into parse_index. Each time one of those URLs was fed into parse_index, it would parse the page, find all article links, and build a feed structure for the matching feed title/URL that would then be appended to the feed list and passed back into the recipe.

How you build the feed structure depends on the pages, but basically you need:
'title' : article title,
'url' : URL of the article,
'date' : the publication date of the article as a string,
'description' : a summary of the article

I suggest you search the recipes for "parse_index". There are dozens of examples of how this is done.
Last edited by Starson17; 08-29-2010 at 02:49 PM.
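A minimal sketch of what that return structure looks like inside parse_index (the titles and URLs here are placeholders; only the dictionary keys and the list-of-tuples shape are what calibre expects):
Code:
def parse_index(self):
    feeds = []
    # ... for each (feed title, page URL) pair you collected manually ...
    articles = [{
        'title': u'Some article headline',              # article title
        'url': u'http://www.example.com/some-article',  # URL of the article (placeholder)
        'date': u'',                                    # publication date as a string
        'description': u'',                             # a summary of the article
    }]
    feeds.append((u'Some feed title', articles))
    return feeds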
#2556
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: Kindle 2, Kindle 3, Kindle Fire
Quote:
Spoiler:
In the example they used Code:
sectit = soup.find('h1', attrs={'class':'sectionTitle'})
but in my case I only have an href inside the h2 tags.
Last edited by TonytheBookworm; 08-29-2010 at 04:37 PM.
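If the headline link really does sit inside each h2, something along these lines should pull it out (an untested sketch; it assumes it runs inside the recipe's parse_index, and the base URL prepend is an assumption):
Code:
# 'soup' is assumed to come from self.index_to_soup(page_url)
for h2 in soup.findAll('h2'):
    a = h2.find('a', href=True)
    if a is None:
        continue
    article_title = self.tag_to_string(a)                       # the text inside the <a>
    article_url = 'http://www.fieldandstream.com' + a['href']   # href is site-relative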
#2557
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
Quote:
Let me refer to my GoComics recipe, as I'm more familiar with it. Spoiler:
Above are the pairs of a title for a feed and a URL to scrape for articles. You would stick this in: Code:
(u"Wild Chef", u"http://www.fieldandstream.com/blogs/wild-chef"), You'd start with getting a soup for the url: soup = self.index_to_soup(url) then start scraping out the article urls and titles, etc. As you said, you have "href inside the h2 tags" the article title is really the string (NavigableString) inside the <a> tag. The url is the href atribute of the <a> tag (with a base URL stuck in front), and the summary is there too. All of those are easily obtained using Beautiful Soup from the soup of the url given above. Scrape the url, build your article list for that feed, then it gets returned to parse_index and the next feed gets processed, etc. I'm glad to see you working on a recipe (calibre-type) of recipes (food-type) - they're my favorite ![]() Last edited by Starson17; 08-29-2010 at 09:02 PM. |
#2558
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
Quote:
#2559
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: Kindle 2, Kindle 3, Kindle Fire
Quote:
Thanks for the information you posted. I will try and read some more. Oh, one more thing: when testing this thing to make sure it works, without having to load it into calibre and run and wait, do I use the same test command string you provided me with previously? Again, I can't thank you and the others enough for helping me out on this. It is actually kind of fun. Frustrating at times when I run into something that I don't understand, but other than that it is pretty fun to do. And I will more than likely be asking more questions, but once I get it I hope to help instead of ask.
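For reference, the usual way to test a recipe from the command line without loading it into the calibre GUI is roughly this (the recipe filename is whatever you saved yours as; --test limits the download to a couple of articles per feed):
Code:
ebook-convert my_recipe.recipe .epub --test -vv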
#2560
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
Quote:
Quote:
Quote:
Quote:
#2561
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: Kindle 2, Kindle 3, Kindle Fire
hmm
Alright, I looked at some samples and I also saw what you had done. I went with the second method that you mentioned, though, about making my own links. Well, I thought I had it, but it's obviously not working. Here is what I came up with. If you have the time, could you look at this and kind of shed some more light on it for me? Thanks.
Spoiler:
#2562
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
Quote:
soup = self.index_to_soup(url)
where "url" was the url being passed to that function. In your code, you've taken the code that should have been in the called function and put it as the first line, but "url" isn't defined, so you never get a soup to work with.

To write recipes effectively, you need to use print statements to see what's happening. Put
print 'the soup is: ', soup
after that line to see what the soup is, and you'll see that url is not yet defined and there is no soup.

If you're not going to do it the way GoComics did it, I suspect you want:
soup = self.index_to_soup("http://www.fieldandstream.com/blogs/wild-chef")
However, doing it this way will only give you one feed - the one for Wild Chef. Doing it the way GoComics does will let you set up multiple feeds.
#2563
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java: Gravity T
Quote:
Spoiler:
Edit: Start with the above. It will give you the basic structure, since your code didn't appear to get to the page you needed to parse. The code above should get you there (check the printed soup to confirm in your output file). Once you have the soup being printed, we can work on the pseudocode. You should be able to adapt your own parsing code (as you posted) to replace the pseudocode above. Note that you can leave description and date blank for testing. You only need to parse a title (and you can even set that to a constant) and just parse out the article URL.
Last edited by Starson17; 08-30-2010 at 10:57 AM.
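A rough sketch of that kind of bare-bones testing skeleton: a constant title, blank date and description, and only the article URL actually parsed (the "Read Full Post" link-text test is borrowed from the earlier posts; everything else is a placeholder):
Code:
def parse_index(self):
    soup = self.index_to_soup('http://www.fieldandstream.com/blogs/wild-chef')
    print('the soup is: ', soup)   # confirm the page downloaded before parsing anything
    articles = []
    for a in soup.findAll('a', href=True):
        if self.tag_to_string(a).strip() == 'Read Full Post':
            articles.append({'title': u'Wild Chef post',  # a constant title is fine for testing
                             'url': u'http://www.fieldandstream.com' + a['href'],
                             'date': u'',                  # blank is fine for testing
                             'description': u''})
    return [(u'Wild Chef', articles)]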
#2564
Junior Member
Posts: 2
Karma: 10
Join Date: Aug 2010
Device: Kindle DX
Help with recipe
Hello, I tried to write a recipe for this site but failed miserably. (I'm learning Python.)
http://clipping.radiobras.gov.br/cli...psesDetail.php
However, it seems it should be easy since it is a simple page. Can anyone help me?
#2565
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: Kindle 2, Kindle 3, Kindle Fire
Starson17,
I was thinking the second method that you showed me was the one best suited for this situation. Actually, I wanted to learn both methods and have more tools/skills to work with in the future. Thanks for your continued support in this. I will work on what you have provided me with and get back to you when I have more questions. Once again, I appreciate your time.