
Go Back   MobileRead Forums > E-Book Software > Calibre > Recipes


Closed Thread
 
Old 08-31-2010, 03:49 PM   #2581
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm View Post
Is there a better free editor than Geany for Python? I swear these indents are driving me nuts. I was hoping there was some kind of compiler or checker that would bark at me so I could see what the issue actually was.
I don't have the answer for you, but I have seen a list of Python editors, including some free ones, so a Google might prove helpful. I use UltraEdit. It has three features I really like.

One is the ability to search defined folders and files, including subdirectories, for certain text, then open one or more of the located files. I often search *.recipe files in the resource directory for "keep_only" or "parse_index", etc., to see how other working recipes used those commands.

The second feature is having multiple files open for editing. I keep my recipe, my batch file for executing my recipe, and my output error file all open.

The last feature is the ability to execute a batch file with a single keystroke. I have the batch file for executing the recipe connected to that key.

Modify recipe, save it, hit execute, read errors in error file, rinse and repeat.

I believe Notepad++ is free and will do some of the above.
Old 08-31-2010, 06:58 PM   #2582
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Hey Starson17, I'm trying to apply what you showed me on Field and Stream, but I'm still a little confused.
I'm trying to play around with http://www.laineygossip.com/ for the other user. I can get the other articles just fine using the methods you showed me, but I'm having trouble getting the ones that are not inside <h2> tags. More specifically, look at http://www.laineygossip.com/ and notice how it has the date, then it goes "Dear Gossipers," blah blah blah.

My thought was to do this to get those articles and append them to the array, then do another for loop to get the other articles that follow a different criterion.

Here is what I'm having an issue with:
Spoiler:

Code:
def make_links(self, url):
    title = 'Temp'
    current_articles = []
    soup = self.index_to_soup(url)
    print 'The soup is: ', soup
    for t_item in soup.findAll('div', {"class":"leftcontent"}):
        print 't_item is: ', t_item
        title = t_item.h1.string
        for content in t_item.findAll('div', {"class":"artIntroShort"}):
            print 'The content is: ', content
            art_text = t_item.p.string
            print 'Art_text is :', art_text
            link = t_item.find('a')
            print 'The link is :', link
            url = self.INDEX + link['href']
            print 'The URL is :', url

        current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this

The articles are contained in the div class=leftcontent, and the title is inside an h1 tag there. Then I figured, since I was inside leftcontent due to the for loop, I would do another findAll for artIntroShort, then parse it for the url and the article text that is in the <p> tag.


here is the whole code i have thus far

Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
class GOSSIPBLOG(BasicNewsRecipe):
    title      = 'Gossip'
    __author__ = 'Tonythebookworm'
    description = 'Gossip'
    language = 'en'
    publisher           = 'Tonythebookworm'
    category            = 'gossip'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    # masthead_url        = ''
    # cover_url           = ''
    # recursions          = 0
    max_articles_per_feed = 10
    INDEX = 'http://www.laineygossip.com/'
    #keep_only_tags     = [dict(name='div', attrs={'class':['mainContent']})
    #                      ]
    #remove_tags = [dict(name='div', attrs={'id':['comments']})]
    
    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Gossip", u"http://www.laineygossip.com/"),
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        for t_item in soup.findAll('div', {"class":"leftcontent"}):
            print 't_item is: ', t_item
            title = t_item.h1.string
            for content in t_item.findAll('div', {"class":"artIntroShort"}):
                print 'The content is: ', content
                art_text = t_item.p.string
                print 'Art_text is :', art_text
                link = t_item.find('a')
                print 'The link is :', link
                url = self.INDEX + link['href']
                print 'The URL is :', url

            current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this

        #---------------- next section ---------------------------------
        for item in soup.findAll('h2'):
            print 'item2 is: ', item
            link2 = item.find('a')
            print 'the link2 is: ', link2
            if link2:
                url = self.INDEX + link2['href']
                print 'the title2 is: ', title
                print 'the url2 is: ', url
                current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this

        return current_articles


I know I'm close to getting this, yet it seems so far away.
Old 08-31-2010, 08:21 PM   #2583
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
I know I'm close to getting this, yet it seems so far away.
Here is what I gave you last time. Why doesn't that work?
Spoiler:
Code:
        for item in soup.findAll('h2'):
            link = item.find('a')
            if link:


In line 1 it finds all the <h2> tags.
In line 2 it looks at each one to decide if there is an <a> tag inside.
In line 3, if there was an <a> tag found, it proceeds to do what needs to be done (look at the code I gave you again).
I looked at the http://www.laineygossip.com/ page and it seems to have the same structure, with <a> tags (having the link you want) inside <h2> tags.
Old 08-31-2010, 08:43 PM   #2584
TonytheBookworm
Quote:
Originally Posted by Starson17 View Post
Here is what I gave you last time. Why doesn't that work?
Spoiler:
Code:
        for item in soup.findAll('h2'):
            link = item.find('a')
            if link:


In line 1 it finds all the <h2> tags.
In line 2 it looks at each one to decide if there is an <a> tag inside.
In line 3, if there was an <a> tag found, it proceeds to do what needs to be done (look at the code I gave you again).
I looked at the http://www.laineygossip.com/ page and it seems to have the same structure, with <a> tags (having the link you want) inside <h2> tags.
I think you're missing what I was trying to ask, or I asked it wrong. Yes, your code works fine even on this page, but there is an exception, which is what I'm having an issue with. The part you mentioned with for item in soup.findAll('h2') works great, and I actually got that working fine. My issue is the first part, where some of the articles are not within that structure. I will continue to work at it and see what I can come up with. I really want to figure this one out using what you have taught me.

This is the part that is throwing me; note it is not in <h2> and <a> tags like the rest of the page. I hope that explains what I mean. Hope I'm not bugging you on this; if so, just say so and I'll chill.

Spoiler:

Code:
<div class="artIntroShort">																						
			<p><span class="adpad300hp"><script src="http://ad.ca.doubleclick.net/adj/upt.laineygossip.home;tile=1;sz=300x250;ord={0}?" language="JavaScript1.1"></script><noscript>&lt;A HREF="http://ad.ca.doubleclick.net/jump/upt.laineygossip.home;tile=1;sz=300x250;ord={0}?" TARGET="_blank"&gt;&lt;IMG SRC="http://ad.ca.doubleclick.net/ad/upt.laineygossip.home;tile=1;sz=300x250;ord={0}?" BORDER="0" WIDTH="300" HEIGHT="250" ALT="Click Here" /&gt;&lt;/A&gt;</noscript></span>Dear Gossips,<br><br>Sorry to be a buzzkill but I think it’s the end of summer. Those science people may say it’s officially September 21 to mark the equinox but symbolically, for most of us, it’s really the start of school, even when we’re not in school. Or the Venice Film Festival when the stars get back to work, leading straight into TIFF and the VMAs and Fashion Week and then the fall movie schedule which is really when the jostling begins. That’s tomorrow, and it brings to end the slow season of celebrity. <br><br>Like clockwork then, Vanity Fair is releasing excerpts from their Lindsay Lohan exclusive and tabloid Wednesday tomorrow should be even bullsh-ttier than usual. <a href="/intro_31aug10.aspx?CatID=0&amp;CelID=0">Full Intro</a></p>											
			<p></p>
			<p class="comment">Posted at 6:53 AM</p>
		    </div>
Old 08-31-2010, 09:06 PM   #2585
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
I think you're missing what I was trying to ask
I did.
Quote:
This is the part that is throwing me; note it is not in <h2> and <a> tags like the rest of the page.
It looks like it's an <a> tag inside a <div class="artIntroShort"> tag. Correct? Then just do the same thing you did with an <a> tag inside an <h2> tag, but use div instead of h2 and specify the class. That should do it.
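For what it's worth, that pattern would look something like the sketch below. The HTML fragment and URLs are invented for illustration, and it uses the bs4 package, whereas calibre at the time bundled BeautifulSoup 3; the findAll/find calls behave the same way in both.

```python
from bs4 import BeautifulSoup

# Invented page fragment: one intro div plus one <h2> headline
html = '''
<div class="artIntroShort"><p>Dear Gossips... <a href="/intro.aspx">Full Intro</a></p></div>
<h2><a href="/story.aspx">Some Story</a></h2>
'''
soup = BeautifulSoup(html, 'html.parser')

urls = []
for div in soup.findAll('div', {'class': 'artIntroShort'}):
    link = div.find('a')          # first <a> inside the intro div
    if link:
        urls.append(link['href'])

print(urls)  # ['/intro.aspx']
```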

Quote:
Hope I'm not bugging you on this. If so just say so and i'll chill
Nope. I can stop at any time. I like to see others with the same interest I have.

Edit: looking back at your code, I see that's sort of what you did, but you have an extra for loop layer at the leftcontent that I don't think you need.

Last edited by Starson17; 08-31-2010 at 09:11 PM.
Old 08-31-2010, 10:07 PM   #2586
TonytheBookworm
Quote:
Originally Posted by Starson17 View Post
I did.


It looks like it's an <a> tag inside a <div class="artIntroShort"> tag. Correct? Then just do the same thing you did with an <a> tag inside an <h2> tag, but use div instead of h2 and specify the class. That should do it.


Nope. I can stop at any time. I like to see others with the same interest I have.

Edit: looking back at your code, I see that's sort of what you did, but you have an extra for loop layer at the leftcontent that I don't think you need.
Alright, as you mentioned, I had an extra for loop layer. I removed it and modified the code a little, and it works fine with one exception. The reason I had the extra for loop was to grab the title, which is the date at the top, inside leftcontent as an h1 tag. My logic was: get the title via the first for loop, then once I have it, start another for loop to get the content for that title. So would I get the title first in a single for loop, then append, then run the for loop for content, then append, then do the other for loop that looks for the <h2> stuff?
Basically this works fine to get the non-<h2> stuff, with the exception of the title:
Spoiler:

Code:
for content in soup.findAll('div', {"class":"artIntroShort"}):
    print 'The content is: ', content
    art_text = content.find('p')
    print 'Art_text is :', art_text
    link = content.find('a')
    print 'The link is :', link
    url = self.INDEX + link['href']
    print 'The URL is :', url
    current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this

my question is how would i get something like this to work ?
Spoiler:

Code:
#-------------------------------------------------------
# this for loop is trying to get the title
for t_item in soup.findAll('div', {"class":"leftcontent"}):
    print 't_item is: ', t_item
    rawh1 = t_item.find('h1')
    title = self.tag_to_string(rawh1)
    print 'rawh1 title is: ', title
# indent might not show right on here but this should be
# an independent for loop
#-------------------------------------------------------

#------------------- next get the non <h2> content; this works

for content in soup.findAll('div', {"class":"artIntroShort"}):
    print 'The content is: ', content
    art_text = content.find('p')
    print 'Art_text is :', art_text
    link = content.find('a')
    print 'The link is :', link
    url = self.INDEX + link['href']
    print 'The URL is :', url
    current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
#------------------------------------------------------------------------



Of course I have the return statements and all, but this is the block I'm concerned about. Thanks.

Also, I'm noticing that there are <span> tags inside the <p> tags, so when I search for the <a> inside the <p> I get the links for the ads instead of the last <a> tag. This one, I tell you, is really working the brain; it will be interesting to see how this works out. I looked at the output log and noticed that, like I said, it keeps setting the URL to the ad.doubleclick thing that is inside the <span>. I tried doing a remove_tags on that tag, but apparently it doesn't remove the tag until after the parsing.

Last edited by TonytheBookworm; 08-31-2010 at 11:29 PM. Reason: added more info
Old 09-01-2010, 09:17 AM   #2587
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
The reason I had done the extra for loop was to snag the title ....
When I get a chance, I'll try to look this over, but as I'm sure you are aware, there's no substitute for a careful look at the structure of the page you are scraping. If an extra for loop works for you, that's fine.
Quote:
Also, I'm noticing that there are <span> tags inside the <p> tags, so when I search for the <a> inside the <p> I get the links for the ads instead of the last <a> tag. This one, I tell you, is really working the brain.
Again, there may be a better way to locate your <a> tag by carefully studying the source page structure, but if you don't see one, then you can simply test each <a> tag you find. You can check to see if the <a> tag is embedded in a <span> tag using the "parent" test of Beautiful Soup. If the parent of the <a> tag is a span tag, skip it, and search again to get the second <a> tag, etc.
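A minimal sketch of that parent test follows. The HTML is an invented, cut-down version of the page fragment quoted above, and it uses the bs4 package; BeautifulSoup 3's parent attribute behaves the same way.

```python
from bs4 import BeautifulSoup

# Invented fragment: an ad link wrapped in a <span>, then the real article link
html = '''<div class="artIntroShort">
<p><span class="adpad300hp"><a href="http://ad.example/click">ad</a></span>
Dear Gossips... <a href="/intro_31aug10.aspx">Full Intro</a></p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

article_link = None
for a in soup.findAll('a'):
    if a.parent.name == 'span':   # embedded in a <span>: it's the ad, skip it
        continue
    article_link = a['href']

print(article_link)  # /intro_31aug10.aspx
```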

I'll leave you to play with that. I'm sure a closer look at your code and the page you're scraping would let me make better comments, but I'm short on time today. Good Luck!
Old 09-01-2010, 11:10 AM   #2588
c.espinosas
Junior Member
c.espinosas began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Sep 2010
Device: entourage edge
request of Milenio recipe

Hi!
I'd like to ask if someone has a recipe for Milenio Diario (a Mexican newspaper, http://impreso.milenio.com/Nacional/).
Opinion articles are not included in the RSS feeds, but I'd like them in the recipe.
Thanks a lot!
Cheers
Old 09-01-2010, 07:45 PM   #2589
TonytheBookworm
Quote:
Originally Posted by Starson17 View Post
I'll leave you to play with that. I'm sure a closer look at your code and the page you're scraping would let me make better comments, but I'm short on time today. Good Luck!
I did the "parent" thing you mentioned and it worked. I have a couple of issues that are probably simple fixes (I hope), yet I can't seem to grasp what is happening even after looking at the output log.

Issues I'm having: 1) For whatever reason I always get a full run of the whole page as an article. I'm not sure why, unless it searches for artIntroShort and then the <a> tags and doesn't find any (the webmaster isn't consistent); my guess is that (I can't seem to find it in my output log) link['href'] ends up being None, so the url ends up just being the INDEX.
2) This one is really puzzling me the most. I also see that the person who asked for help on this recipe faced a similar problem with the XML (that is why I didn't use the feed; I was trying this method to get the thumbnails), but for some reason the thumbnails don't come through. I looked in Firebug and they appear to be wrapped inside the mainContent tag. I even went as far as commenting out the keep_only tags and was faced with the same results.

Anyway, whenever you get some free time, have a look at this if you don't mind. Thanks!!!

Attached: Code that gets articles but has issues
Attached Files
File Type: rar gtest.rar (1,010 Bytes, 251 views)
Old 09-01-2010, 10:16 PM   #2590
TonytheBookworm
Forgive me for asking so many questions; it's pretty much the only way I know to learn. With that said, I was wondering: how would one parse a website that puts the article content behind pagination, so you have to keep following "next page" links, and yet keep it all in one article?

Basically lets say you had
page 1:
blah blah blah test blah blah
next page
page 2:
more stuff for same article
next page

How would you do that? My first guess would be using parse_index(), then somehow calling the article up and getting the article content, then doing a find to get the <a> inside that article, then getting the content and appending it to that article?

To get a better idea of what I'm talking about have a look at:
http://auto.howstuffworks.com/under-...-insurance.htm which is part of the http://feeds.feedburner.com/Howstuff...ffDailyRssFeed feed

Notice how it shows kind of a description, if you will, then a "next page" link, then shows more, then "next page", and so forth? I think once I get some general templates for how this stuff works (that I can understand), then I'll be fine.
Old 09-02-2010, 09:34 AM   #2591
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
I was wondering how one would parse a website that puts the article content behind pagination and yet keep it all in one article?...
How would you do that? My first guess would be using parse_index()
I refer to this as a "multipage" article. No, you don't use parse_index. You use parse_index when you don't have an RSS feed and need to build your own feed by scraping. The multipage problem occurs later, when the articles in the feeds are actually being processed. At that point, you already have the feed (you might have gotten it by a normal RSS feed or by scraping and building your own with parse_index - it doesn't matter how).

Briefly, in multipage you use BeautifulSoup to grab each subsequent page by following the "next page" links and you append them all into the soup for the first page to make a large single BS object. Search this thread for "multipage." Look at the discussion I had with "rty" to see some examples. Search the builtin recipes for "append_page" or search here for that and you will find many examples of how-to.
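The append idea can be sketched in a self-contained way like this. An invented two-page in-memory "site" stands in for index_to_soup, the div class names are made up, and it uses the bs4 package rather than the BeautifulSoup bundled with calibre; the find/extract/append calls work the same way.

```python
from bs4 import BeautifulSoup

# Invented two-page article standing in for a real site
PAGES = {
    'page1': '<body><div class="content">part one. </div>'
             '<div class="pagination"><a href="page2">next page</a></div></body>',
    'page2': '<body><div class="content">part two.</div></body>',
}

def fetch(url):  # stands in for self.index_to_soup(url)
    return BeautifulSoup(PAGES[url], 'html.parser')

def append_pages(soup):
    pager = soup.find('div', attrs={'class': 'pagination'})
    if pager:
        soup2 = fetch(pager.a['href'])
        append_pages(soup2)                 # recurse: pull in page 3, 4, ... first
        body = soup.find('div', attrs={'class': 'content'})
        for extra in soup2.findAll('div', attrs={'class': 'content'}):
            body.append(extra.extract())    # splice the next page's text in
        pager.extract()                     # drop the "next page" link

soup = fetch('page1')
append_pages(soup)
print(soup.get_text())  # "part one. part two."
```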
Old 09-02-2010, 09:45 AM   #2592
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
1) For whatever reason I always get a full run of the whole page as an article. I'm not sure why, unless it searches for artIntroShort and then the <a> tags and doesn't find any (the webmaster isn't consistent); my guess is that link['href'] ends up being None, so the url ends up just being the INDEX.
You've probably got too many print statements in there. You do realize they are only there for debugging, right? Just comment out the ones you are not interested in and add more until you find your problem.

Quote:
2) This one is really puzzling me the most. I also see that the person who asked for help on this recipe faced a similar problem with the XML (that is why I didn't use the feed; I was trying this method to get the thumbnails), but for some reason the thumbnails don't come through. I looked in Firebug and they appear to be wrapped inside the mainContent tag. I even went as far as commenting out the keep_only tags and was faced with the same results.
I briefly looked at someone's question about missing thumbnail images. I can't tell you (yet) what's going on, but here's my process:

1) If something isn't appearing, make sure your own keep_only or remove_tags aren't stripping it. Try to get it to appear with all the other junk.
2) Maybe it's being removed with removal of scripting. Look at the page source to see. Try leaving scripts on in your test recipe.
3) If it still looks like the item should be picked up, sometimes the site is protecting the image from scraping. You may need to have the correct useragent, the correct cookie, the correct referer header, etc. FireFox and TamperData help here. There are techniques for simulating each of these. I try to get FireFox to act like Calibre (or vice-versa) to verify.

The bottom line is that if FireFox can see it, so can your recipe.
Old 09-02-2010, 01:28 PM   #2593
TonytheBookworm
I've been looking at the AdventureGamer code and I have a few questions.

Spoiler:

Code:
def preprocess_html(self, soup):
       mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
       soup.head.insert(0,mtag)
What is the reason for inserting the meta tag?
Code:
       for item in soup.findAll(style=True):
           del item['style']
Why is the above used? It appears to remove all instances of style, but why is it needed?
Code:
       self.append_page(soup, soup.body, 3)
I'm not really clear on this. It appears to me that you are taking the whole soup and appending to the body of the soup at position 3?

Code:
       pager = soup.find('div',attrs={'class':'toolbar_fat'})
       if pager:
          pager.extract()
I looked in the code and didn't see why this extraction is needed, because the navigation appears to be inside toolbar_fat_next.


and here is my painful attempt
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'How Stuff Works'
    language = 'en'
    __author__ = 'TonytheBookworm'
    description = 'How stuff works'
    publisher = 'Tony'
    category = 'information'
    oldest_article = 10
    max_articles_per_feed = 100
    no_stylesheets = True
    #INDEX                 = u'http://www.adventuregamers.com'
    #extra_css = '.headline {font-size: x-large;} \n .fact { padding-top: 10pt }'
    #masthead_url = 'http://gawand.org/wp-content/uploads/2010/06/ajc-logo.gif'
    #keep_only_tags    = [
     #                    dict(name='div', attrs={'class':['blogEntryHeader','blogEntryContent']})
      #                 ,dict(attrs={'id':['cxArticleText','cxArticleBodyText']})
      #                  ]
    feeds          = [
                      ('AutoStuff', 'http://feeds.feedburner.com/HowstuffworksAutostuffDailyRssFeed'),
                      
                    ]

   
        
        
    def append_page(self, soup, appendtag, position):
        pager = soup.find('div',attrs={'class':'pagination'})
        if pager:
           nexturl = pager.a['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'class':'content'})
           for it in texttag.findAll(style=True):
               del it['style']
           newpos = len(texttag.contents)          
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)     

    def preprocess_html(self, soup):
       mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
       soup.head.insert(0,mtag)    
       for item in soup.findAll(style=True):
           del item['style']
       self.append_page(soup, soup.body, 3)
       pager = soup.find('div',attrs={'class':'toolbar_fat'})
       if pager:
          pager.extract()        
       return soup
Old 09-02-2010, 01:55 PM   #2594
kiklop74
Guru
kiklop74 can program the VCR without an owner's manual.

Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
Quote:
Originally Posted by TonytheBookworm View Post
I've been looking at the AdventureGamer code and I have a few questions.
Quote:
Originally Posted by TonytheBookworm View Post
Code:
def preprocess_html(self, soup):
       mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
       soup.head.insert(0,mtag)
what is the reason for inserting the meta tag ?
That was an early experiment of mine with soup; it is not needed now and I do not put it in new recipes. You can just ignore it.

Quote:
Originally Posted by TonytheBookworm View Post
Code:
       for item in soup.findAll(style=True):
           del item['style']
why is the above used? It appears to remove all instance of style but why is it needed?
This is needed to remove all style attributes, which usually specify text properties. We want the text as raw as possible, without any styles whatsoever.


Quote:
Code:
       self.append_page(soup, soup.body, 3)
I'm not really clear on this. It appears to me that you are taking the whole soup. appending to the body of the soup with a position of 3?

Code:
       pager = soup.find('div',attrs={'class':'toolbar_fat'})
       if pager:
          pager.extract()
I looked in the code and didn't see why the extraction of this is needed. Because the navigation appears to be inside toolbar_fat_next
This would require a bit longer explanation, but to shorten it: I'm basically merging multipage articles into one. The other code example deletes all divs with class toolbar_fat; I remove it because we do not need to see the navigation once everything is tied into one uniform article.
Old 09-02-2010, 02:55 PM   #2595
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
I've been looking at the AdventureGamer code and I have a few questions.
You got direct answers from the expert/author, but I'll add a bit. I learned about multipage by studying the same AdventureGamer recipe.
Spoiler:

Quote:
Code:
       for item in soup.findAll(style=True):
           del item['style']
why is the above used? It appears to remove all instance of style but why is it needed?
I think of it as optional, but it's preferable to start with a clean base and then apply styles with the extra CSS; it makes your recipes more consistent. You're right, though, that sometimes there can be style info you may want to preserve.

Quote:
Code:
       self.append_page(soup, soup.body, 3)
I'm not really clear on this. It appears to me that you are taking the whole soup. appending to the body of the soup with a position of 3?
Have you noticed how cool this is? append_page is recursive: it calls itself. At this point, it's being called to start the append at position 3 (after the basic html, head and body tags), but when it calls itself, it appends the next page at a different point, not at 3, but where
Code:
newpos = len(texttag.contents)
Quote:
Code:
       pager = soup.find('div',attrs={'class':'toolbar_fat'})
       if pager:
          pager.extract()
I looked in the code and didn't see why the extraction of this is needed. Because the navigation appears to be inside toolbar_fat_next
Without answering your question, let me tell you that extract() confused me when I first ran into it, so I'll just give you a reminder of what it does. If "pager" identifies a tag (and all child tags) in "soup", then extracting it gives you two totally separate and disentangled soups. The first is the original soup, without any connection to what was in pager, and the second is pager, without any connection to what was in soup. You can use them for whatever you want. extract() can be used to delete stuff from soup, or to give you a cleaned-up pager, which you may want to use on its own.
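A tiny illustration of that, with invented HTML (bs4 here; extract() behaves the same in the BeautifulSoup bundled with calibre):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p>article text</p>'
                     '<div class="toolbar_fat"><a href="page2.html">next</a></div></body>',
                     'html.parser')

pager = soup.find('div', attrs={'class': 'toolbar_fat'})
pager.extract()  # soup and pager are now two independent trees

print(soup)             # <body><p>article text</p></body>
print(pager.a['href'])  # page2.html -- the extracted subtree is still usable
```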