
Go Back   MobileRead Forums > E-Book Software > Calibre > Recipes


Closed Thread
 
Old 08-31-2010, 03:49 PM   #2581
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm View Post
Is there a better free editor than Geany for Python? I swear these indents are driving me nuts. I was hoping there was some kind of compiler or checker that would bark at me so I could see what the issue actually was.
I don't have the answer for you, but I have seen a list of Python editors, including some free ones, so a Google might prove helpful. I use UltraEdit. It has three features I really like.

One is the ability to search defined folders and files, including subdirectories, for certain text, then open one or more of the located files. I often search *.recipe files in the resource directory for "keep_only" or "parse_index", etc., to see how other working recipes used those commands.

The second feature is having multiple files open for editing. I keep my recipe, my batch file for executing my recipe, and my output error file all open.

The last feature is the ability to execute a batch file with a single keystroke. I have the batch file for executing the recipe connected to that key.

Modify recipe, save it, hit execute, read errors in error file, rinse and repeat.

I believe Notepad++ is free and will do some of the above.
Old 08-31-2010, 06:58 PM   #2582
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Hey Starson17, I'm trying to apply what you showed me on Field and Stream, but I'm still a little confused.
I'm trying to play around with http://www.laineygossip.com/ for the other user. I can get the other articles just fine using the methods you showed me, but I'm having trouble getting the ones that are not inside <h2> tags. More specifically, look at http://www.laineygossip.com/ and notice how it has the date, then it goes "Dear Gossipers," blah blah blah.

My thought was to do this to get those articles and append them to the array, then do another for loop to get the other articles that follow a different criterion.

Here is what I'm having an issue with:
Spoiler:

Code:
def make_links(self, url):
    title = 'Temp'
    current_articles = []
    soup = self.index_to_soup(url)
    print 'The soup is: ', soup
    for t_item in soup.findAll('div', {"class":"leftcontent"}):
        print 't_item is: ', t_item
        title = t_item.h1.string
        for content in t_item.findAll('div', {"class":"artIntroShort"}):
            print 'The content is: ', content
            art_text = t_item.p.string
            print 'Art_text is :', art_text
            link = t_item.find('a')
            print 'The link is :', link
            url = self.INDEX + link['href']
            print 'The URL is :', url

        current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this

The articles are contained in the div class=leftcontent, and the title is inside an h1 tag there. Then I figured, since I was inside leftcontent due to the for loop, I would do another findAll for artIntroShort, then parse it for the url and the article text that is in the <p> tag.


here is the whole code i have thus far

Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
class GOSSIPBLOG(BasicNewsRecipe):
    title      = 'Gossip'
    __author__ = 'Tonythebookworm'
    description = 'Gossip'
    language = 'en'
    publisher           = 'Tonythebookworm'
    category            = 'gossip'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    # masthead_url        = ''
    # cover_url           = ''
    # recursions          = 0
    max_articles_per_feed = 10
    INDEX = 'http://www.laineygossip.com/'
    #keep_only_tags     = [dict(name='div', attrs={'class':['mainContent']})
    #                      ]
    #remove_tags = [dict(name='div', attrs={'id':['comments']})]
    
    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Gossip", u"http://www.laineygossip.com/"),
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        for t_item in soup.findAll('div', {"class":"leftcontent"}):
            print 't_item is: ', t_item
            title = t_item.h1.string
            for content in t_item.findAll('div', {"class":"artIntroShort"}):
                print 'The content is: ', content
                art_text = t_item.p.string
                print 'Art_text is :', art_text
                link = t_item.find('a')
                print 'The link is :', link
                url = self.INDEX + link['href']
                print 'The URL is :', url

            current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this

        #---------------- next section ---------------------------------
        for item in soup.findAll('h2'):
            print 'item2 is: ', item
            link2 = item.find('a')
            print 'the link2 is: ', link2
            if link2:
                url = self.INDEX + link2['href']
                print 'the title2 is: ', title
                print 'the url2 is: ', url
                current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this

        return current_articles


I know I'm close to getting this, yet it seems so far away.
Old 08-31-2010, 08:21 PM   #2583
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
I know I'm close to getting this, yet it seems so far away.
Here is what I gave you last time. Why doesn't that work?
Spoiler:
Code:
        for item in soup.findAll('h2'):
            link = item.find('a')
            if link:


In line 1 it finds all the <h2> tags.
In line 2 it looks at each one to decide if there is an <a> tag inside.
In line 3, if there was an <a> tag found, it proceeds to do what needs to be done (look at the code I gave you again).
I looked at the http://www.laineygossip.com/ page and it seems to have the same structure, with <a> tags (having the link you want) inside <h2> tags.
Old 08-31-2010, 08:43 PM   #2584
TonytheBookworm
Quote:
Originally Posted by Starson17 View Post
Here is what I gave you last time. Why doesn't that work?
Spoiler:
Code:
        for item in soup.findAll('h2'):
            link = item.find('a')
            if link:


In line 1 it finds all the <h2> tags.
In line 2 it looks at each one to decide if there is an <a> tag inside.
In line 3, if there was an <a> tag found, it proceeds to do what needs to be done (look at the code I gave you again).
I looked at the http://www.laineygossip.com/ page and it seems to have the same structure, with <a> tags (having the link you want) inside <h2> tags.
I think you're missing what I was trying to ask, or I asked it wrong. Yes, your code works fine even on this page, but there is an exception, which is what I'm having an issue with. The part you mentioned with for item in soup.findAll('h2') works great, and I actually got that working fine. My issue is the first part, where some of the articles are not within that structure. I will continue to work at it and see what I can come up with. I really want to figure this one out using what you have taught me.

This is the part that is throwing me; note it is not in <h2> and <a> tags like the rest of the page. I hope that explains what I mean. Hope I'm not bugging you on this; if so, just say so and I'll chill.

Spoiler:

Code:
<div class="artIntroShort">																						
			<p><span class="adpad300hp"><script src="http://ad.ca.doubleclick.net/adj/upt.laineygossip.home;tile=1;sz=300x250;ord={0}?" language="JavaScript1.1"></script><noscript>&lt;A HREF="http://ad.ca.doubleclick.net/jump/upt.laineygossip.home;tile=1;sz=300x250;ord={0}?" TARGET="_blank"&gt;&lt;IMG SRC="http://ad.ca.doubleclick.net/ad/upt.laineygossip.home;tile=1;sz=300x250;ord={0}?" BORDER="0" WIDTH="300" HEIGHT="250" ALT="Click Here" /&gt;&lt;/A&gt;</noscript></span>Dear Gossips,<br><br>Sorry to be a buzzkill but I think it’s the end of summer. Those science people may say it’s officially September 21 to mark the equinox but symbolically, for most of us, it’s really the start of school, even when we’re not in school. Or the Venice Film Festival when the stars get back to work, leading straight into TIFF and the VMAs and Fashion Week and then the fall movie schedule which is really when the jostling begins. That’s tomorrow, and it brings to end the slow season of celebrity. <br><br>Like clockwork then, Vanity Fair is releasing excerpts from their Lindsay Lohan exclusive and tabloid Wednesday tomorrow should be even bullsh-ttier than usual. <a href="/intro_31aug10.aspx?CatID=0&amp;CelID=0">Full Intro</a></p>											
			<p></p>
			<p class="comment">Posted at 6:53 AM</p>
		    </div>
Old 08-31-2010, 09:06 PM   #2585
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
I think you're missing what I was trying to ask
I did.
Quote:
This is the part that is throwing me; note it is not in <h2> and <a> tags like the rest of the page.
It looks like it's an <a> tag inside a <div class="artIntroShort"> tag. Correct? Then just do the same thing you did with an <a> tag inside an <h2> tag, but use div instead of h2 and specify the class. That should do it.
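For what it's worth, that pattern would look something like the sketch below. The HTML fragment and URLs are invented for illustration, and it uses the bs4 package, whereas calibre at the time bundled BeautifulSoup 3; the findAll/find calls behave the same way in both.

```python
from bs4 import BeautifulSoup

# Invented page fragment: one intro div plus one <h2> headline
html = '''
<div class="artIntroShort"><p>Dear Gossips... <a href="/intro.aspx">Full Intro</a></p></div>
<h2><a href="/story.aspx">Some Story</a></h2>
'''
soup = BeautifulSoup(html, 'html.parser')

urls = []
for div in soup.findAll('div', {'class': 'artIntroShort'}):
    link = div.find('a')          # first <a> inside the intro div
    if link:
        urls.append(link['href'])

print(urls)  # ['/intro.aspx']
```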

Quote:
Hope I'm not bugging you on this. If so just say so and i'll chill
Nope. I can stop at any time. I like to see others with the same interest I have.

Edit: looking back at your code, I see that's sort of what you did, but you have an extra for loop layer at the leftcontent that I don't think you need.

Last edited by Starson17; 08-31-2010 at 09:11 PM.
Old 08-31-2010, 10:07 PM   #2586
TonytheBookworm
Quote:
Originally Posted by Starson17 View Post
I did.


It looks like it's an <a> tag inside a <div class="artIntroShort"> tag. Correct? Then just do the same thing you did with an <a> tag inside an <h2> tag, but use div instead of h2 and specify the class. That should do it.


Nope. I can stop at any time. I like to see others with the same interest I have.

Edit: looking back at your code, I see that's sort of what you did, but you have an extra for loop layer at the leftcontent that I don't think you need.
Alright, as you mentioned, I had an extra for loop layer. I removed it and modified the code a little, and it works fine with one exception. The reason I had the extra for loop was to grab the title, which is the date at the top, inside leftcontent as an h1 tag. My logic was: get the title via the first for loop, then once I have it, start another for loop to get the content for that title. So would I get the title first in a single for loop, then append, then run the for loop for content, then append, then do the other for loop that looks for the <h2> stuff?
Basically this works fine to get the non-<h2> stuff, with the exception of the title:
Spoiler:

Code:
for content in soup.findAll('div', {"class":"artIntroShort"}):
    print 'The content is: ', content
    art_text = content.find('p')
    print 'Art_text is :', art_text
    link = content.find('a')
    print 'The link is :', link
    url = self.INDEX + link['href']
    print 'The URL is :', url
    current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this

my question is how would i get something like this to work ?
Spoiler:

Code:
#-------------------------------------------------------
# this for loop is trying to get the title
for t_item in soup.findAll('div', {"class":"leftcontent"}):
    print 't_item is: ', t_item
    rawh1 = t_item.find('h1')
    title = self.tag_to_string(rawh1)
    print 'rawh1 title is: ', title
# indent might not show right on here but this should be
# an independent for loop
#-------------------------------------------------------

#------------------- next get the non <h2> content; this works

for content in soup.findAll('div', {"class":"artIntroShort"}):
    print 'The content is: ', content
    art_text = content.find('p')
    print 'Art_text is :', art_text
    link = content.find('a')
    print 'The link is :', link
    url = self.INDEX + link['href']
    print 'The URL is :', url
    current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
#------------------------------------------------------------------------



Of course I have the return statements and all, but this is the block I'm concerned about. Thanks.

Also, I'm noticing that there are <span> tags inside the <p> tags, so when I search for the <a> inside the <p> I get the links for the ads instead of the last <a> tag. This one, I tell you, is really working the brain; it will be interesting to see how this works out. I looked at the output log and noticed that, like I said, it keeps setting the URL to the ad.doubleclick thing that is inside the <span>. I tried doing a remove_tags on that tag, but apparently it doesn't remove the tag until after the parsing.

Last edited by TonytheBookworm; 08-31-2010 at 11:29 PM. Reason: added more info
Old 09-01-2010, 09:17 AM   #2587
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
The reason I had done the extra for loop was to snag the title ....
When I get a chance, I'll try to look this over, but as I'm sure you are aware, there's no substitute for a careful look at the structure of the page you are scraping. If an extra for loop works for you, that's fine.
Quote:
Also, I'm noticing that there are <span> tags inside the <p> tags, so when I search for the <a> inside the <p> I get the links for the ads instead of the last <a> tag. This one, I tell you, is really working the brain.
Again, there may be a better way to locate your <a> tag by carefully studying the source page structure, but if you don't see one, then you can simply test each <a> tag you find. You can check to see if the <a> tag is embedded in a <span> tag using the "parent" test of Beautiful Soup. If the parent of the <a> tag is a span tag, skip it, and search again to get the second <a> tag, etc.
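A minimal sketch of that parent test follows. The HTML is an invented, cut-down version of the page fragment quoted above, and it uses the bs4 package; BeautifulSoup 3's parent attribute behaves the same way.

```python
from bs4 import BeautifulSoup

# Invented fragment: an ad link wrapped in a <span>, then the real article link
html = '''<div class="artIntroShort">
<p><span class="adpad300hp"><a href="http://ad.example/click">ad</a></span>
Dear Gossips... <a href="/intro_31aug10.aspx">Full Intro</a></p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

article_link = None
for a in soup.findAll('a'):
    if a.parent.name == 'span':   # embedded in a <span>: it's the ad, skip it
        continue
    article_link = a['href']

print(article_link)  # /intro_31aug10.aspx
```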

I'll leave you to play with that. I'm sure a closer look at your code and the page you're scraping would let me make better comments, but I'm short on time today. Good Luck!
Old 09-01-2010, 11:10 AM   #2588
c.espinosas
Junior Member
c.espinosas began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Sep 2010
Device: entourage edge
request of Milenio recipe

Hi!
I'd like to ask if someone has a recipe for Milenio Diario (a Mexican newspaper, http://impreso.milenio.com/Nacional/).
Opinion articles are not included in the RSS feeds, but I'd like them in the recipe.
Thanks a lot!
Cheers
Old 09-01-2010, 07:45 PM   #2589
TonytheBookworm
Quote:
Originally Posted by Starson17 View Post
I'll leave you to play with that. I'm sure a closer look at your code and the page you're scraping would let me make better comments, but I'm short on time today. Good Luck!
I did the "parent" thing you mentioned and it worked. I have a couple of issues that are probably simple fixes (I hope), yet I can't seem to grasp what is happening even after looking at the output log.

Issues I'm having: 1) For whatever reason I always get a full run of the whole page as an article. I'm not sure why, unless it searches for artIntroShort and then the <a> tags and doesn't find any (the webmaster isn't consistent); my guess is that (I can't seem to find it in my output log) link['href'] ends up being None, so the url ends up just being the INDEX.
2) This one is really puzzling me the most. I also see that the person who asked for help on this recipe faced a similar problem with the XML (that is why I didn't use the feed; I was trying this method to get the thumbnails), but for some reason the thumbnails don't come through. I looked in Firebug and they appear to be wrapped inside the mainContent tag. I even went as far as commenting out the keep_only tags and was faced with the same results.

Anyway, whenever you get some free time, have a look at this if you don't mind. Thanks!!!

Attached: Code that gets articles but has issues
Attached Files
File Type: rar gtest.rar (1,010 Bytes, 251 views)
Old 09-01-2010, 10:16 PM   #2590
TonytheBookworm
Forgive me for asking so many questions; it's pretty much the only way I know to learn. With that said, I was wondering: how would one parse a website that puts the article content behind pagination, so you have to keep following "next page" links, and yet keep it all in one article?

Basically lets say you had
page 1:
blah blah blah test blah blah
next page
page 2:
more stuff for same article
next page

How would you do that? My first guess would be using parse_index(), then somehow calling the article up and getting the article content, then doing a find to get the <a> inside that article, then getting the content and appending it to that article?

To get a better idea of what I'm talking about have a look at:
http://auto.howstuffworks.com/under-...-insurance.htm which is part of the http://feeds.feedburner.com/Howstuff...ffDailyRssFeed feed

Notice how it shows kind of a description, if you will, then a "next page" link, then shows more, then "next page", and so forth? I think once I get some general templates for how this stuff works (that I can understand), then I'll be fine.
Old 09-02-2010, 09:34 AM   #2591
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
I was wondering how one would parse a website that puts the article content behind pagination and yet keep it all in one article?...
How would you do that? My first guess would be using parse_index()
I refer to this as a "multipage" article. No, you don't use parse_index. You use parse_index when you don't have an RSS feed and need to build your own feed by scraping. The multipage problem occurs later, when the articles in the feeds are actually being processed. At that point, you already have the feed (you might have gotten it by a normal RSS feed or by scraping and building your own with parse_index - it doesn't matter how).

Briefly, in multipage you use BeautifulSoup to grab each subsequent page by following the "next page" links and you append them all into the soup for the first page to make a large single BS object. Search this thread for "multipage." Look at the discussion I had with "rty" to see some examples. Search the builtin recipes for "append_page" or search here for that and you will find many examples of how-to.
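The append idea can be sketched in a self-contained way like this. An invented two-page in-memory "site" stands in for index_to_soup, the div class names are made up, and it uses the bs4 package rather than the BeautifulSoup bundled with calibre; the find/extract/append calls work the same way.

```python
from bs4 import BeautifulSoup

# Invented two-page article standing in for a real site
PAGES = {
    'page1': '<body><div class="content">part one. </div>'
             '<div class="pagination"><a href="page2">next page</a></div></body>',
    'page2': '<body><div class="content">part two.</div></body>',
}

def fetch(url):  # stands in for self.index_to_soup(url)
    return BeautifulSoup(PAGES[url], 'html.parser')

def append_pages(soup):
    pager = soup.find('div', attrs={'class': 'pagination'})
    if pager:
        soup2 = fetch(pager.a['href'])
        append_pages(soup2)                 # recurse: pull in page 3, 4, ... first
        body = soup.find('div', attrs={'class': 'content'})
        for extra in soup2.findAll('div', attrs={'class': 'content'}):
            body.append(extra.extract())    # splice the next page's text in
        pager.extract()                     # drop the "next page" link

soup = fetch('page1')
append_pages(soup)
print(soup.get_text())  # "part one. part two."
```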
Old 09-02-2010, 09:45 AM   #2592
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
1) For whatever reason I always get a full run of the whole page as an article. I'm not sure why, unless it searches for artIntroShort and then the <a> tags and doesn't find any (the webmaster isn't consistent); my guess is that link['href'] ends up being None, so the url ends up just being the INDEX.
You've probably got too many print statements in there. You do realize they are only there for debugging, right? Just comment out the ones you are not interested in and add more until you find your problem.

Quote:
2) This one is really puzzling me the most. I also see that the person who asked for help on this recipe faced a similar problem with the XML (that is why I didn't use the feed; I was trying this method to get the thumbnails), but for some reason the thumbnails don't come through. I looked in Firebug and they appear to be wrapped inside the mainContent tag. I even went as far as commenting out the keep_only tags and was faced with the same results.
I briefly looked at someone's question about missing thumbnail images. I can't tell you (yet) what's going on, but here's my process:

1) If something isn't appearing, make sure your own keep_only or remove_tags aren't stripping it. Try to get it to appear with all the other junk.
2) Maybe it's being removed with removal of scripting. Look at the page source to see. Try leaving scripts on in your test recipe.
3) If it still looks like the item should be picked up, sometimes the site is protecting the image from scraping. You may need to have the correct useragent, the correct cookie, the correct referer header, etc. FireFox and TamperData help here. There are techniques for simulating each of these. I try to get FireFox to act like Calibre (or vice-versa) to verify.

The bottom line is that if FireFox can see it, so can your recipe.
Old 09-02-2010, 01:28 PM   #2593
TonytheBookworm
I've been looking at the AdventureGamer code and I have a few questions.

Spoiler:

Code:
def preprocess_html(self, soup):
       mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
       soup.head.insert(0,mtag)
What is the reason for inserting the meta tag?
Code:
       for item in soup.findAll(style=True):
           del item['style']
Why is the above used? It appears to remove all instances of style, but why is it needed?
Code:
       self.append_page(soup, soup.body, 3)
I'm not really clear on this. It appears to me that you are taking the whole soup and appending to the body of the soup at position 3?

Code:
       pager = soup.find('div',attrs={'class':'toolbar_fat'})
       if pager:
          pager.extract()
I looked in the code and didn't see why this extraction is needed, because the navigation appears to be inside toolbar_fat_next.


and here is my painful attempt
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'How Stuff Works'
    language = 'en'
    __author__ = 'TonytheBookworm'
    description = 'How stuff works'
    publisher = 'Tony'
    category = 'information'
    oldest_article = 10
    max_articles_per_feed = 100
    no_stylesheets = True
    #INDEX                 = u'http://www.adventuregamers.com'
    #extra_css = '.headline {font-size: x-large;} \n .fact { padding-top: 10pt }'
    #masthead_url = 'http://gawand.org/wp-content/uploads/2010/06/ajc-logo.gif'
    #keep_only_tags    = [
     #                    dict(name='div', attrs={'class':['blogEntryHeader','blogEntryContent']})
      #                 ,dict(attrs={'id':['cxArticleText','cxArticleBodyText']})
      #                  ]
    feeds          = [
                      ('AutoStuff', 'http://feeds.feedburner.com/HowstuffworksAutostuffDailyRssFeed'),
                      
                    ]

   
        
        
    def append_page(self, soup, appendtag, position):
        pager = soup.find('div',attrs={'class':'pagination'})
        if pager:
           nexturl = pager.a['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'class':'content'})
           for it in texttag.findAll(style=True):
               del it['style']
           newpos = len(texttag.contents)          
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)     

    def preprocess_html(self, soup):
       mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
       soup.head.insert(0,mtag)    
       for item in soup.findAll(style=True):
           del item['style']
       self.append_page(soup, soup.body, 3)
       pager = soup.find('div',attrs={'class':'toolbar_fat'})
       if pager:
          pager.extract()        
       return soup
Old 09-02-2010, 01:55 PM   #2594
kiklop74
Guru
kiklop74 can program the VCR without an owner's manual.

Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
Quote:
Originally Posted by TonytheBookworm View Post
I've been looking at the AdventureGamer code and I have a few questions.
Quote:
Originally Posted by TonytheBookworm View Post
Code:
def preprocess_html(self, soup):
       mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
       soup.head.insert(0,mtag)
what is the reason for inserting the meta tag ?
That was an early experiment of mine with soup; it is not needed now and I do not put it in new recipes. You can just ignore it.

Quote:
Originally Posted by TonytheBookworm View Post
Code:
       for item in soup.findAll(style=True):
           del item['style']
why is the above used? It appears to remove all instance of style but why is it needed?
This is needed to remove all style attributes, which usually specify text properties. We want the text as raw as possible, without any styles whatsoever.


Quote:
Code:
       self.append_page(soup, soup.body, 3)
I'm not really clear on this. It appears to me that you are taking the whole soup. appending to the body of the soup with a position of 3?

Code:
       pager = soup.find('div',attrs={'class':'toolbar_fat'})
       if pager:
          pager.extract()
I looked in the code and didn't see why the extraction of this is needed. Because the navigation appears to be inside toolbar_fat_next
This would require a bit longer explanation, but to shorten it: I'm basically merging multipage articles into one. The other code example deletes all divs with class toolbar_fat; I remove it because we do not need to see the navigation once everything is tied into one uniform article.
Old 09-02-2010, 02:55 PM   #2595
Starson17
Quote:
Originally Posted by TonytheBookworm View Post
I've been looking at the AdventureGamer code and I have a few questions.
You got direct answers from the expert/author, but I'll add a bit. I learned about multipage by studying the same AdventureGamer recipe.
Spoiler:

Quote:
Code:
       for item in soup.findAll(style=True):
           del item['style']
why is the above used? It appears to remove all instance of style but why is it needed?
I think of it as optional, but it's preferable to start with a clean base and then apply styles with the extra CSS; it makes your recipes more consistent. You're right, though, that sometimes there can be style info you may want to preserve.

Quote:
Code:
       self.append_page(soup, soup.body, 3)
I'm not really clear on this. It appears to me that you are taking the whole soup. appending to the body of the soup with a position of 3?
Have you noticed how cool this is? append_page is recursive: it calls itself. At this point, it's being called to start the append at position 3 (after the basic html, head and body tags), but when it calls itself, it appends the next page at a different point, not at 3, but where
Code:
newpos = len(texttag.contents)
Quote:
Code:
       pager = soup.find('div',attrs={'class':'toolbar_fat'})
       if pager:
          pager.extract()
I looked in the code and didn't see why the extraction of this is needed. Because the navigation appears to be inside toolbar_fat_next
Without answering your question, let me tell you that extract() confused me when I first ran into it, so I'll just give you a reminder of what it does. If "pager" identifies a tag (and all child tags) in "soup", then extracting it gives you two totally separate and disentangled soups. The first is the original soup, without any connection to what was in pager, and the second is pager, without any connection to what was in soup. You can use them for whatever you want. extract() can be used to delete stuff from soup, or to give you a cleaned-up pager, which you may want to use on its own.
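A tiny illustration of that, with invented HTML (bs4 here; extract() behaves the same in the BeautifulSoup bundled with calibre):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p>article text</p>'
                     '<div class="toolbar_fat"><a href="page2.html">next</a></div></body>',
                     'html.parser')

pager = soup.find('div', attrs={'class': 'toolbar_fat'})
pager.extract()  # soup and pager are now two independent trees

print(soup)             # <body><p>article text</p></body>
print(pager.a['href'])  # page2.html -- the extracted subtree is still usable
```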