Custom recipes (archive, read-only) - Page 137

notyou · 06-04-2010, 03:24 AM

Quote:

Originally Posted by notyou

Hoping someone can help me with the Wired Magazine recipe, which the latest June 2010 issue seems to have broken.

Calibre is giving me the dreaded "AttributeError: 'NoneType' object has no attribute 'a'" error:

It's happening here:

File "c:\docume~1\darryl~1\locals~1\temp\calibre_0.6.54 _jujqqs_recipes\recipe0.py", line 79, in parse_index
url = 'http://www.wired.com' + divurl.a['href']

Looking at the source at http://www.wired.com/magazine I think the problem may be that Wired has added a new "Bonus Video" section which does not have a feature-header div, and so the URL cannot be parsed from there. Is there anyways to have Calibre skip feature sections that don't have headers?

Thanks!

For whatever reasons, the Wired Magazine recipe now seems to be working. It may be specific to the installation of Calibre on my work desktop (Windows XP), as I had no problems with the recipe on my MacBook Pro or a Windows XP laptop.

gambarini · 06-04-2010, 07:53 AM

Is there a way to NOT rescale an immage added (or rescale with better resolution/quality)?

kovidgoyal · 06-04-2010, 10:16 AM

Choose an output profile that has a screen size large enough to accomodate the image,like the iPad output profile.

gambarini · 06-04-2010, 10:19 AM

Code:

from calibre.web.feeds.news import BasicNewsRecipe
class LaStampaParseIndex(BasicNewsRecipe):

 title                 = u'Debug Parse Index'
 cover_url             = 'http://www.lastampa.it/edicola/PDF/1.pdf'
 remove_javascript     = True
 no_stylesheets        = True


        
 def nz_parse_section(self, url):

            def get_article_url(self, article):
              link = article.get('links')
              print link
              if link:
               return link[0]['href']
            soup  = self.index_to_soup(url)
            head  = soup.findAll('div',attrs= {'class': 'entry'})
            descr = soup.findAll('div',attrs= {'class': 'feedEntryConteny'})
            dt    = soup.findAll('div',attrs= {'class': 'lastUpdated'})
            print head
            print descr
            print dt
            current_articles = []
#            a = head.find('a', href = True)
#            title       = self.tag_to_string(a)
#            url         = a.get('href', False)
#            description = self.tag_to_string(descr)
#            date        = self.tag_to_string(dt)
#            self.log('title ', title)
#            self.log('url ', url)
#            self.log('description ', description)
#            self.log('date ', date)
#            current_articles.append({'title': title, 'url': url, 'description':description, 'date':date})
            current_articles.append({'title': '', 'url':'', 'description':'', 'date':''})


            return current_articles
 keep_only_tags = [dict(attrs={'class':['boxocchiello2','titoloRub','titologir','catenaccio','sezione','articologirata']}),
                   dict(name='div', attrs={'id':'corpoarticolo'})
                  ]

 remove_tags = [dict(name='div', attrs={'id':'menutop'}),
                dict(name='div', attrs={'id':'fwnetblocco'}),
                dict(name='table', attrs={'id':'strumenti'}),
                dict(name='table', attrs={'id':'imgesterna'}),
                dict(name='a', attrs={'class':'linkblu'}),
                dict(name='a', attrs={'class':'link'}),
                dict(name='span', attrs={'class':['boxocchiello','boxocchiello2','sezione']})
               ]
 def parse_index(self):
            feeds = []
            for title, url in [(u'Politica', u'http://www.lastampa.it/redazione/cmssezioni/politica/rss_politica.xml'),
                               (u'Torino', u'http://rss.feedsportal.com/c/32418/f/466938/index.rss')
                              ]:
               print url
               articles = self.nz_parse_section(url)

               if articles:
                   feeds.append((title, articles))
            return feeds

I don't know why but the soup.findall don't find anything.
Probably it's the same problem that calibre find when parse itself the feed and don't put the correct values into title.

I don't understand why...
I am don't understand to use the normal method to parse the feeds (using get_article('links')) and override only the title.

Starson17 · 06-04-2010, 10:25 AM

Quote:

Originally Posted by gambarini

I don't know why but the soup.findall don't find anything.

I looked at the first article in your first feed. There is no div tag with class="entry". Why do you expect it to find something?

gambarini · 06-04-2010, 10:37 AM

Quote:

Originally Posted by Starson17

I looked at the first article in your first feed. There is no div tag with class="entry". Why do you expect it to find something?

oooops

Am i so newbie?!?!??!?

thanks a lot.
now that i am looking the correct source of the feed, i try to search 'title', 'description' and pubDate.

gambarini · 06-04-2010, 10:59 AM

Code:

<item>
<title><![CDATA[Alfano ai giudici: "Sciopero politico"]]></title>
<description><![CDATA[ROMA<BR>Alla vigilia della riunione del Comitato direttivo centrale dell'Anm, dove verranno fissati i tempi e le modalità dello sciopero indetto dal sindacato delle toghe contro la manovra economica del Governo, le tensioni non si placano. <BR><BR>La reazione el governo è affidata al Guardasigilli Alfano. «Lo sciopero dei magistrati è uno sciopero politico, il governo chiede ai magistrati un sacri ...(continua)]]></description>
<author><![CDATA[]]></author>
<category><![CDATA[POLITICA]]></category>
<pubDate><![CDATA[Fri, 4 Jun 2010 14:5:28 +0200]]></pubDate>
<link>http://www.lastampa.it/redazione/cmsSezioni/politica/201006articoli/55639girata.asp</link>
<enclosure url='http://www.lastampa.it/redazione/cmssezioni/politica/201006images/alfano01G.jpg' type='image/jpeg' />
		<image>
  			<url>http://www.lastampa.it/redazione/cmssezioni/politica/201006images/alfano01G.jpg</url> 
  			<title></title> 
  			<link></link> 
  			<width></width> 
  			<height></height> 
  		</image>

</item>

this is an example of item.

kidtwisted · 06-04-2010, 04:39 PM

Quote:

Originally Posted by Starson17

You're welcome and good luck. I prefer to help others figure out how to do it than to just write it. If you need help with pcper, let us know, and be sure to post your final results here so Kovid can add it to the code for use by others.

I have a couple more questions, I'm cleaning up the tweaktown.com output and ran into a problem. Using the keep_only_tags to isolate the article body then the remove_tags to pick out the bits I don't want works great for the 1st page but the tags removed come back on the 2nd page and the rest of the article.
The tag names are the same as the 1st page, not sure why they're not being removed after the 1st page.
tweaktown recipe code:

Spoiler:

Code:

class AdvancedUserRecipe1273795663(BasicNewsRecipe):
    title = u'TweakTown Latest Tech'
    description = 'TweakTown Latest Tech'
    __author__ = 'KidTwisted'
    publisher             = 'TweakTown'
    category              = 'PC Articles, Reviews and Guides'
    use_embedded_content   = False
    max_articles_per_feed = 2
    oldest_article = 7
    cover_url      = 'http://www.tweaktown.com/images/logo_white.gif'
    timefmt  = ' [%Y %b %d ]'
    no_stylesheets = True
    language = 'en'
    #recursion             = 10
    remove_javascript = True
    conversion_options = { 'linearize_tables' : True}
   # reverse_article_order = True

    html2lrf_options = [
                          '--comment', description
                        , '--category', category
                        , '--publisher', publisher
                        ]
    
    html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"' 
	
    keep_only_tags = [dict(name='div', attrs={'id':['article']})]

    remove_tags = [ dict(name='html', attrs={'id':'facebook'})
				   ,dict(name='div', attrs={'class':'article-info clearfix'})
				   ,dict(name='select', attrs={'onchange':'location.href=this.options[this.selectedIndex].value'})
				   ,dict(name='div', attrs={'class':'price-grabber'})
				   ,dict(name=['h4'])]
    feeds =  [ (u'Articles Reviews', u'http://feeds.feedburner.com/TweaktownArticlesReviewsAndGuidesRss20?format=xml') ]
    
    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           nexturl = pager['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'id':'article'})
           for it in texttag.findAll(style=True):
               del it['style']
           newpos = len(texttag.contents)          
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)
        
    
    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0,mtag)    
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           pager.extract()        
        return soup

2nd question,
I've started the pcper.com recipe and managed to get the multi-page to work on it. the problem on this is after the last page of the article they add a link that takes you back to the home page under the same tag that the pages were scraped from. The links for the pages all start with "article.php?" after the last page the link changes to "content_home.php?".

So is there a way to make the soup only scrape the links that start with "article.php?"?

Thanks

Starson17 · 06-04-2010, 07:23 PM

Quote:

Originally Posted by kidtwisted

I have a couple more questions

Aren't recipes fun!

Quote:

Using the keep_only_tags to isolate the article body then the remove_tags to pick out the bits I don't want works great for the 1st page but the tags removed come back on the 2nd page and the rest of the article.
The tag names are the same as the 1st page, not sure why they're not being removed after the 1st page.

It's likely because of the order in which the various stages of the recipe are processed. I've certainly seen this. Once you get to the point where you are building your own pages from the soup (and that's what the multipage does) you don't get the expected behavior.

I believe the keep_only throws away the tags, during the initial page pull, but doesn't apply to the extra pages you are getting with the soup2 = self.index_to_soup(nexturl) step.

I've certainly seen this before. There are lots of solutions, in fact, your recipe already uses one - extract()- to remove a tag. Just find the tags and extract them.

I usually do this at the postprocess_html stage with something like this:

Code:

        for tag in soup.findAll('form', dict(attrs={'name':["comments_form"]})):
            tag.extract()
        for tag in soup.findAll('font', dict(attrs={'id':["cr-other-headlines"]})):
            tag.extract()

extract() removes the tag entirely from the original soup, leaving you with two independent soups. In your recipe, you want the extracted tag, but it also works to remove it from the original soup, just like remove_tags.

Quote:

2nd question,
I've started the pcper.com recipe and managed to get the multi-page to work on it. the problem on this is after the last page of the article they add a link that takes you back to the home page under the same tag that the pages were scraped from. The links for the pages all start with "article.php?" after the last page the link changes to "content_home.php?".

So is there a way to make the soup only scrape the links that start with "article.php?"?

Thanks

Hmmm. It sounds like you are saying that:

Code:

        pager = soup.find('a',attrs={'class':'next'})
        if pager:

the pager <a> tag on the last page has a href content_home.php? link? If so, why not test if the pager['href'] string contains the string 'article' instead of just if pager:? You can use .find see here.

rty · 06-04-2010, 11:17 PM

Quote:

Originally Posted by Krittika Goyal

Next release will include a recipe for psychology today

Thanks Krittika for the Psychology Today's recipe. I found that your recipe can't fetch entire article that spans more than one page. This article, http://www.psychologytoday.com/artic...ectations-trap, for example, spans 5 pages and your recipe could fetch only the first page. Can you help fix it?

I would love to see the recipe fetching the cover too just like what the recipe for Time magazine does.

Scot · 06-05-2010, 02:08 AM

Hey, wondering if you guys are still taking requests?
I'm getting my dad a kobo reader for fathers day. I'm also hoping to cancel his paper subscription to save him a few extra bucks each year. He's still going to want to read some news though. I was hoping someone here could massage these RSS feeds into an aesthetically pleasing manner for me, so the transition is easier on him.

Top Stories - http://rss.cbc.ca/lineup/topstories.xml
World - http://rss.cbc.ca/lineup/world.xml
National - http://rss.cbc.ca/lineup/canada.xml
Manitoba - http://rss.cbc.ca/lineup/canada-manitoba.xml
Politics - http://rss.cbc.ca/lineup/politics.xml
Tech & Science - http://rss.cbc.ca/lineup/technology.xml
Books - http://rss.cbc.ca/lineup/arts-books.xml
Movies - http://rss.cbc.ca/lineup/arts-film.xml
Winnipeg 7 day Forecast - http://text.www.weatheroffice.gc.ca/...ty/mb-38_e.xml

Everything except weather shows up fine, but has a bunch of unnecessary text(Like there is no need for a page index, or the header to click back to the index page... Than their's the calibre footer. I'm guessing that isnt removable though, since kovidgoyal deserves credit for putting out a free application) I also notice a lot of redundancy. Stories that show up in Top Stories re-appear in national. If there was a way to have calibre ignore duplicate stories if its already been added to the file in a previous section, that would be pretty nifty.

Also, is there a way to easily change the default cover image automatically when it creates the file? Id like to have a bit centered CBC logo, since thats where all the news is sourced.

Thanks. I know this is asking a lot, and its made even worse since i'm a new here and have not contributed anything myself.

taliesin1077 · 06-05-2010, 02:01 PM

Quote:

Originally Posted by kiklop74

New recipe for Gizmodo:

Anyone else having the problem that Gizmodo's feeds don't display the full content? I get the annoying (more) link. This would be an excellent resource, but I don't know if there's any way to work around it. It's likely a protection so anyone wanting the full articles HAS to go to their site?

Starson17 · 06-05-2010, 02:25 PM

Quote:

Originally Posted by taliesin1077

Anyone else having the problem that Gizmodo's feeds don't display the full content? I get the annoying (more) link. This would be an excellent resource, but I don't know if there's any way to work around it. It's likely a protection so anyone wanting the full articles HAS to go to their site?

I'm pretty sure kiklop will fix it soon. If not, someone else will. The site must have changed.

kiklop74 · 06-05-2010, 03:13 PM

This is fixed and it will be included in the next release of calibre

rty · 06-06-2010, 02:23 AM

Quote:

Originally Posted by Scot

Hey, wondering if you guys are still taking requests?
I'm getting my dad a kobo reader for fathers day. I'm also hoping to cancel his paper subscription to save him a few extra bucks each year. He's still going to want to read some news though. I was hoping someone here could massage these RSS feeds into an aesthetically pleasing manner for me, so the transition is easier on him.

Top Stories - http://rss.cbc.ca/lineup/topstories.xml
World - http://rss.cbc.ca/lineup/world.xml
National - http://rss.cbc.ca/lineup/canada.xml
Manitoba - http://rss.cbc.ca/lineup/canada-manitoba.xml
Politics - http://rss.cbc.ca/lineup/politics.xml
Tech & Science - http://rss.cbc.ca/lineup/technology.xml
Books - http://rss.cbc.ca/lineup/arts-books.xml
Movies - http://rss.cbc.ca/lineup/arts-film.xml
Winnipeg 7 day Forecast - http://text.www.weatheroffice.gc.ca/...ty/mb-38_e.xml

Everything except weather shows up fine, but has a bunch of unnecessary text(Like there is no need for a page index, or the header to click back to the index page... Than their's the calibre footer. I'm guessing that isnt removable though, since kovidgoyal deserves credit for putting out a free application) I also notice a lot of redundancy. Stories that show up in Top Stories re-appear in national. If there was a way to have calibre ignore duplicate stories if its already been added to the file in a previous section, that would be pretty nifty.

Also, is there a way to easily change the default cover image automatically when it creates the file? Id like to have a bit centered CBC logo, since thats where all the news is sourced.

Thanks. I know this is asking a lot, and its made even worse since i'm a new here and have not contributed anything myself.

If you don't mind giving my humble recipe a shot, attached is the zip file of the recipe.

Winnipeg weather is not from the same website so I guess it's not allowed to mix sources.

If you can point to me a large enough picture for the cover page, maybe I can help.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Custom column read ?	pchrist7	Calibre	2	10-04-2010 03:52 AM
Archive for custom screensavers	sleeplessdave	Amazon Kindle	1	07-07-2010 01:33 PM
How to back up preferences and custom recipes?	greenapple	Calibre	3	03-29-2010 06:08 AM
Donations for Custom Recipes	ddavtian	Calibre	5	01-23-2010 05:54 PM
Help understanding custom recipes	andersent	Calibre	0	12-17-2009 03:37 PM

06-04-2010, 07:53 AM	#2042
gambarini Connoisseur Posts: 98 Karma: 22 Join Date: Mar 2010 Device: IRiver Story, Ipod Touch, Android SmartPhone	Is there a way to NOT rescale an immage added (or rescale with better resolution/quality)?

06-04-2010, 10:16 AM	#2043
kovidgoyal creator of calibre Posts: 45,626 Karma: 28549046 Join Date: Oct 2006 Location: Mumbai, India Device: Various	Choose an output profile that has a screen size large enough to accomodate the image,like the iPad output profile.

06-05-2010, 02:08 AM	#2051
Scot Member Posts: 11 Karma: 10 Join Date: Jun 2010 Device: PRS-505, TouchPad, iPad2	Hey, wondering if you guys are still taking requests? I'm getting my dad a kobo reader for fathers day. I'm also hoping to cancel his paper subscription to save him a few extra bucks each year. He's still going to want to read some news though. I was hoping someone here could massage these RSS feeds into an aesthetically pleasing manner for me, so the transition is easier on him. Top Stories - http://rss.cbc.ca/lineup/topstories.xml World - http://rss.cbc.ca/lineup/world.xml National - http://rss.cbc.ca/lineup/canada.xml Manitoba - http://rss.cbc.ca/lineup/canada-manitoba.xml Politics - http://rss.cbc.ca/lineup/politics.xml Tech & Science - http://rss.cbc.ca/lineup/technology.xml Books - http://rss.cbc.ca/lineup/arts-books.xml Movies - http://rss.cbc.ca/lineup/arts-film.xml Winnipeg 7 day Forecast - http://text.www.weatheroffice.gc.ca/...ty/mb-38_e.xml Everything except weather shows up fine, but has a bunch of unnecessary text(Like there is no need for a page index, or the header to click back to the index page... Than their's the calibre footer. I'm guessing that isnt removable though, since kovidgoyal deserves credit for putting out a free application) I also notice a lot of redundancy. Stories that show up in Top Stories re-appear in national. If there was a way to have calibre ignore duplicate stories if its already been added to the file in a previous section, that would be pretty nifty. Also, is there a way to easily change the default cover image automatically when it creates the file? Id like to have a bit centered CBC logo, since thats where all the news is sourced. Thanks. I know this is asking a lot, and its made even worse since i'm a new here and have not contributed anything myself.

06-05-2010, 03:13 PM	#2054
kiklop74 Guru Posts: 800 Karma: 194644 Join Date: Dec 2007 Location: Argentina Device: Kindle Voyage	This is fixed and it will be included in the next release of calibre

Advert

Advert