Old 06-04-2010, 02:24 AM   #2041
notyou
Connoisseur
 
Posts: 52
Karma: 1140
Join Date: Apr 2010
Device: Kindle / Palm Pre / iPhone
Wired fixed itself

Quote:
Originally Posted by notyou View Post
Hoping someone can help me with the Wired Magazine recipe, which the latest June 2010 issue seems to have broken.

Calibre is giving me the dreaded "AttributeError: 'NoneType' object has no attribute 'a'" error:

It's happening here:

File "c:\docume~1\darryl~1\locals~1\temp\calibre_0.6.54_jujqqs_recipes\recipe0.py", line 79, in parse_index
url = 'http://www.wired.com' + divurl.a['href']

Looking at the source at http://www.wired.com/magazine, I think the problem may be that Wired has added a new "Bonus Video" section which does not have a feature-header div, so the URL cannot be parsed from there. Is there any way to have calibre skip feature sections that don't have headers?

Thanks!
For whatever reason, the Wired Magazine recipe now seems to be working again. The failure may have been specific to the calibre installation on my work desktop (Windows XP), as I had no problems with the recipe on my MacBook Pro or a Windows XP laptop.
Old 06-04-2010, 06:53 AM   #2042
gambarini
Connoisseur
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
Is there a way to NOT rescale an added image (or to rescale it with better resolution/quality)?
Old 06-04-2010, 09:16 AM   #2043
kovidgoyal
creator of calibre
Posts: 25,355
Karma: 4961459
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Choose an output profile that has a screen size large enough to accommodate the image, like the iPad output profile.
Old 06-04-2010, 09:19 AM   #2044
gambarini
Connoisseur
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class LaStampaParseIndex(BasicNewsRecipe):

    title             = u'Debug Parse Index'
    cover_url         = 'http://www.lastampa.it/edicola/PDF/1.pdf'
    remove_javascript = True
    no_stylesheets    = True

    def nz_parse_section(self, url):

        def get_article_url(self, article):
            link = article.get('links')
            print link
            if link:
                return link[0]['href']

        soup  = self.index_to_soup(url)
        head  = soup.findAll('div', attrs={'class': 'entry'})
        descr = soup.findAll('div', attrs={'class': 'feedEntryConteny'})
        dt    = soup.findAll('div', attrs={'class': 'lastUpdated'})
        print head
        print descr
        print dt
        current_articles = []
#        a = head.find('a', href=True)
#        title       = self.tag_to_string(a)
#        url         = a.get('href', False)
#        description = self.tag_to_string(descr)
#        date        = self.tag_to_string(dt)
#        self.log('title ', title)
#        self.log('url ', url)
#        self.log('description ', description)
#        self.log('date ', date)
#        current_articles.append({'title': title, 'url': url, 'description': description, 'date': date})
        current_articles.append({'title': '', 'url': '', 'description': '', 'date': ''})
        return current_articles

    keep_only_tags = [dict(attrs={'class': ['boxocchiello2', 'titoloRub', 'titologir', 'catenaccio', 'sezione', 'articologirata']}),
                      dict(name='div', attrs={'id': 'corpoarticolo'})]

    remove_tags = [dict(name='div', attrs={'id': 'menutop'}),
                   dict(name='div', attrs={'id': 'fwnetblocco'}),
                   dict(name='table', attrs={'id': 'strumenti'}),
                   dict(name='table', attrs={'id': 'imgesterna'}),
                   dict(name='a', attrs={'class': 'linkblu'}),
                   dict(name='a', attrs={'class': 'link'}),
                   dict(name='span', attrs={'class': ['boxocchiello', 'boxocchiello2', 'sezione']})]

    def parse_index(self):
        feeds = []
        for title, url in [(u'Politica', u'http://www.lastampa.it/redazione/cmssezioni/politica/rss_politica.xml'),
                           (u'Torino', u'http://rss.feedsportal.com/c/32418/f/466938/index.rss')]:
            print url
            articles = self.nz_parse_section(url)
            if articles:
                feeds.append((title, articles))
        return feeds
I don't know why, but soup.findAll doesn't find anything.
It's probably the same problem calibre runs into when it parses the feed itself and doesn't put the correct values into the title.

I don't understand why...
I also don't understand how to use the normal method of parsing the feeds (via get_article_url and article.get('links')) while overriding only the title.
Old 06-04-2010, 09:25 AM   #2045
Starson17
Wizard
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by gambarini View Post
I don't know why, but soup.findAll doesn't find anything.
I looked at the first article in your first feed. There is no div tag with class="entry". Why do you expect it to find something?
Old 06-04-2010, 09:37 AM   #2046
gambarini
Connoisseur
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
Quote:
Originally Posted by Starson17 View Post
I looked at the first article in your first feed. There is no div tag with class="entry". Why do you expect it to find something?
Oops.

Am I really that much of a newbie?!

Thanks a lot.
Now that I'm looking at the correct source of the feed, I'll try to search for 'title', 'description' and pubDate.

Last edited by gambarini; 06-04-2010 at 09:43 AM.
Old 06-04-2010, 09:59 AM   #2047
gambarini
Connoisseur
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
Code:
<item>
<title><![CDATA[Alfano ai giudici: "Sciopero politico"]]></title>
<description><![CDATA[ROMA<BR>Alla vigilia della riunione del Comitato direttivo centrale dell'Anm, dove verranno fissati i tempi e le modalità dello sciopero indetto dal sindacato delle toghe contro la manovra economica del Governo, le tensioni non si placano. <BR><BR>La reazione el governo è affidata al Guardasigilli Alfano. «Lo sciopero dei magistrati è uno sciopero politico, il governo chiede ai magistrati un sacri ...(continua)]]></description>
<author><![CDATA[]]></author>
<category><![CDATA[POLITICA]]></category>
<pubDate><![CDATA[Fri, 4 Jun 2010 14:5:28 +0200]]></pubDate>
<link>http://www.lastampa.it/redazione/cmsSezioni/politica/201006articoli/55639girata.asp</link>
<enclosure url='http://www.lastampa.it/redazione/cmssezioni/politica/201006images/alfano01G.jpg' type='image/jpeg' />
<image>
  <url>http://www.lastampa.it/redazione/cmssezioni/politica/201006images/alfano01G.jpg</url>
  <title></title>
  <link></link>
  <width></width>
  <height></height>
</image>

</item>
This is an example of an item from the feed.
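For readers following along: here is a standalone sketch (stdlib only, no calibre required) of pulling the fields gambarini is after, title, category, pubDate and link, out of one such item. The variable names are illustrative, not from any posted recipe; ElementTree simply returns CDATA sections as plain text.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the <item> posted above.
item_xml = '''<item>
<title><![CDATA[Alfano ai giudici: "Sciopero politico"]]></title>
<category><![CDATA[POLITICA]]></category>
<pubDate><![CDATA[Fri, 4 Jun 2010 14:5:28 +0200]]></pubDate>
<link>http://www.lastampa.it/redazione/cmsSezioni/politica/201006articoli/55639girata.asp</link>
</item>'''

item = ET.fromstring(item_xml)
# findtext() returns the text content of the first matching child,
# with the CDATA wrapper already stripped.
title    = item.findtext('title')
category = item.findtext('category')
pub_date = item.findtext('pubDate')
link     = item.findtext('link')
print(title, category, pub_date, link)
```

Inside a recipe you would instead work with the soup calibre hands you, but this shows the fields are all there once you look at the actual feed source.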
Old 06-04-2010, 03:39 PM   #2048
kidtwisted
Member
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
Quote:
Originally Posted by Starson17 View Post
You're welcome and good luck. I prefer to help others figure out how to do it than to just write it. If you need help with pcper, let us know, and be sure to post your final results here so Kovid can add it to the code for use by others.
I have a couple more questions. I'm cleaning up the tweaktown.com output and ran into a problem. Using keep_only_tags to isolate the article body and then remove_tags to pick out the bits I don't want works great for the first page, but the removed tags come back on the second page and the rest of the article.
The tag names are the same as on the first page, so I'm not sure why they're not being removed after the first page.
tweaktown recipe code:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1273795663(BasicNewsRecipe):
    title = u'TweakTown Latest Tech'
    description = 'TweakTown Latest Tech'
    __author__ = 'KidTwisted'
    publisher = 'TweakTown'
    category = 'PC Articles, Reviews and Guides'
    use_embedded_content = False
    max_articles_per_feed = 2
    oldest_article = 7
    cover_url = 'http://www.tweaktown.com/images/logo_white.gif'
    timefmt = ' [%Y %b %d ]'
    no_stylesheets = True
    language = 'en'
    #recursion = 10
    remove_javascript = True
    conversion_options = {'linearize_tables': True}
    #reverse_article_order = True

    html2lrf_options = [
                          '--comment', description
                        , '--category', category
                        , '--publisher', publisher
                       ]

    html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"'

    keep_only_tags = [dict(name='div', attrs={'id': ['article']})]

    remove_tags = [dict(name='html', attrs={'id': 'facebook'}),
                   dict(name='div', attrs={'class': 'article-info clearfix'}),
                   dict(name='select', attrs={'onchange': 'location.href=this.options[this.selectedIndex].value'}),
                   dict(name='div', attrs={'class': 'price-grabber'}),
                   dict(name=['h4'])]

    feeds = [(u'Articles Reviews', u'http://feeds.feedburner.com/TweaktownArticlesReviewsAndGuidesRss20?format=xml')]

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a', attrs={'class': 'next'})
        if pager:
            nexturl = pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'id': 'article'})
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0, mtag)
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a', attrs={'class': 'next'})
        if pager:
            pager.extract()
        return soup



Second question:
I've started the pcper.com recipe and managed to get the multi-page handling to work. The problem is that after the last page of the article, the site adds a link back to the home page under the same tag the page links were scraped from. The links for the pages all start with "article.php?"; after the last page the link changes to "content_home.php?".

So is there a way to make the soup scrape only the links that start with "article.php?"?

Thanks
Old 06-04-2010, 06:23 PM   #2049
Starson17
Wizard
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kidtwisted View Post
I have a couple more questions
Aren't recipes fun!

Quote:
Using keep_only_tags to isolate the article body and then remove_tags to pick out the bits I don't want works great for the first page, but the removed tags come back on the second page and the rest of the article.
The tag names are the same as on the first page, so I'm not sure why they're not being removed after the first page.
It's likely because of the order in which the various stages of the recipe are processed; I've certainly seen this. Once you get to the point where you are building your own pages from the soup (and that's what the multi-page code does), you don't get the expected behavior.

I believe keep_only_tags throws away the unwanted tags during the initial page pull, but it doesn't apply to the extra pages you are getting with the soup2 = self.index_to_soup(nexturl) step.

There are lots of solutions; in fact, your recipe already uses one, extract(), to remove a tag. Just find the unwanted tags and extract them.

I usually do this at the postprocess_html stage with something like this:
Code:
        for tag in soup.findAll('form', dict(attrs={'name':["comments_form"]})):
            tag.extract()
        for tag in soup.findAll('font', dict(attrs={'id':["cr-other-headlines"]})):
            tag.extract()
extract() removes the tag entirely from the original soup, leaving you with two independent soups. In your recipe you keep the extracted tag, but it works equally well to discard it, just like remove_tags does.
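For readers trying this outside calibre: the same extract() idea can be sketched with the pip-installable bs4 package (calibre bundles its own copy of BeautifulSoup, so inside a recipe you would just use the soup handed to postprocess_html). The HTML snippet below mirrors the tag names in the example above but is otherwise made up.

```python
from bs4 import BeautifulSoup

html = ('<div id="article"><p>First page body</p>'
        '<form name="comments_form"><input/></form>'
        '<font id="cr-other-headlines">junk</font></div>')
soup = BeautifulSoup(html, 'html.parser')

# extract() detaches each matching tag from the soup entirely,
# which is exactly what remove_tags does during the normal pipeline.
for tag in soup.findAll('form', attrs={'name': 'comments_form'}):
    tag.extract()
for tag in soup.findAll('font', attrs={'id': 'cr-other-headlines'}):
    tag.extract()

print(soup)
```

After the two loops only the article body is left, so running the same loops over your appended pages strips the comment form and headline box from every page, not just the first.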

Quote:
Second question:
I've started the pcper.com recipe and managed to get the multi-page handling to work. The problem is that after the last page of the article, the site adds a link back to the home page under the same tag the page links were scraped from. The links for the pages all start with "article.php?"; after the last page the link changes to "content_home.php?".

So is there a way to make the soup scrape only the links that start with "article.php?"?

Thanks
Hmmm. It sounds like you are saying that:
Code:
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
the pager <a> tag on the last page has an href pointing at a content_home.php? link? If so, why not test whether the pager['href'] string contains the string 'article' instead of just testing if pager:? You can use the string .find() method for that.
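One way to express that test as a small helper; the function name and example hrefs here are hypothetical, chosen only to match the "article.php?" vs "content_home.php?" pattern kidtwisted described:

```python
def next_page_url(pager_href):
    """Return the href only if it points at an article page,
    not the back-to-home link pcper.com appends after the last page."""
    # str.find() returns -1 when the substring is absent.
    if pager_href and pager_href.find('article.php?') != -1:
        return pager_href
    return None

print(next_page_url('article.php?aid=123&pid=2'))    # followed
print(next_page_url('content_home.php?cat=0'))       # skipped
```

In append_page the guard would then become something like `if pager and next_page_url(pager.get('href')):` so the recursion stops at the last real page.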
Old 06-04-2010, 10:17 PM   #2050
rty
Zealot
 
Posts: 105
Karma: 6066
Join Date: Apr 2010
Location: Travelling Nomad in Asia
Device: iPad 2, Nook 3G, Kindle DXG, Eken M001
Quote:
Originally Posted by Krittika Goyal View Post
Next release will include a recipe for psychology today
Thanks, Krittika, for the Psychology Today recipe. I found that it can't fetch an entire article that spans more than one page. This article, http://www.psychologytoday.com/artic...ectations-trap, for example, spans 5 pages, and the recipe fetches only the first page. Can you help fix it?

I would love to see the recipe fetch the cover too, just like the Time magazine recipe does.
Old 06-05-2010, 01:08 AM   #2051
Scot
Member
 
Posts: 11
Karma: 10
Join Date: Jun 2010
Device: PRS-505, TouchPad, iPad2
Hey, wondering if you guys are still taking requests?
I'm getting my dad a Kobo reader for Father's Day. I'm also hoping to cancel his paper subscription to save him a few extra bucks each year. He's still going to want to read some news, though, so I was hoping someone here could massage these RSS feeds into an aesthetically pleasing form, to make the transition easier on him.

Top Stories - http://rss.cbc.ca/lineup/topstories.xml
World - http://rss.cbc.ca/lineup/world.xml
National - http://rss.cbc.ca/lineup/canada.xml
Manitoba - http://rss.cbc.ca/lineup/canada-manitoba.xml
Politics - http://rss.cbc.ca/lineup/politics.xml
Tech & Science - http://rss.cbc.ca/lineup/technology.xml
Books - http://rss.cbc.ca/lineup/arts-books.xml
Movies - http://rss.cbc.ca/lineup/arts-film.xml
Winnipeg 7 day Forecast - http://text.www.weatheroffice.gc.ca/...ty/mb-38_e.xml

Everything except the weather shows up fine, but there's a bunch of unnecessary text (there's no need for a page index, or for the header link back to the index page; then there's the calibre footer, though I'm guessing that isn't removable, since kovidgoyal deserves credit for putting out a free application). I also notice a lot of redundancy: stories that show up in Top Stories re-appear in National. If there were a way to have calibre ignore duplicate stories that have already been added to the file in a previous section, that would be pretty nifty.

Also, is there a way to automatically change the default cover image when it creates the file? I'd like a big, centered CBC logo, since that's where all the news is sourced.

Thanks. I know this is asking a lot, and it's made worse by the fact that I'm new here and haven't contributed anything myself.
Old 06-05-2010, 01:01 PM   #2052
taliesin1077
Junior Member
 
Posts: 3
Karma: 10
Join Date: Jun 2010
Device: Kindle
Quote:
Originally Posted by kiklop74 View Post
New recipe for Gizmodo:
Is anyone else having the problem that Gizmodo's feeds don't display the full content? I get the annoying (more) link. This would be an excellent resource, but I don't know if there's any way to work around it. Is it perhaps a protection so that anyone wanting the full articles has to go to their site?
Old 06-05-2010, 01:25 PM   #2053
Starson17
Wizard
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by taliesin1077 View Post
Is anyone else having the problem that Gizmodo's feeds don't display the full content? I get the annoying (more) link. This would be an excellent resource, but I don't know if there's any way to work around it. Is it perhaps a protection so that anyone wanting the full articles has to go to their site?
I'm pretty sure kiklop will fix it soon. If not, someone else will. The site must have changed.
Old 06-05-2010, 02:13 PM   #2054
kiklop74
Guru
Posts: 779
Karma: 194642
Join Date: Dec 2007
Location: Argentina
Device: Kindle PaperWhite, Motorola Xoom
This is fixed, and the fix will be included in the next release of calibre.
Old 06-06-2010, 01:23 AM   #2055
rty
Zealot
 
Posts: 105
Karma: 6066
Join Date: Apr 2010
Location: Travelling Nomad in Asia
Device: iPad 2, Nook 3G, Kindle DXG, Eken M001
Quote:
Originally Posted by Scot View Post
Hey, wondering if you guys are still taking requests?
[...]
If you don't mind giving my humble recipe a shot, attached is a zip file of the recipe.

Winnipeg weather is not from the same website, so I'm guessing mixing sources isn't allowed.

If you can point me to a large enough picture for the cover page, maybe I can help with that too.
Attached Files
File Type: zip CBC News Canada.zip (534 Bytes, 55 views)
MobileRead.com is a privately owned, operated and funded community.