05-31-2010, 07:34 PM | #2011 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
New recipe for Bosnian portal sarajevo-x.com:
|
05-31-2010, 10:30 PM | #2012 |
Zealot
Posts: 125
Karma: 314
Join Date: Apr 2010
Location: Canada, Eh!
Device: Kobo
|
Are there any recipes specifically for the FIFA 2010 World Cup feeds? A couple on fifa.com that would be nice are:
Latest News: http://www.fifa.com/rss/index.xml
2010 FIFA World Cup South Africa: http://www.fifa.com/worldcup/news/rss.xml |
05-31-2010, 10:54 PM | #2013 |
Enthusiast
Posts: 49
Karma: 10
Join Date: Aug 2009
Device: none
|
Can anyone help me with a recipe for this magazine?
http://www.foodprocessing360.com/ind...ate=12/05/2009 |
05-31-2010, 11:53 PM | #2014 | |
Junior Member
Posts: 3
Karma: 10
Join Date: May 2010
Location: Calgary, AB, Canada
Device: iPad
|
Quote:
I am pretty wimpy when it comes to python. I can sed and perl pretty well, can do a bit with awk, but python just makes my brain hurt. That's part of why I like sitescooper so much and the simplicity of their .site files. Thanks for your suggestion, though! Cheers! |
|
06-01-2010, 12:02 AM | #2015 | |
Junior Member
Posts: 3
Karma: 10
Join Date: May 2010
Location: Calgary, AB, Canada
Device: iPad
|
Quote:
Cheers! |
|
06-01-2010, 02:09 AM | #2016 |
Enthusiast
Posts: 33
Karma: 10
Join Date: May 2010
Device: Bookeen Cybook Gen3 Gold
|
|
06-01-2010, 09:51 AM | #2017 |
Connoisseur
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
|
http://bugs.calibre-ebook.com/wiki/recipeGuide_advanced
At this link I found an example of parse_index, and it's a good method to create a feed with a complete list of articles. So now I'm trying to use parse_index in two different ways: 1) to override only the titles (because they are missing from the feed, while the other fields (description, url, date) are correct); 2) to create a complete feed from the real front page of the newspaper. The second way is now clear to me, but the first is not at all. |
06-01-2010, 09:56 AM | #2018 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Clearly, I wasn't clear that what you wrote wasn't clear to me. To clear things up, I have to ask you to be more clear. I'm sure that it is now clear that your thanks are premature.
(To rephrase the above: why don't you repost your questions, in greater detail, if you still have any? I really couldn't figure out what help you were asking for.) Edit: I see you did that while I was writing my comment. Last edited by Starson17; 06-01-2010 at 09:58 AM. |
06-01-2010, 10:20 AM | #2019 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
feeds.append((title, articles)) of parse_index. The "title" there is the feed title. The articles for each feed are created in nz_parse_section of the example, in this line:

current_articles.append({'title': title, 'url': url, 'description':'', 'date':''})

The "title" there is the article title. It appears you want to control the article titles, not the feed title. I'd do it this way:

First, I'd use parse_index to process each RSS feed I want (you may only need one). parse_index will treat each RSS feed page as a web page, and you can grab what you want from that page using BeautifulSoup. I'd use a modified version of nz_parse_section to find the {'title': title, 'url': url, 'description':'', 'date':''} data for each article on the page being processed.

As I grab that data for each article, I'd test the title to see if it's what I want to appear. You said they are usually OK. If a title isn't OK, you'll need to either create one, if you can, or go to the URL and get a title from that page (again, BeautifulSoup is used to grab the info you want). Once you are happy with the data for the article, append it to the current_articles list. When you're done with the page, control returns to parse_index and your titles will be as you want them.

It sounds like a lot of trouble, but I don't see any other way to do it. |
|
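A minimal sketch of the title-fixing step described above, in plain Python (no calibre or BeautifulSoup imports, so it runs standalone). The fallback heuristic — deriving a title from the last URL path segment — is my own assumption, not something from the thread; a real recipe might instead fetch the article page and grab its headline with BeautifulSoup, as the post suggests:

```python
def fix_title(article, fallback='Untitled'):
    """Return a copy of the article dict with a usable 'title'.

    If the title scraped from the feed page is empty, fall back to a title
    derived from the last path segment of the article URL (a hypothetical
    heuristic; adapt to the site being scraped).
    """
    title = (article.get('title') or '').strip()
    if not title:
        # e.g. '.../news/world-cup-opener' -> 'World Cup Opener'
        slug = article['url'].rstrip('/').rsplit('/', 1)[-1]
        title = slug.replace('-', ' ').replace('_', ' ').title() or fallback
    return {**article, 'title': title}
```

In a recipe, this would be called on each dict just before appending it to current_articles.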
06-01-2010, 10:30 AM | #2020 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
06-01-2010, 05:55 PM | #2021 | |
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
|
Quote:
Newsweek: Bizarre - Guy stung in rear by numerous bees ends up harboring a honeycomb in his rectum. ...and then it linked me to some story. I'll look at the RSS feed; that's probably what I need anyway. Thanks for the help though. |
|
06-01-2010, 09:59 PM | #2022 | |
Member
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
|
Quote:
I have 2 sites that I'm trying to get the multi-page code working on, pcper.com and tweaktown.com. Both sites have similar layouts, though tweaktown.com's source code seems a bit better to learn with, so I've been working with that one. I'm kinda stuck: when I add the append_page code, the test HTML only contains the feed description and date; without it I get the 1st page, so I'm screwing it up somewhere. Here's what I have for tweaktown.com: Code:
class AdvancedUserRecipe1273795663(BasicNewsRecipe):
    title = u'TweakTown Latest Tech'
    description = 'TweakTown Latest Tech'
    __author__ = 'KidTwisted'
    publisher = 'TweakTown'
    category = 'PC Articles, Reviews and Guides'
    use_embedded_content = False
    max_articles_per_feed = 1
    oldest_article = 7
    timefmt = ' [%Y %b %d ]'
    no_stylesheets = True
    language = 'en'
    #recursion = 10
    remove_javascript = True
    conversion_options = {'linearize_tables': True}
    # reverse_article_order = True
    #INDEX = u'http://www.tweaktown.com'

    html2lrf_options = [
        '--comment', description,
        '--category', category,
        '--publisher', publisher
    ]

    html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"'

    keep_only_tags = [dict(name='div', attrs={'id':['article']})]

    feeds = [
        (u'Articles Reviews', u'http://feeds.feedburner.com/TweaktownArticlesReviewsAndGuidesRss20?format=xml')
    ]

    def get_article_url(self, article):
        return article.get('guid', None)

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a', attrs={'class':'next'})
        if pager:
            nexturl = pager.a['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'id':'article'})
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0, mtag)
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a', attrs={'class':'next'})
        if pager:
            pager.extract()
        return soup
Last edited by kidtwisted; 06-01-2010 at 11:36 PM. |
|
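One thing that stands out in the posted recipe: soup.find('a', attrs={'class':'next'}) already returns the anchor tag itself, so pager.a['href'] then searches for an <a> nested *inside* that anchor, finds nothing, and the recipe likely fails right there; the one-line fix is probably nexturl = pager['href']. A standalone stdlib sketch of the same extraction (hypothetical markup, no BeautifulSoup), just to illustrate that the href lives on the matched tag itself:

```python
from html.parser import HTMLParser

class NextLinkFinder(HTMLParser):
    """Records the href of the first <a class="next"> element seen."""
    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        if self.next_url is None and tag == 'a':
            d = dict(attrs)
            if d.get('class') == 'next':
                # the href is an attribute of this very tag,
                # not of some tag nested inside it
                self.next_url = d.get('href')

def find_next_url(html):
    parser = NextLinkFinder()
    parser.feed(html)
    return parser.next_url
```

In the recipe itself the equivalent change would be replacing pager.a['href'] with pager['href'] inside append_page.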
06-01-2010, 11:17 PM | #2023 |
Member
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
|
Just a side thought to my previous post: both of those sites use article-index drop-down boxes that contain links to all the pages of the article.
example source code from pcper.com: Code:
<form method="post" action="/article.php">
  <b>Review Index:</b><br>
  <select style="font-size: 75%;" onchange="location.href=form.url.options[form.url.selectedIndex].value" name="url">
    <option select=""> - Select - </option>
    <option value="article.php?aid=926&type=expert&pid=1" select="">A complete lineup</option>
    <option value="article.php?aid=926&type=expert&pid=2" select="">FirePro V7800 and V4800 Cards</option>
    <option value="article.php?aid=926&type=expert&pid=3" select="">Testing Methodology, System Setup and CineBench 11/10</option>
    <option value="article.php?aid=926&type=expert&pid=4" select="">SPECviewperf 10</option>
    <option value="article.php?aid=926&type=expert&pid=5" select="">SPECviewperf 10 - Multisample Testing</option>
    <option value="article.php?aid=926&type=expert&pid=6" select="">SPECviewperf 10 - Multithreaded testing</option>
    <option value="article.php?aid=926&type=expert&pid=7" select="">3DMark Vantage</option>
    <option value="article.php?aid=926&type=expert&pid=8" select="">Power Consumption and Conclusions</option>
  </select>
</form> |
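Since that drop-down lists every page of the article up front, a recipe could skip next-link chasing entirely and just collect the <option> values. A minimal stdlib sketch of that idea (no BeautifulSoup; in a real recipe the relative values would still need to be joined against the site's base URL):

```python
from html.parser import HTMLParser

class OptionCollector(HTMLParser):
    """Collects the value attribute of every <option> in the page index."""
    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == 'option':
            value = dict(attrs).get('value')
            if value:  # skips the "- Select -" placeholder, which has no value
                self.values.append(value)
```

Each collected value (e.g. 'article.php?aid=926&type=expert&pid=2') identifies one page of the article; the recipe would fetch each one and append its article div, instead of recursing through "next" links.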
06-02-2010, 03:07 AM | #2024 | |
Connoisseur
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
|
Quote:
First I must process the feed and try to find the title, description, date and url, and then use these values to override the automatic calibre values. It is not so simple (for me) to understand the correct way to do that, or the correct sequence for every step of the process. I am not so familiar with object-oriented languages... Creating a whole new feed is actually clearer in my mind. Edit: the nzherald approach doesn't work. Last edited by gambarini; 06-02-2010 at 05:01 AM. |
|
06-02-2010, 04:48 AM | #2025 | |
Connoisseur
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
|
Quote:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class LaStampaParseIndex(BasicNewsRecipe):
    title = u'Debug Parse Index'
    cover_url = 'http://www.lastampa.it/edicola/PDF/1.pdf'
    remove_javascript = True
    no_stylesheets = True

    def nz_parse_section(self, url):
        soup = self.index_to_soup(url)
        head = soup.find(attrs={'class': 'entry'})
        descr = soup.find(attrs={'class': 'feedEntryConteny'})
        dt = soup.find(attrs={'class': 'lastUpdated'})
        current_articles = []
        a = head.find('a', href=True)
        title = self.tag_to_string(a)
        url = a.get('href', False)
        description = self.tag_to_string(descr)
        date = self.tag_to_string(dt)
        self.log('title ', title)
        self.log('url ', url)
        self.log('description ', description)
        self.log('date ', date)
        current_articles.append({'title': title, 'url': url, 'description': description, 'date': date})
        return current_articles

    keep_only_tags = [
        dict(attrs={'class':['boxocchiello2','titoloRub','titologir','catenaccio','sezione','articologirata']}),
        dict(name='div', attrs={'id':'corpoarticolo'})
    ]

    remove_tags = [
        dict(name='div', attrs={'id':'menutop'}),
        dict(name='div', attrs={'id':'fwnetblocco'}),
        dict(name='table', attrs={'id':'strumenti'}),
        dict(name='table', attrs={'id':'imgesterna'}),
        dict(name='a', attrs={'class':'linkblu'}),
        dict(name='a', attrs={'class':'link'}),
        dict(name='span', attrs={'class':['boxocchiello','boxocchiello2','sezione']})
    ]

    def parse_index(self):
        feeds = []
        for title, url in [
            (u'Politica', u'http://www.lastampa.it/redazione/cmssezioni/politica/rss_politica.xml'),
            (u'Torino', u'http://rss.feedsportal.com/c/32418/f/466938/index.rss')
        ]:
            articles = self.nz_parse_section(url)
            if articles:
                feeds.append((title, articles))
        return feeds |
|
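One caveat with the recipe above: soup.find(attrs={'class': 'entry'}) returns only the *first* matching entry, so each feed will contain at most one article. To get them all, every entry block needs to be iterated (findAll in BeautifulSoup). A stdlib sketch of that iteration, assuming hypothetical flat <div class="entry"> blocks each wrapping one link:

```python
from html.parser import HTMLParser

class EntryCollector(HTMLParser):
    """Builds a calibre-style article dict for every class="entry" block."""
    def __init__(self):
        super().__init__()
        self.articles = []
        self.in_entry = False
        self.href = None
        self.text = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if d.get('class') == 'entry':
            self.in_entry = True
        elif self.in_entry and tag == 'a':
            self.href = d.get('href')

    def handle_data(self, data):
        if self.href is not None:  # accumulating the link text
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.href is not None:
            self.articles.append({'title': ''.join(self.text).strip(),
                                  'url': self.href,
                                  'description': '', 'date': ''})
            self.href, self.text = None, []
        elif tag == 'div' and self.in_entry:
            self.in_entry = False
```

The BeautifulSoup equivalent inside nz_parse_section would be a for-loop over soup.findAll(attrs={'class': 'entry'}), appending one dict per entry instead of returning after the first.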
|