Old 05-31-2010, 07:34 PM   #2011
kiklop74
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
New recipe for Bosnian portal sarajevo-x.com:
Attached Files
File Type: zip sarajevo-x.zip (1.8 KB, 225 views)
Old 05-31-2010, 10:30 PM   #2012
RedRoverJ
Zealot
Posts: 125
Karma: 314
Join Date: Apr 2010
Location: Canada, Eh!
Device: Kobo
Are there any recipes specifically for the FIFA 2010 World Cup feeds? A couple on fifa.com that would be nice are:

Latest News:
http://www.fifa.com/rss/index.xml

2010 FIFA World Cup South Africa:
http://www.fifa.com/worldcup/news/rss.xml
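As a starting point rather than a tested builtin, a recipe for those two feeds could be as small as the sketch below. The class name `FIFAWorldCup2010` and the `oldest_article` and `max_articles_per_feed` values are assumptions chosen for illustration; only the feed titles and URLs come from the post above. The `try`/`except` stand-in is there only so the file can be imported outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so this sketch can be imported outside calibre (assumption).
    class BasicNewsRecipe(object):
        pass


class FIFAWorldCup2010(BasicNewsRecipe):  # hypothetical class name
    title = u'FIFA 2010 World Cup'
    language = 'en'
    oldest_article = 7           # days; assumed value
    max_articles_per_feed = 50   # assumed value
    no_stylesheets = True
    remove_javascript = True

    # (feed title, feed URL) pairs taken from the post above
    feeds = [
        (u'Latest News', u'http://www.fifa.com/rss/index.xml'),
        (u'2010 FIFA World Cup South Africa',
         u'http://www.fifa.com/worldcup/news/rss.xml'),
    ]
```

Whether these feeds carry full articles or only summaries has not been checked here, so some cleanup with keep_only_tags or remove_tags would probably still be needed.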
Old 05-31-2010, 10:54 PM   #2013
bhandarisaurabh
Enthusiast
Posts: 49
Karma: 10
Join Date: Aug 2009
Device: none
Can anyone help me with a recipe for this magazine?
http://www.foodprocessing360.com/ind...ate=12/05/2009
Old 05-31-2010, 11:53 PM   #2014
23n
Junior Member
Posts: 3
Karma: 10
Join Date: May 2010
Location: Calgary, AB, Canada
Device: iPad
Quote:
Originally Posted by Starson17 View Post
This web page reproduces the RSS feed (at least for the first 3 feeds I checked.) Calibre has a builtin recipe for The Register RSS feed. Why don't you look at that one first to see if it meets your needs.
Ya, I looked at it, but it doesn't give the full week of stories (trust me, I need the full week at times). Also, I would really appreciate the days being headers in the TOC. I've developed a really strong preference for reading that site by date rather than by section. As it is, I'm using the recipe that comes with Calibre but have it getting only the week.html page. And none of the articles from the reghardware.com site are retrieved, which is the other issue I identified.

I am pretty wimpy when it comes to Python. I can sed and perl pretty well, and can do a bit with awk, but Python just makes my brain hurt. That's part of why I like sitescooper so much, along with the simplicity of its .site files.

Thanks for your suggestion, though!

Cheers!
Old 06-01-2010, 12:02 AM   #2015
23n
Junior Member
Posts: 3
Karma: 10
Join Date: May 2010
Location: Calgary, AB, Canada
Device: iPad
Quote:
Originally Posted by Starson17 View Post
If you want to try it yourself, this needs parse_index. Look here.
Thanks for the suggestion. I haven't seen this page before, so I'll look into it. I was trying to admit I suck at Python and was bowing to the practiced and superior abilities that I've seen on this forum. I'm guessing that four hours of my time trying to get this done would be about 15 minutes (or less!) of effort from some of the people on this forum. I will eventually try to do it myself, but I thought that if someone was able to quickly achieve what I'm asking, it would save me (literally) hours of frustration and garner heaps of my appreciation!

Cheers!
Old 06-01-2010, 02:09 AM   #2016
Newby
Enthusiast
Posts: 33
Karma: 10
Join Date: May 2010
Device: Bookeen Cybook Gen3 Gold
Quote:
Originally Posted by kiklop74 View Post
New recipe for Bosnian portal sarajevo-x.com:

Thank you very much!
Old 06-01-2010, 09:51 AM   #2017
gambarini
Connoisseur
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
http://bugs.calibre-ebook.com/wiki/recipeGuide_advanced

At this link I can find an example of parse_index, which is a good method for creating a feed with a complete list of articles.
So now I am trying to use parse_index in two different ways:

-) to override only the title (because it is missing from the feed, while the other fields (description, url, date) are correct).
-) to create a complete feed from the newspaper's actual front page.

The second way is now clear, but the first is actually not clear at all.
Old 06-01-2010, 09:56 AM   #2018
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by gambarini View Post

thanks in advance
Clearly, I wasn't clear that what you wrote wasn't clear to me. To clear things up, I have to ask you to be more clear. I'm sure that it is now clear that your thanks are premature.

(To rephrase the above: Why don't you repost your questions, in greater detail, if you still have any. I really couldn't figure out what help you were asking for.)

Edit: I see you did that .. while I was writing my comment.

Last edited by Starson17; 06-01-2010 at 09:58 AM.
Old 06-01-2010, 10:20 AM   #2019
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by gambarini View Post
http://bugs.calibre-ebook.com/wiki/recipeGuide_advanced

At this link I can find an example of parse_index, which is a good method for creating a feed with a complete list of articles.
So now I am trying to use parse_index in two different ways:

-) to override only the title (because it is missing from the feed, while the other fields (description, url, date) are correct).
-) to create a complete feed from the newspaper's actual front page.

The second way is now clear, but the first is actually not clear at all.
So what have you tried? The page you reference explains how parse_index works. You create your own set of feeds. Each feed has a title and a set of articles. The set of feeds is created in the line:
feeds.append((title, articles))
of parse_index. The "title" there is the feed title.

The articles for each feed are created in nz_parse_section of the example in this line:

current_articles.append({'title': title, 'url': url, 'description':'', 'date':''})

The "title" there is the article title.

It appears you want to control the article titles, not the feed title. I'd do it this way:

First, I'd use parse_index to process each RSS feed I want (you may only need one). Parse_index will treat each RSS feed page as a web page. You can grab what you want from that page using BeautifulSoup. I'd use a modified version of nz_parse_section to find each {'title': title, 'url': url, 'description':'', 'date':''} for each article on the page being processed. As I grab that data for each article, I'd test the title to see if it's what I want to appear. You said they are usually OK. If they aren't OK, you'll need to either create a title, if you can, or go to the URL and get a title from that page (again, BeautifulSoup is used to grab the info you want). Once you are happy with the data for the article, you append it to the current_articles list.

When you're done with the page, it returns to parse_index and your titles will be as you want them.

It sounds like a lot of trouble, but I don't see any other way to do it.
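The title check described above can be sketched on its own, separate from any page parsing. This is only an illustration under stated assumptions: `ensure_title`, `build_feed`, and the `fetch_title` callback are hypothetical names, and the rule of building a title from the first eight words of the description is an arbitrary choice. Inside a recipe, `fetch_title` would be whatever code opens the article URL and pulls a title out with BeautifulSoup.

```python
def ensure_title(article, fetch_title=None):
    """Return a copy of `article` with a usable 'title' (hypothetical helper)."""
    fixed = dict(article)
    title = (fixed.get('title') or '').strip()
    if not title:
        description = (fixed.get('description') or '').strip()
        if description:
            # Build a title from the first few words of the description.
            title = ' '.join(description.split()[:8])
        elif fetch_title is not None:
            # Last resort: let the caller read a title from the article page.
            title = fetch_title(fixed.get('url', ''))
    fixed['title'] = title or 'Untitled'
    return fixed


def build_feed(feed_title, raw_articles, fetch_title=None):
    """Return one (feed title, articles) pair of the kind parse_index collects."""
    articles = [ensure_title(a, fetch_title) for a in raw_articles]
    return (feed_title, articles)
```

In parse_index, each pair returned by `build_feed` would be appended to `feeds`, so the feed title and the article titles stay under separate control.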
Old 06-01-2010, 10:30 AM   #2020
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by 23n View Post
Thanks for the suggestion. I haven't seen this page before, so I'll look into it. I was trying to admit I suck at Python and was bowing to the practiced and superior abilities that I've seen on this forum. I'm guessing that four hours of my time trying to get this done would be about 15 minutes (or less!) of effort from some of the people on this forum. I will eventually try to do it myself, but I thought that if someone was able to quickly achieve what I'm asking, it would save me (literally) hours of frustration and garner heaps of my appreciation!

Cheers!
Recipes that require parsing a web page for feed info typically take more time than recipes based on an RSS feed. With the RSS feed, you just have to clean up the output. With a web page, you first have to write the code to parse the page, and that just gets you to the same point you would start at with an RSS feed. You looked like someone who might have the skill and desire to DIY, so I thought I'd steer you towards where you could find the info you'd need if you wanted to try.
Old 06-01-2010, 05:55 PM   #2021
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by Starson17 View Post
Fark has an RSS feed, and I looked at it. It seems to have a one sentence description of an article on another site and a slew of comments. Do you just want the one sentence from Fark with the link, or do you want the comments? The content of the linked articles is probably too variable to easily add, as it comes from dozens of different sources, each with a different page structure. You'd get lots of junk with each one.
I didn't realize they had an RSS feed. I will look at that. I really just wanted what you mentioned: a list of links with the Fark comment, like for instance

Newsweek: Bizarre - Guy stung in rear by numerous bees ends up harboring a honeycomb in his rectum.

and then it would link me to the story. I'll look at the RSS feed; that is probably what I need anyway. Thanks for the help, though.
Old 06-01-2010, 09:59 PM   #2022
kidtwisted
Member
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
Quote:
Originally Posted by Starson17 View Post
You need to use multipage code. Here's an example from the adventuregamers.recipe builtin:

Code:
    def append_page(self, soup, appendtag, position):
        pager = soup.find('div',attrs={'class':'toolbar_fat_next'})
        if pager:
           nexturl = self.INDEX + pager.a['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'class':'bodytext'})
           for it in texttag.findAll(style=True):
               del it['style']
           newpos = len(texttag.contents)          
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)
        
    
    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0,mtag)    
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('div',attrs={'class':'toolbar_fat'})
        if pager:
           pager.extract()        
        return soup
append_page recursively looks for the next page tag ('div',attrs={'class':'toolbar_fat_next'}), gets the text and inserts it into the soup at the point where the tag was found until all pages have been inserted.

preprocess_html uses append_page to modify the html. You'll need to look for the next page tag on your site and adjust accordingly. This should get you started.

Do your testing with -vv and --test
as in:
ebook-convert pcper.recipe pcper --test -vv> pcper.txt
Hey Starson17,
I have 2 sites that I'm trying to get the multi-page code working on, pcper.com and tweaktown.com. Both sites have similar layouts, though the tweaktown.com source code seems a bit better to learn with, so I've been working with that one.

I'm kinda stuck: when I add the append_page code, the test HTML contains only the feed description and date. Without it I get the first page, so I'm screwing it up somewhere.

Here's what I have for tweaktown.com:
Code:
class AdvancedUserRecipe1273795663(BasicNewsRecipe):
    title = u'TweakTown Latest Tech'
    description = 'TweakTown Latest Tech'
    __author__ = 'KidTwisted'
    publisher             = 'TweakTown'
    category              = 'PC Articles, Reviews and Guides'
    use_embedded_content   = False
    max_articles_per_feed = 1
    oldest_article = 7
    timefmt  = ' [%Y %b %d ]'
    no_stylesheets = True
    language = 'en'
    #recursion             = 10
    remove_javascript = True
    conversion_options = { 'linearize_tables' : True}
   # reverse_article_order = True
    #INDEX                 = u'http://www.tweaktown.com'

    html2lrf_options = [
                          '--comment', description
                        , '--category', category
                        , '--publisher', publisher
                        ]
    
    html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"' 
	
    keep_only_tags = [dict(name='div', attrs={'id':['article']})]

    feeds =  [ (u'Articles Reviews', u'http://feeds.feedburner.com/TweaktownArticlesReviewsAndGuidesRss20?format=xml') ]

    def get_article_url(self, article):
        return article.get('guid',  None)
    
    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           nexturl = pager.a['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'id':'article'})
           for it in texttag.findAll(style=True):
               del it['style']
           newpos = len(texttag.contents)          
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)
        
    
    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0,mtag)    
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           pager.extract()        
        return soup
Could you or someone in the know take a look at it to see what I'm doing wrong? I commented out "INDEX" because the link for the next page is a complete link. Any help on this would be great.
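One likely culprit, offered as a sketch rather than a confirmed diagnosis: in the adventuregamers example the pager is a div that contains the link, so `pager.a['href']` works there. In the tweaktown version `soup.find('a', ...)` already returns the anchor itself, so `pager.a` looks for a second anchor nested inside it, finds none, and the `['href']` lookup then fails on None. The helper below reads the href from the tag directly. `find_next_url` and its `base_url` parameter are hypothetical names; the `'next'` class is taken from the recipe above.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2


def find_next_url(soup, base_url=''):
    """Return the absolute URL of the 'next page' link, or None."""
    pager = soup.find('a', attrs={'class': 'next'})
    if pager is None:
        return None
    # `pager` is already the <a> tag, so read href from it directly;
    # `pager.a` would look for another <a> nested inside it.
    href = pager.get('href')
    if not href:
        return None
    # urljoin leaves complete links untouched and resolves relative ones.
    return urljoin(base_url, href)
```

In append_page this would replace the `nexturl = pager.a['href']` line, with the rest of the method running only when a URL comes back. It may also be worth checking whether the "next" link sits inside the div with id "article": if keep_only_tags is applied before preprocess_html runs, a link outside that div would already be gone.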

Last edited by kidtwisted; 06-01-2010 at 11:36 PM.
Old 06-01-2010, 11:17 PM   #2023
kidtwisted
Member
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
Just a side thought to my previous post: both of those sites use article index drop-down boxes that contain links to all the pages of the article.
Example source code from pcper.com:
Code:
<form method="post" action="/article.php">
		  <b>Review Index:</b><br>
		  <select style="font-size: 75%;" onchange="location.href=form.url.options[form.url.selectedIndex].value" name="url">
		 	 <option select=""> - Select - </option>

       <option value="article.php?aid=926&amp;type=expert&amp;pid=1" select="">A complete lineup</option>
  
       <option value="article.php?aid=926&amp;type=expert&amp;pid=2" select="">FirePro V7800 and V4800 Cards</option>
  
       <option value="article.php?aid=926&amp;type=expert&amp;pid=3" select="">Testing Methodology, System Setup and CineBench 11/10</option>
  
       <option value="article.php?aid=926&amp;type=expert&amp;pid=4" select="">SPECviewperf 10</option>
  
       <option value="article.php?aid=926&amp;type=expert&amp;pid=5" select="">SPECviewperf 10 - Multisample Testing</option>
  
       <option value="article.php?aid=926&amp;type=expert&amp;pid=6" select="">SPECviewperf 10 - Multithreaded testing</option>
  
       <option value="article.php?aid=926&amp;type=expert&amp;pid=7" select="">3DMark Vantage</option>
  
       <option value="article.php?aid=926&amp;type=expert&amp;pid=8" select="">Power Consumption and Conclusions</option>
  
      </select>
		 </form>
How could I build my article from there instead of the next page button?
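One way to approach this, sketched with only the standard library so it can be tried outside calibre: collect the `value` of every option in the `select name="url"` box and resolve each against the site address. `OptionCollector` and `page_urls` are hypothetical names, and the base URL passed in is an assumption about where the relative links point. Inside a recipe the same list could probably be built with `soup.find('select', attrs={'name':'url'})` and `findAll('option')`.

```python
try:
    from html.parser import HTMLParser   # Python 3
    from urllib.parse import urljoin
except ImportError:
    from HTMLParser import HTMLParser    # Python 2
    from urlparse import urljoin


class OptionCollector(HTMLParser):
    """Collect the value of every <option> inside <select name="url">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_select = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'select' and attrs.get('name') == 'url':
            self.in_select = True
        elif tag == 'option' and self.in_select and attrs.get('value'):
            # The "- Select -" placeholder has no value and is skipped.
            self.values.append(attrs['value'])

    def handle_endtag(self, tag):
        if tag == 'select':
            self.in_select = False


def page_urls(html, base_url):
    """Return absolute URLs for all pages listed in the drop-down box."""
    parser = OptionCollector()
    parser.feed(html)
    parser.close()
    return [urljoin(base_url, value) for value in parser.values]
```

With the full list in hand, the pages after the first could be fetched in a plain loop and appended in order, instead of recursing on a "next" link.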
Old 06-02-2010, 03:07 AM   #2024
gambarini
Connoisseur
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
Quote:
Originally Posted by Starson17 View Post
So what have you tried? The page you reference explains how parse_index works. You create your own set of feeds. Each feed has a title and a set of articles. The set of feeds is created in the line:
feeds.append((title, articles))
of parse_index. The "title" there is the feed title.

The articles for each feed are created in nz_parse_section of the example in this line:

current_articles.append({'title': title, 'url': url, 'description':'', 'date':''})

The "title" there is the article title.

It appears you want to control the article titles, not the feed title. I'd do it this way:

First, I'd use parse_index to process each RSS feed I want (you may only need one). Parse_index will treat each RSS feed page as a web page. You can grab what you want from that page using BeautifulSoup. I'd use a modified version of nz_parse_section to find each {'title': title, 'url': url, 'description':'', 'date':''} for each article on the page being processed. As I grab that data for each article, I'd test the title to see if it's what I want to appear. You said they are usually OK. If they aren't OK, you'll need to either create a title, if you can, or go to the URL and get a title from that page (again, BeautifulSoup is used to grab the info you want). Once you are happy with the data for the article, you append it to the current_articles list.

When you're done with the page, it returns to parse_index and your titles will be as you want them.

It sounds like a lot of trouble, but I don't see any other way to do it.
Now the way is completely clear:
first I must process the feed and try to find the title, description, date and url, and then use these values to override calibre's automatic values.
It is not so simple (for me) to understand the correct way to do that and the correct sequence for every step of the process. I am not so familiar with object-oriented languages...
Creating a whole new feed is actually clearer in my mind.

Edit:
The nzherald example doesn't work.

Last edited by gambarini; 06-02-2010 at 05:01 AM.
Old 06-02-2010, 04:48 AM   #2025
gambarini
Connoisseur
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
Quote:
Originally Posted by Starson17 View Post
I suspect there might be some questions here that I can help with.... but perhaps not

More info about whether there's a question and what it is might help me decide.
This is my recipe:
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class LaStampaParseIndex(BasicNewsRecipe):

    title                 = u'Debug Parse Index'
    cover_url             = 'http://www.lastampa.it/edicola/PDF/1.pdf'
    remove_javascript     = True
    no_stylesheets        = True

    def nz_parse_section(self, url):
        soup  = self.index_to_soup(url)
        head  = soup.find(attrs= {'class': 'entry'})
        descr = soup.find(attrs= {'class': 'feedEntryConteny'})
        dt    = soup.find(attrs= {'class': 'lastUpdated'})

        current_articles = []
        a = head.find('a', href = True)
        title       = self.tag_to_string(a)
        url         = a.get('href', False)
        description = self.tag_to_string(descr)
        date        = self.tag_to_string(dt)
        self.log('title ', title)
        self.log('url ', url)
        self.log('description ', description)
        self.log('date ', date)
        current_articles.append({'title': title, 'url': url, 'description':description, 'date':date})

        return current_articles

    keep_only_tags = [dict(attrs={'class':['boxocchiello2','titoloRub','titologir','catenaccio','sezione','articologirata']}),
                      dict(name='div', attrs={'id':'corpoarticolo'})
                     ]

    remove_tags = [dict(name='div', attrs={'id':'menutop'}),
                   dict(name='div', attrs={'id':'fwnetblocco'}),
                   dict(name='table', attrs={'id':'strumenti'}),
                   dict(name='table', attrs={'id':'imgesterna'}),
                   dict(name='a', attrs={'class':'linkblu'}),
                   dict(name='a', attrs={'class':'link'}),
                   dict(name='span', attrs={'class':['boxocchiello','boxocchiello2','sezione']})
                  ]

    def parse_index(self):
        feeds = []
        for title, url in [(u'Politica', u'http://www.lastampa.it/redazione/cmssezioni/politica/rss_politica.xml'),
                           (u'Torino', u'http://rss.feedsportal.com/c/32418/f/466938/index.rss')
                          ]:
            articles = self.nz_parse_section(url)
            if articles:
                feeds.append((title, articles))
        return feeds
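Two things in this recipe stand out, with the caveat that neither has been run against the live feeds. The class names `entry` and `lastUpdated` look like names from a browser's rendered feed preview, not from the raw RSS, which normally carries `item`, `title`, `link`, `description` and `pubDate` elements. Also, `soup.find` returns only the first match, so at most one article per feed would come back. A minimal sketch of reading every item from raw RSS 2.0 text with the standard library follows; `articles_from_rss` is a hypothetical helper, and how the raw feed text is fetched inside the recipe is left out.

```python
import xml.etree.ElementTree as ET


def articles_from_rss(xml_text):
    """Turn raw RSS 2.0 text into a list of article dicts."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter('item'):
        url = (item.findtext('link') or '').strip()
        if not url:
            continue  # nothing to download without a link
        articles.append({
            'title': (item.findtext('title') or '').strip(),
            'url': url,
            'description': (item.findtext('description') or '').strip(),
            'date': (item.findtext('pubDate') or '').strip(),
        })
    return articles
```

The returned list has the same shape as `current_articles`, so it could be paired with a feed title and appended to `feeds` in parse_index.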