Old 09-11-2010, 06:11 AM   #2686
poloman
Enthusiast
 
Posts: 25
Karma: 10
Join Date: Nov 2008
Device: PRS505, Kindle 3G
Just wanted to say thanks to Tony for the tip on the Slashdot feed - it works perfectly and fits my "clipping to read the full story later" workflow - thanks!

I'm going to use my new knowledge of recipes to improve The Register recipe, as it is all right-aligned - I haven't tackled CSS until now, so this should be fun!
Old 09-11-2010, 08:47 AM   #2687
Starson17
Wizard
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm View Post
Starson17,
If I wanted an if statement that checked whether the parent was <div id='MainContent'>, how would I go about doing it?
Would it be
Code:
mydaddy = item.parent
if mydaddy.name = 'MainContent'
  .......
item and mydaddy are BeautifulSoup Tag objects. Each Tag object has a specified list of properties. mydaddy is the parent of item. The name of mydaddy is 'div'. mydaddy has an attribute called 'id'. The value of that attribute is 'MainContent'.

You can access a tag's attributes by treating the Tag object as though it were a dictionary. Thus you want:
Code:
if mydaddy['id'] == 'MainContent':
  .......
or

Code:
if mydaddy.has_key('id') and mydaddy['id'] == 'MainContent':
  .......
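If you want to sanity-check those lookups outside a recipe, the same properties can be tried with the standalone BeautifulSoup package (this sketch assumes bs4 is installed; the copy bundled with calibre exposes the same name/parent/attribute access for these purposes):

Code:
```python
from bs4 import BeautifulSoup

# A tiny document with the same structure as the page under discussion.
html = '<div id="MainContent"><p>story text</p></div>'
soup = BeautifulSoup(html, 'html.parser')

item = soup.find('p')
mydaddy = item.parent

print(mydaddy.name)        # div
print(mydaddy.get('id'))   # MainContent

# get() avoids a KeyError when the attribute is missing,
# which is what the has_key() guard above is protecting against.
if mydaddy.get('id') == 'MainContent':
    print('parent matched')
```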

Last edited by Starson17; 09-11-2010 at 08:55 AM.
Old 09-11-2010, 01:24 PM   #2688
TonytheBookworm
Addict
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by Starson17 View Post
item and mydaddy are BeautifulSoup Tag objects. Each Tag object has a specified list of properties. mydaddy is the parent of item. The name of mydaddy is 'div'. mydaddy has an attribute called 'id'. The value of that attribute is 'MainContent'.

Thank you for explaining that to me.
Old 09-11-2010, 01:49 PM   #2689
willswords
Junior Member
 
Posts: 3
Karma: 10
Join Date: Sep 2010
Device: Kindle 3
DeseretNews.com recipe help request

Hi there. I'm new to Calibre and was wondering if someone could help me with my recipe for the Deseret News (Salt Lake City, Utah, USA newspaper, http://desnews.com ). I've cobbled something together from what I have seen in other recipes, but I can't get it to use the mobile URL instead of the regular one. The stories come through, but with all the extra stuff I don't want. The mobile versions of the articles look pretty clean, but I must be doing something wrong, because the recipe isn't using the mobile URL for the stories.

Here is what I have so far:

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1284222826(BasicNewsRecipe):
    title          = u'Deseret News mobile'
    __author__ =  'WillsWords'
    description = 'Deseret News selected feeds'
    category = 'news, politics, USA, Utah'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    masthead_url = "http://www.deseretnews.com/media/img/icons/dn-masthead-logo.gif"

    feeds          = [(u'Top News', u'http://www.deseretnews.com/home/index.rss'), (u'Utah', u'http://www.deseretnews.com/utah/index.rss'), (u'Movies', u'http://www.deseretnews.com/movies/index.rss'), (u'LDS Newsline', u'http://www.deseretnews.com/ldsnews/index.rss'), (u'Sports', u'http://www.deseretnews.com/sports/index.rss')]

def print_version(self, url):
        split1 = url.split("/")
        #url1 = split1[0]
        #url2 = split1[1]
        url3 = split1[2]
        url4 = split1[3]
        url5 = split1[4]
        url6 = split1[5]


        #example of link to convert
        #http://www.deseretnews.com/article/700064426/Elizabeth-Smarts-father-joins-bike-ride-to-lobby-for-laws-protecting-children-from-predators.html
        #http://www.deseretnews.com/mobile/article/700064426/Elizabeth-Smarts-father-joins-bike-ride-to-lobby-for-laws-protecting-children-from-predators.html


        print_url = 'http://' + url3 + '/mobile/' + url4 + '/' + url5 + '/' + url6

        return print_url

Last edited by willswords; 09-11-2010 at 07:19 PM.
Old 09-11-2010, 02:46 PM   #2690
TonytheBookworm
Addict
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by willswords View Post
Hi there. I'm new to Calibre and was wondering if someone could help me with my recipe for the Deseret News (Salt Lake City, Utah, USA Newspaper, http://desnews.com ). I've cobbled something together from what I have seen in other recipes, but I can't get it to use the mobile url instead of the regular one. The stories come through, but with all the extra stuff I don't want. The mobile versions of the articles look pretty clean though, but I must be doing something wrong because it isn't using the mobile url for the stories.

here you go...
take note of the comments in the following code:
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1284222826(BasicNewsRecipe):
    title          = u'Deseret News mobile'
    __author__ =  'WillsWords'
    description = 'Deseret News selected feeds'
    category = 'news, politics, USA, Utah'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    #I ADDED KEEP_ONLY_TAGS to only keep the content section on the mobile page
    keep_only_tags     = [dict(name='div', attrs={'id':['content']})]
    #I ADDED REMOVE TAGS TO GET RID OF THE COMMENTS AND THE TOOL BAR AT THE TOP
    remove_tags = [dict(name='div', attrs={'id':['tools','story-comments']})] 
                          
    masthead_url = "http://www.deseretnews.com/media/img/icons/dn-masthead-logo.gif"

    feeds          = [(u'Top News', u'http://www.deseretnews.com/home/index.rss'), (u'Utah', u'http://www.deseretnews.com/utah/index.rss'), (u'Movies', u'http://www.deseretnews.com/movies/index.rss'), (u'LDS Newsline', u'http://www.deseretnews.com/ldsnews/index.rss'), (u'Sports', u'http://www.deseretnews.com/sports/index.rss')]

    #I FIXED YOUR INDENT. It was all the way to the left; it has to be within the class, so align it with the indent
    #of title, remove_javascript, etc.
    
    def print_version(self, url):
        split1 = url.split("/")
        url3 = split1[2]
        url4 = split1[3]
        url5 = split1[4]
        url6 = split1[5]


        #example of link to convert
        #http://www.deseretnews.com/article/700064426/Elizabeth-Smarts-father-joins-bike-ride-to-lobby-for-laws-protecting-children-from-predators.html
        #http://www.deseretnews.com/mobile/article/700064426/Elizabeth-Smarts-father-joins-bike-ride-to-lobby-for-laws-protecting-children-from-predators.html


        print_url = 'http://' + url3 + '/mobile/' + url4 + '/' + url5 + '/' + url6
        #I ADDED THE FOLLOWING TO SHOW YOU IN THE LOG FILE WHAT THE ACTUAL PRINT URL IS.  Once you see it showing
        #the correct url you should be good to go, other than cleaning up a few tags with keep_only_tags and remove_tags
        print 'THIS URL WILL PRINT: ', print_url
        return print_url
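If you want to check the rewrite outside calibre, the same split logic runs in plain Python (the slug below is shortened from the example link above for readability):

Code:
```python
# Standalone sketch of the same URL rewrite, runnable outside calibre.
def to_mobile(url):
    parts = url.split('/')   # ['http:', '', host, section, article_id, slug]
    return 'http://' + parts[2] + '/mobile/' + '/'.join(parts[3:6])

src = 'http://www.deseretnews.com/article/700064426/Elizabeth-Smarts-father.html'
print(to_mobile(src))
# http://www.deseretnews.com/mobile/article/700064426/Elizabeth-Smarts-father.html
```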
Old 09-11-2010, 07:11 PM   #2691
willswords
Junior Member
 
Posts: 3
Karma: 10
Join Date: Sep 2010
Device: Kindle 3
Awesome! Thanks for your help. This is exactly what I wanted. I really need to learn python so I can understand this better.
You mentioned in the comments that there is a log file... Where does Calibre save that? I poked around but couldn't find one.

Last edited by willswords; 09-11-2010 at 07:20 PM.
Old 09-11-2010, 07:35 PM   #2692
TonytheBookworm
Addict
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by willswords View Post
Awesome! Thanks for your help. This is exactly what I wanted. I really need to learn python so I can understand this better.
You mentioned in the comments that there is a log file... Where does Calibre save that? I poked around but couldn't find one.
I was speaking about the output file that is generated when you test your recipes at the command prompt.

ebook-convert recipename.recipe output_dir --test -vv > myrecipe.txt

For example, if your recipe is called billybob.recipe, open the command prompt in Windows (cmd) and change to the calibre2 directory, which in Windows is generally under C:\Program Files.
Then type:
ebook-convert billybob.recipe output_dir --test -vv > myrecipe.txt

Once the run finishes, you can open the myrecipe.txt file in WordPad, Notepad, or whatever editor you like, and look for errors or printed text to see whether the code is doing what you want it to do.

--test will pull just the first 2 articles for you.
Old 09-11-2010, 10:14 PM   #2693
somedayson
Member
 
Posts: 13
Karma: 10
Join Date: Sep 2010
Device: K3
Wanted to thank Starson17 and TonytheBookworm for all their help... I still don't fully understand all that I'm doing, but because of your help and the 180 pages on this board, I've got some awesome stuff on my Kindle each day. Thanks to all!
Old 09-11-2010, 10:24 PM   #2694
cynvision
Member
 
Posts: 14
Karma: 10
Join Date: Sep 2010
Device: nook
Okay. I've joined the club of using .bat files to run the command line, so all is good there. But! I'm working on a little thing that grabs the update from the National Hurricane Center, and I thought it was so simple that I'm working on it in Calibre's window. I know from this thread that there's the complicated way of using parse_index or get_links to filter a set of links to follow, but I really wanted to filter with the built-in methods, and I must be doing them all wrong because they don't work on my computer.

I just want it to ignore the links ending in '.zip' and '.kmz'. Where am I going wrong with raw regular expressions? I've tried with and without "\", with and without "r", and I'm going buggy. (I know the text link says .shp, but they're .zip packages that are linked. Right now they or the .kmz are being read into the e-book as text.)

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re

class NationalHurricaneCenter(BasicNewsRecipe):
    title          = u'National Hurricane Center (Atlantic)'
    oldest_article = 1
    max_articles_per_feed = 15

    feeds          = [(u'Atlantic Basin Tropical Advisories', u'http://www.nhc.noaa.gov/index-at.xml'),
                          (u'Flight Plan Of The Day', u'http://www.nhc.noaa.gov/xml/REPRPD.xml')]
    no_stylesheets = True
    remove_javascript     = True
    use_embedded_content  = False
    remove_attributes  = ['width','height']
    encoding = 'utf-8'
    masthead_url          = 'http://www.nhc.noaa.gov/gifs/xml_logo_nhc.gif'

    conversion_options = {
                   'linearize_tables' : True,
                    }
#this needs work, it's not avoiding the url's .zip or .kmz
#    filter_regexps =  [r'\.kmz'] 
#    preprocess_regexps     = [(re.compile(r'\.kmz', re.DOTALL), lambda m: '')]
    match_regexps = [r'\.shtml']


#    remove_tags_before  = dict(name='h2')
#    remove_tags_after  = dict(name='pre')

    keep_only_tags = [dict(name='h2'), dict(name='pre')]

#    remove_tags = [
#                     dict(name='div' , attrs={'class':'topbanner_780' }),
#                     dict(name='div' , attrs={'class':'navbkgrnd' }),
#                    ]
#http://www.nhc.noaa.gov/text/WTUS84-KBRO.shtml
#http://www.nhc.noaa.gov/text/WTUS84-KBRO.shtml?text

#http://www.nhc.noaa.gov/text/refresh/SJUTCPAT1+shtml/111458.shtml
#http://www.nhc.noaa.gov/text/refresh/SJUTCPAT1+shtml/111458.shtml?text
    def print_version(self, url):
        return url + '?text'
Old 09-11-2010, 10:33 PM   #2695
Starson17
Wizard
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by cynvision View Post
I just want it to ignore the links ending in '.zip' and '.kmz' Where am I going wrong with raw reg expressions?
AFAIK, filter_regexps applies to links in the article, not links in the feed.
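As far as the pattern itself goes, one expression covers both extensions. This sketch only demonstrates the regex against a list of strings (the links below are made up for illustration); whether it fires in a recipe still depends on where filter_regexps is applied, per the point above:

Code:
```python
import re

# One pattern for both extensions; filter_regexps entries are plain
# pattern strings matched against each candidate link.
skip = re.compile(r'\.(?:zip|kmz)$', re.IGNORECASE)

links = [
    'http://www.nhc.noaa.gov/gis/al052010.zip',
    'http://www.nhc.noaa.gov/gis/al052010.kmz',
    'http://www.nhc.noaa.gov/text/WTUS84-KBRO.shtml',
]
kept = [u for u in links if not skip.search(u)]
print(kept)   # only the .shtml link survives
```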
Old 09-11-2010, 11:31 PM   #2696
cynvision
Member
 
Posts: 14
Karma: 10
Join Date: Sep 2010
Device: nook
Quote:
Originally Posted by somedayson View Post
Further down the list would be trying to find out how to get this column every week:
http://www.aspentimes.com/article/20...ntprofile=1061
thanks for any help anyone can provide....just an awesome program!
Grateful,
Matt
I took a look at this one and looked for a way to get the RSS of the weekly archive... but I don't see one. Maybe it's members only?
Old 09-12-2010, 12:47 AM   #2697
TonytheBookworm
Addict
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by cynvision View Post
I took a look at this one and if there was a way to get the RSS of the weekly archive... but I don't see one. Maybe it's members only?
I didn't see an RSS feed for it, so I just parsed the links.
I'm not certain the link will list the articles the same way each time, as I have no way of testing that. But the following code uses the link provided by the original poster, parses the links on that page, and looks for Alison Berkley in the link text. If it finds a match, that link is used and converted in print_version to the print-friendly version...
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class FIELDSTREAM(BasicNewsRecipe):
    title      = 'Alison Berkley Column'
    __author__ = 'Tonythebookworm'
    description = 'Some dudes column'
    language = 'en'
    publisher           = 'Tonythebookworm'
    category            = 'column'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 10
    INDEX = 'http://www.aspentimes.com'
    
    
    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Alison Berkley", u"http://www.aspentimes.com/SECTION/&Profile=1021&ParentProfile=1061"),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        for item in soup.findAll('div',attrs={'class':'title'}):
            print 'item is: ', item
            link = item.find('a')
            print 'the link is: ', link
            titlecheck = self.tag_to_string(link)
            #once we get a link we need to check to see if it contains Alison Berkley and if it does use it
            if link.find(text=re.compile('Alison Berkley')) :
                print 'FOUND TITLE AND IT IS : ', titlecheck
            
                url         = self.INDEX + link['href']
                title       = self.tag_to_string(link)
                print 'the title is: ', title
                print 'the url is: ', url
                current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
        return current_articles
        
        
    def print_version(self, url):
        split1 = url.split("article")
        print 'THE SPLIT IS: ', split1 
        #original is: http://www.aspentimes.com/article/20100909/COLUMN/100909869/1021&parentprofile=1061
        #need this to be print_url:
        #http://www.aspentimes.com/apps/pbcs.dll/article?AID=/20100909/COLUMN/100909869/1021&parentprofile=1061&template=printart         
         
        print_url = 'http://www.aspentimes.com/apps/pbcs.dll/article?AID=' + split1[1] + '&template=printart'
        print 'THIS URL WILL PRINT: ', print_url # this is a test string to see what the url is it will return
        return print_url
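The AID rewrite is easy to check on its own in plain Python, using the example URL from the comments:

Code:
```python
# Standalone sketch of the print_version rewrite above.
def to_print(url):
    tail = url.split('article', 1)[1]
    return ('http://www.aspentimes.com/apps/pbcs.dll/article?AID='
            + tail + '&template=printart')

src = 'http://www.aspentimes.com/article/20100909/COLUMN/100909869/1021&parentprofile=1061'
print(to_print(src))
```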
Old 09-12-2010, 01:51 AM   #2698
cynvision
Member
 
Posts: 14
Karma: 10
Join Date: Sep 2010
Device: nook
Quote:
Originally Posted by Lukas238 View Post
I was trying to create a recipe for http://www.heavens-above.com, excellent Astronomy website where you can see the predictions of satellite positions, as will be seen from your city.

The problem is that the site offers no feeds.

Nonetheless, I could create something pretty close. I could download some content, but it appears as HTML code, not as text.

This is the recipe as far as it went. The user login is not required to access any of the pages, but if you log in, all pages display the information as seen from your city.
Thanks for trying this one out. I've been missing this since AvantGo packed up and went out of business. One thing that better brains on Calibre might have input on is how to capture the "Session" variable after login and use it in the URLs. I'm thinking using an old value in the feed probably gives their server a complex.
Now, cleaning up what comes back is a bit daunting. If my results match yours, all I'm getting is a text string of the table served back to Calibre, and that's what's in the page for each feed.
I wanted this to work so badly that I spent over an hour and didn't get very far.
Attached Files
File Type: txt heavensAboverecipe.txt (3.1 KB, 304 views)

Last edited by cynvision; 09-12-2010 at 01:54 AM. Reason: didn't attach file properly
Old 09-12-2010, 02:03 AM   #2699
cynvision
Member
 
Posts: 14
Karma: 10
Join Date: Sep 2010
Device: nook
Quote:
Originally Posted by TonytheBookworm View Post
I didn't see an rss for it so I just parsed the links.
Ah yes. I'm still not comfortable with how the multiple page link following works. You'd have to follow the 'more articles' link at least once to get more than one article from that author.
Old 09-12-2010, 02:49 AM   #2700
TonytheBookworm
Addict
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by cynvision View Post
Ah yes. I'm still not comfortable with how the multiple page link following works. You'd have to follow the 'more articles' link at least once to get more than one article from that author.
I'll work on it tomorrow after the football games go off. I think I know how to solve it. Look at the Down to Earth recipe I posted; I think the same idea would work for this.
Basically, parse the page for hrefs and put them in the soup. Then take that soup, parse it again, and search using the re.compile I already have in this. If you can't figure it out, I'll see what I can come up with tomorrow. But it is 3 am where I am and I'm tired zzzzzzzz
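A rough sketch of that two-pass idea, with inline HTML strings standing in for the fetched pages (a real recipe would fetch each page with index_to_soup, and the class names here are made up; this assumes the standalone bs4 package):

Code:
```python
from bs4 import BeautifulSoup

# Collect article links from a page, then follow its 'more articles'
# link and collect again.  Inline strings stand in for fetched pages.
pages = {
    'page1': ('<div class="title"><a href="/article/1">Alison Berkley: one</a></div>'
              '<a class="more" href="page2">more articles</a>'),
    'page2': '<div class="title"><a href="/article/2">Alison Berkley: two</a></div>',
}

def collect(name, found=None):
    found = found if found is not None else []
    soup = BeautifulSoup(pages[name], 'html.parser')
    for div in soup.findAll('div', attrs={'class': 'title'}):
        found.append(div.find('a')['href'])
    more = soup.find('a', attrs={'class': 'more'})
    if more is not None:
        collect(more['href'], found)   # follow the 'more articles' link
    return found

print(collect('page1'))
```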