Old 09-11-2010, 06:11 AM   #2686
poloman
Enthusiast
 
Posts: 25
Karma: 10
Join Date: Nov 2008
Device: PRS505, Kindle 3G
Just wanted to say thanks to Tony for the tip on the Slashdot feed - it works perfectly and fits my "clipping to read the full story later" workflow - thanks!

I'm going to use my new knowledge of recipes to improve The Register recipe, as it is all right-aligned - I haven't tackled CSS until now, so this should be fun!
Old 09-11-2010, 08:47 AM   #2687
Starson17
Wizard
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm View Post
Starson17,
If I wanted an if statement that checked whether the parent was <div id='MainContent'>, how would I go about doing it?
Would it be
Code:
mydaddy = item.parent
if mydaddy.name = 'MainContent'
  .......
item and mydaddy are BeautifulSoup Tag objects. Each Tag object has a specified list of properties. mydaddy is the parent of item. The name of mydaddy is 'div'. mydaddy has an attribute called 'id'. The value of that attribute is 'MainContent'.

You can access a tag's attributes by treating the Tag object as though it were a dictionary. Thus you want:
Code:
if mydaddy['id'] == 'MainContent':
  .......
or

Code:
if mydaddy.has_key('id') and mydaddy['id'] == 'MainContent':
  .......
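If you want to sanity-check those lookups outside a recipe, the same properties can be tried with the standalone BeautifulSoup package (this sketch assumes bs4 is installed; the copy bundled with calibre exposes the same name/parent/attribute access for these purposes):

Code:
```python
from bs4 import BeautifulSoup

# A tiny document with the same structure as the page under discussion.
html = '<div id="MainContent"><p>story text</p></div>'
soup = BeautifulSoup(html, 'html.parser')

item = soup.find('p')
mydaddy = item.parent

print(mydaddy.name)        # div
print(mydaddy.get('id'))   # MainContent

# get() avoids a KeyError when the attribute is missing,
# which is what the has_key() guard above is protecting against.
if mydaddy.get('id') == 'MainContent':
    print('parent matched')
```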

Last edited by Starson17; 09-11-2010 at 08:55 AM.
Old 09-11-2010, 01:24 PM   #2688
TonytheBookworm
Addict
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by Starson17 View Post
item and mydaddy are BeautifulSoup Tag objects. Each Tag object has a specified list of properties. mydaddy is the parent of item. The name of mydaddy is 'div'. mydaddy has an attribute called 'id'. The value of that attribute is 'MainContent'.

Thank you for explaining that to me.
Old 09-11-2010, 01:49 PM   #2689
willswords
Junior Member
 
Posts: 3
Karma: 10
Join Date: Sep 2010
Device: Kindle 3
DeseretNews.com recipe help request

Hi there. I'm new to Calibre and was wondering if someone could help me with my recipe for the Deseret News (Salt Lake City, Utah, USA newspaper, http://desnews.com ). I've cobbled something together from what I have seen in other recipes, but I can't get it to use the mobile URL instead of the regular one. The stories come through, but with all the extra stuff I don't want. The mobile versions of the articles look pretty clean, but I must be doing something wrong, because the recipe isn't using the mobile URL for the stories.

Here is what I have so far:

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1284222826(BasicNewsRecipe):
    title          = u'Deseret News mobile'
    __author__ =  'WillsWords'
    description = 'Deseret News selected feeds'
    category = 'news, politics, USA, Utah'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    masthead_url = "http://www.deseretnews.com/media/img/icons/dn-masthead-logo.gif"

    feeds          = [(u'Top News', u'http://www.deseretnews.com/home/index.rss'), (u'Utah', u'http://www.deseretnews.com/utah/index.rss'), (u'Movies', u'http://www.deseretnews.com/movies/index.rss'), (u'LDS Newsline', u'http://www.deseretnews.com/ldsnews/index.rss'), (u'Sports', u'http://www.deseretnews.com/sports/index.rss')]

def print_version(self, url):
        split1 = url.split("/")
        #url1 = split1[0]
        #url2 = split1[1]
        url3 = split1[2]
        url4 = split1[3]
        url5 = split1[4]
        url6 = split1[5]


        #example of link to convert
        #http://www.deseretnews.com/article/700064426/Elizabeth-Smarts-father-joins-bike-ride-to-lobby-for-laws-protecting-children-from-predators.html
        #http://www.deseretnews.com/mobile/article/700064426/Elizabeth-Smarts-father-joins-bike-ride-to-lobby-for-laws-protecting-children-from-predators.html


        print_url = 'http://' + url3 + '/mobile/' + url4 + '/' + url5 + '/' + url6

        return print_url

Last edited by willswords; 09-11-2010 at 07:19 PM.
Old 09-11-2010, 02:46 PM   #2690
TonytheBookworm
Addict
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by willswords View Post
Hi there. I'm new to Calibre and was wondering if someone could help me with my recipe for the Deseret News (Salt Lake City, Utah, USA Newspaper, http://desnews.com ). I've cobbled something together from what I have seen in other recipes, but I can't get it to use the mobile url instead of the regular one. The stories come through, but with all the extra stuff I don't want. The mobile versions of the articles look pretty clean though, but I must be doing something wrong because it isn't using the mobile url for the stories.

here you go...
take note of the comments in the following code:
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1284222826(BasicNewsRecipe):
    title          = u'Deseret News mobile'
    __author__ =  'WillsWords'
    description = 'Deseret News selected feeds'
    category = 'news, politics, USA, Utah'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    #I ADDED KEEP_ONLY_TAGS to only keep the content section on the mobile page
    keep_only_tags     = [dict(name='div', attrs={'id':['content']})]
    #I ADDED REMOVE TAGS TO GET RID OF THE COMMENTS AND THE TOOL BAR AT THE TOP
    remove_tags = [dict(name='div', attrs={'id':['tools','story-comments']})] 
                          
    masthead_url = "http://www.deseretnews.com/media/img/icons/dn-masthead-logo.gif"

    feeds          = [(u'Top News', u'http://www.deseretnews.com/home/index.rss'), (u'Utah', u'http://www.deseretnews.com/utah/index.rss'), (u'Movies', u'http://www.deseretnews.com/movies/index.rss'), (u'LDS Newsline', u'http://www.deseretnews.com/ldsnews/index.rss'), (u'Sports', u'http://www.deseretnews.com/sports/index.rss')]

    #I FIXED YOUR INDENT. It was all the way to the left; it has to be within the class, so align it with the indent
    #of title, remove_javascript, etc.
    
    def print_version(self, url):
        split1 = url.split("/")
        url3 = split1[2]
        url4 = split1[3]
        url5 = split1[4]
        url6 = split1[5]


        #example of link to convert
        #http://www.deseretnews.com/article/700064426/Elizabeth-Smarts-father-joins-bike-ride-to-lobby-for-laws-protecting-children-from-predators.html
        #http://www.deseretnews.com/mobile/article/700064426/Elizabeth-Smarts-father-joins-bike-ride-to-lobby-for-laws-protecting-children-from-predators.html


        print_url = 'http://' + url3 + '/mobile/' + url4 + '/' + url5 + '/' + url6
        #I ADDED THE FOLLOWING TO SHOW YOU IN THE LOG FILE WHAT THE ACTUAL PRINT URL IS.  Once you see it showing
        #the correct url you should be good to go, other than cleaning up a few tags with keep_only_tags and remove_tags
        print 'THIS URL WILL PRINT: ', print_url
        return print_url
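If you want to check the rewrite outside calibre, the same split logic runs in plain Python (the slug below is shortened from the example link above for readability):

Code:
```python
# Standalone sketch of the same URL rewrite, runnable outside calibre.
def to_mobile(url):
    parts = url.split('/')   # ['http:', '', host, section, article_id, slug]
    return 'http://' + parts[2] + '/mobile/' + '/'.join(parts[3:6])

src = 'http://www.deseretnews.com/article/700064426/Elizabeth-Smarts-father.html'
print(to_mobile(src))
# http://www.deseretnews.com/mobile/article/700064426/Elizabeth-Smarts-father.html
```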
Old 09-11-2010, 07:11 PM   #2691
willswords
Junior Member
 
Posts: 3
Karma: 10
Join Date: Sep 2010
Device: Kindle 3
Awesome! Thanks for your help. This is exactly what I wanted. I really need to learn python so I can understand this better.
You mentioned in the comments that there is a log file... Where does Calibre save that? I poked around but couldn't find one.

Last edited by willswords; 09-11-2010 at 07:20 PM.
Old 09-11-2010, 07:35 PM   #2692
TonytheBookworm
Addict
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by willswords View Post
Awesome! Thanks for your help. This is exactly what I wanted. I really need to learn python so I can understand this better.
You mentioned in the comments that there is a log file... Where does Calibre save that? I poked around but couldn't find one.
I was speaking about the output file that is generated when you test your recipes at the command prompt.

ebook-convert recipename.recipe output_dir --test -vv > myrecipe.txt

For example, if your recipe is called billybob.recipe, open the command prompt in Windows (cmd) and change to the calibre2 directory, which in Windows is generally under C:\Program Files.
Then type:
ebook-convert billybob.recipe output_dir --test -vv > myrecipe.txt

Once the run finishes, you can open the myrecipe.txt file in WordPad, Notepad, or whatever editor you like, and look for errors or printed text to see whether the code is doing what you want it to do.

--test will pull just the first 2 articles for you.
Old 09-11-2010, 10:14 PM   #2693
somedayson
Member
 
Posts: 13
Karma: 10
Join Date: Sep 2010
Device: K3
Wanted to thank Starson17 and TonytheBookworm for all their help... I still don't fully understand all that I'm doing, but because of your help and the 180 pages on this board, I've got some awesome stuff on my Kindle each day. Thanks to all!
Old 09-11-2010, 10:24 PM   #2694
cynvision
Member
 
Posts: 14
Karma: 10
Join Date: Sep 2010
Device: nook
Okay. I've joined the club of using .bat files to run the command line, so all is good there. But! I'm working on a little thing that grabs the update from the National Hurricane Center, and I thought it was so simple that I'm working on it in Calibre's window. I know from this thread that there's the complicated way of using parse_index or get_links to filter a set of links to follow, but I really wanted to filter with the built-in methods, and I must be doing them all wrong because they don't work on my computer.

I just want it to ignore the links ending in '.zip' and '.kmz'. Where am I going wrong with raw regular expressions? I've tried with and without "\", with and without "r", and I'm going buggy. (I know the text link says .shp, but they're .zip packages that are linked. Right now they or the .kmz are being read into the e-book as text.)

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re

class NationalHurricaneCenter(BasicNewsRecipe):
    title          = u'National Hurricane Center (Atlantic)'
    oldest_article = 1
    max_articles_per_feed = 15

    feeds          = [(u'Atlantic Basin Tropical Advisories', u'http://www.nhc.noaa.gov/index-at.xml'),
                          (u'Flight Plan Of The Day', u'http://www.nhc.noaa.gov/xml/REPRPD.xml')]
    no_stylesheets = True
    remove_javascript     = True
    use_embedded_content  = False
    remove_attributes  = ['width','height']
    encoding = 'utf-8'
    masthead_url          = 'http://www.nhc.noaa.gov/gifs/xml_logo_nhc.gif'

    conversion_options = {
                   'linearize_tables' : True,
                    }
#this needs work, it's not avoiding the url's .zip or .kmz
#    filter_regexps =  [r'\.kmz'] 
#    preprocess_regexps     = [(re.compile(r'\.kmz', re.DOTALL), lambda m: '')]
    match_regexps = [r'\.shtml']


#    remove_tags_before  = dict(name='h2')
#    remove_tags_after  = dict(name='pre')

    keep_only_tags = [dict(name='h2'), dict(name='pre')]

#    remove_tags = [
#                     dict(name='div' , attrs={'class':'topbanner_780' }),
#                     dict(name='div' , attrs={'class':'navbkgrnd' }),
#                    ]
#http://www.nhc.noaa.gov/text/WTUS84-KBRO.shtml
#http://www.nhc.noaa.gov/text/WTUS84-KBRO.shtml?text

#http://www.nhc.noaa.gov/text/refresh/SJUTCPAT1+shtml/111458.shtml
#http://www.nhc.noaa.gov/text/refresh/SJUTCPAT1+shtml/111458.shtml?text
    def print_version(self, url):
        return url + '?text'
Old 09-11-2010, 10:33 PM   #2695
Starson17
Wizard
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by cynvision View Post
I just want it to ignore the links ending in '.zip' and '.kmz' Where am I going wrong with raw reg expressions?
AFAIK, filter_regexps applies to links in the article, not links in the feed.
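As far as the pattern itself goes, one expression covers both extensions. This sketch only demonstrates the regex against a list of strings (the links below are made up for illustration); whether it fires in a recipe still depends on where filter_regexps is applied, per the point above:

Code:
```python
import re

# One pattern for both extensions; filter_regexps entries are plain
# pattern strings matched against each candidate link.
skip = re.compile(r'\.(?:zip|kmz)$', re.IGNORECASE)

links = [
    'http://www.nhc.noaa.gov/gis/al052010.zip',
    'http://www.nhc.noaa.gov/gis/al052010.kmz',
    'http://www.nhc.noaa.gov/text/WTUS84-KBRO.shtml',
]
kept = [u for u in links if not skip.search(u)]
print(kept)   # only the .shtml link survives
```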
Old 09-11-2010, 11:31 PM   #2696
cynvision
Member
 
Posts: 14
Karma: 10
Join Date: Sep 2010
Device: nook
Quote:
Originally Posted by somedayson View Post
Further down the list would be trying to find out how to get this column every week:
http://www.aspentimes.com/article/20...ntprofile=1061
thanks for any help anyone can provide....just an awesome program!
Grateful,
Matt
I took a look at this one and looked for a way to get the RSS of the weekly archive... but I don't see one. Maybe it's members only?
Old 09-12-2010, 12:47 AM   #2697
TonytheBookworm
Addict
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by cynvision View Post
I took a look at this one and if there was a way to get the RSS of the weekly archive... but I don't see one. Maybe it's members only?
I didn't see an RSS feed for it, so I just parsed the links.
I'm not certain the link will list the articles the same way each time, as I have no way of testing that. But the following code uses the link provided by the original poster, parses the links on that page, and looks for Alison Berkley in the link text. If it finds a match, that link is used and converted in print_version to the print-friendly version...
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class FIELDSTREAM(BasicNewsRecipe):
    title      = 'Alison Berkley Column'
    __author__ = 'Tonythebookworm'
    description = 'Some dudes column'
    language = 'en'
    publisher           = 'Tonythebookworm'
    category            = 'column'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 10
    INDEX = 'http://www.aspentimes.com'
    
    
    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Alison Berkley", u"http://www.aspentimes.com/SECTION/&Profile=1021&ParentProfile=1061"),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        for item in soup.findAll('div',attrs={'class':'title'}):
            print 'item is: ', item
            link = item.find('a')
            print 'the link is: ', link
            titlecheck = self.tag_to_string(link)
            #once we get a link we need to check to see if it contains Alison Berkley and if it does use it
            if link.find(text=re.compile('Alison Berkley')) :
                print 'FOUND TITLE AND IT IS : ', titlecheck
            
                url         = self.INDEX + link['href']
                title       = self.tag_to_string(link)
                print 'the title is: ', title
                print 'the url is: ', url
                current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
        return current_articles
        
        
    def print_version(self, url):
        split1 = url.split("article")
        print 'THE SPLIT IS: ', split1 
        #original is: http://www.aspentimes.com/article/20100909/COLUMN/100909869/1021&parentprofile=1061
        #need this to be print_url:
        #http://www.aspentimes.com/apps/pbcs.dll/article?AID=/20100909/COLUMN/100909869/1021&parentprofile=1061&template=printart         
         
        print_url = 'http://www.aspentimes.com/apps/pbcs.dll/article?AID=' + split1[1] + '&template=printart'
        print 'THIS URL WILL PRINT: ', print_url # this is a test string to see what the url is it will return
        return print_url
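The AID rewrite is easy to check on its own in plain Python, using the example URL from the comments:

Code:
```python
# Standalone sketch of the print_version rewrite above.
def to_print(url):
    tail = url.split('article', 1)[1]
    return ('http://www.aspentimes.com/apps/pbcs.dll/article?AID='
            + tail + '&template=printart')

src = 'http://www.aspentimes.com/article/20100909/COLUMN/100909869/1021&parentprofile=1061'
print(to_print(src))
```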
Old 09-12-2010, 01:51 AM   #2698
cynvision
Member
 
Posts: 14
Karma: 10
Join Date: Sep 2010
Device: nook
Quote:
Originally Posted by Lukas238 View Post
I was trying to create a recipe for http://www.heavens-above.com, excellent Astronomy website where you can see the predictions of satellite positions, as will be seen from your city.

The problem is that the site offers no feeds.

Nonetheless, I could create something pretty close. I could download some content, but it appears as HTML code, not as text.

This is the recipe as far as it went. The user login is not required to access any of the pages, but if you log in, all pages display the information as seen from your city.
Thanks for trying this one out. I've been missing this since AvantGo packed up and went out of business. One thing that better brains on Calibre might have input on is how to capture the "Session" variable after login and use it in the URLs. I'm thinking using an old value in the feed probably gives their server a complex.
Now, cleaning up what comes back is a bit daunting. If my results match yours, all I'm getting is a text string of the table served back to Calibre, and that's what's in the page for each feed.
I wanted this to work so badly that I spent over an hour and didn't get very far.
Attached Files
File Type: txt heavensAboverecipe.txt (3.1 KB, 304 views)

Last edited by cynvision; 09-12-2010 at 01:54 AM. Reason: didn't attach file properly
Old 09-12-2010, 02:03 AM   #2699
cynvision
Member
 
Posts: 14
Karma: 10
Join Date: Sep 2010
Device: nook
Quote:
Originally Posted by TonytheBookworm View Post
I didn't see an rss for it so I just parsed the links.
Ah yes. I'm still not comfortable with how the multiple page link following works. You'd have to follow the 'more articles' link at least once to get more than one article from that author.
Old 09-12-2010, 02:49 AM   #2700
TonytheBookworm
Addict
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by cynvision View Post
Ah yes. I'm still not comfortable with how the multiple page link following works. You'd have to follow the 'more articles' link at least once to get more than one article from that author.
I'll work on it tomorrow after the football games go off. I think I know how to solve it. Look at the Down to Earth recipe I posted; I think the same idea would work for this.
Basically, parse the page for hrefs and put them in the soup. Then take that soup, parse it again, and search using the re.compile I already have in this. If you can't figure it out, I'll see what I can come up with tomorrow. But it is 3 am where I am and I'm tired zzzzzzzz
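A rough sketch of that two-pass idea, with inline HTML strings standing in for the fetched pages (a real recipe would fetch each page with index_to_soup, and the class names here are made up; this assumes the standalone bs4 package):

Code:
```python
from bs4 import BeautifulSoup

# Collect article links from a page, then follow its 'more articles'
# link and collect again.  Inline strings stand in for fetched pages.
pages = {
    'page1': ('<div class="title"><a href="/article/1">Alison Berkley: one</a></div>'
              '<a class="more" href="page2">more articles</a>'),
    'page2': '<div class="title"><a href="/article/2">Alison Berkley: two</a></div>',
}

def collect(name, found=None):
    found = found if found is not None else []
    soup = BeautifulSoup(pages[name], 'html.parser')
    for div in soup.findAll('div', attrs={'class': 'title'}):
        found.append(div.find('a')['href'])
    more = soup.find('a', attrs={'class': 'more'})
    if more is not None:
        collect(more['href'], found)   # follow the 'more articles' link
    return found

print(collect('page1'))
```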