Custom recipes (archive, read-only) - Page 189

Starson17 · 09-22-2010, 03:05 PM

Quote:

Originally Posted by Thetasquared

It worked! downloaded the issue and it looks fine in Calibre (not near my ereader right now). I know next to nothing about coding, so I don't know what to clean up. Only thing I noticed is that the words "Fire & Ice" are in the code. Does that mean if the current issue changes it will still download the "fire & ice" issue?

I love Science News too! Thank you for your time and effort in making this recipe work. It will add tons of enjoyment to my life! umm.. and knowledge. will increase my knowledge of science stuff :-)

You can delete the Fire & Ice line that starts with #. It's a comment to remind me of the structure of the links that need to be followed recursively. Next I'll look it over closely, then submit a final version to Kovid. I may want to do a little cleanup, and I need to credit the original author(s) of the previous ScienceNews recipe. I expected to have to write it from scratch, but I was able to reuse most of the earlier recipe.

TonytheBookworm · 09-22-2010, 04:36 PM

Quote:

Originally Posted by Starson17

1) It has class="cd_mainarticle", not class="cdmainarticle",
2) It has inline style on your header. Strip that first:

Thanks. I was looking at the page after it had run through the recipe (DUH, haha), which is why I had cdmainarticle. I went back and looked at the original page, and there it is, clear as day, with the underscore.

thanks again
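A quick way to avoid this kind of mix-up is to list the class names on the original page before writing keep_only_tags. The sketch below uses only the Python standard library rather than calibre's BeautifulSoup, and the sample markup is made up for illustration; only the class name cd_mainarticle comes from the posts above.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class name that appears on any start tag."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'class' and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def classes_in(html):
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


# Made-up stand-in for the original (unprocessed) page source.
original_page = '<div class="cd_mainarticle"><h1 style="color:red">Title</h1></div>'
found = classes_in(original_page)
print(sorted(found))
```

Run it against the page as served by the site, not against the output of the recipe, since that is where the two spellings diverged.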

krunk · 09-22-2010, 04:38 PM

Quote:

Originally Posted by Starson17

AFAIK, it appears in stylesheet.css.

They're not appearing in the stylesheet.css either (the recursive grep would catch it), but I'll do a deeper inspection there.

Quote:

Originally Posted by Starson17

You need indents in the extra_css, just like elsewhere, or they'll get ignored.

Is this a quirk of the calibre library? It's not a Python syntax rule, and a Python syntax error would throw an exception.

Code:

>>> class A(object):
...     foo = """
... This is a docstring. No 
... indents necessary within the block.
... """
... 
>>> a = A()
>>> a.foo
'\nThis is a docstring. No \nindents necessary within the block.\n'
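To make the point above concrete, here is a small sketch (not calibre code) showing that Python stores the body of a triple-quoted string verbatim, so the only indentation Python itself requires is on the assignment line inside the class. The class names and CSS rules are made up for illustration. Whether calibre treats the two strings differently is not something this sketch can show.

```python
import textwrap


class IndentedCss(object):
    # Assignment and string body both indented.
    extra_css = """
        h1 { font-size: x-large; }
        .fact { padding-top: 10pt; }
    """


class FlushLeftCss(object):
    # Assignment indented, string body flush left: still valid Python.
    extra_css = """
h1 { font-size: x-large; }
.fact { padding-top: 10pt; }
"""


def normalized(css):
    """Strip common leading whitespace and surrounding blank lines."""
    return textwrap.dedent(css).strip()


same_rules = normalized(IndentedCss.extra_css) == normalized(FlushLeftCss.extra_css)
print(same_rules)
```

Both classes compile, and once the leading whitespace is normalized the two strings hold the same rules.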

Starson17 · 09-22-2010, 04:46 PM

Quote:

Originally Posted by krunk

They're not appearing in the stylesheet.css either (the recursive grep would catch it), but I'll do a deeper inspection there.

Yes, the grep should have caught it if it was there. I'd make sure you strip out the stylesheet and the internal style attributes. That's usually the problem with extra_css not showing up. Examine the page of interest, see if it has internal styles, and if so, try:

Code:

    def preprocess_html(self, soup):
        for item in soup.findAll(attrs={'style':True}):
            del item['style']
        return soup
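For anyone who wants to see the effect of that hook outside calibre, here is a rough stand-in that does the same thing with the standard library's ElementTree instead of BeautifulSoup. The sample markup is made up and has to be well-formed XML for this parser, which real pages often are not, so treat it as an illustration only.

```python
import xml.etree.ElementTree as ET


def strip_inline_styles(root):
    """Remove the style attribute from every element in the tree."""
    for element in root.iter():
        element.attrib.pop('style', None)
    return root


# Made-up, well-formed sample page fragment.
page = ET.fromstring(
    '<div class="cd_mainarticle" style="color:red">'
    '<h1 style="font-size:40px">Headline</h1><p>Body text</p></div>'
)
cleaned = ET.tostring(strip_inline_styles(page), encoding='unicode')
print(cleaned)
```

With the inline styles gone, rules supplied through extra_css no longer have to compete with per-element style attributes.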

TonytheBookworm · 09-22-2010, 05:12 PM

Quote:

Originally Posted by Starson17

Here's a quick and dirty version. Why don't you look it over and spot what needs to get cleaned up better. Post here and I'll address it. I really like Science News.

Cool, I just learned something from you. The match_regex is great. I would have done that with make_links() like you showed me in the past, but I saw the match_regex and wondered what the heck it does. Then I saw that it looks at the page, finds those links, and follows only those links. Thanks for using that.

Flexicat · 09-22-2010, 07:36 PM

Quote:

Originally Posted by Starson17

I'll take a look at it. IIRC, I spotted this error a while back and wrote some code to bypass it, but didn't see anyone complaining and never got around to uploading it. I'll hunt it up and post it. It's not you.

Edit: I checked and apparently, I did upload the revised recipe. I tested the current built in and it works fine. Are you perhaps using an earlier version that I uploaded here? If you are, switch to the built in that is now supplied with Calibre. The error you are getting looks like the error from the earlier version.

Thank you Starson17, that was the issue.

bhandarisaurabh · 09-22-2010, 08:42 PM

Quote:

Originally Posted by TonytheBookworm

Alright, first let's not piggyback but make our own version, since the feeds are different. With that said, I had to test the code to get it right, because the index on the split is 0-based and the very last element is blank. So if the length of the split array is 8, the id is at index 6, and I just take the element at index len(split1) - 2.
Anyway, this code works.

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'Business Standard modified'
    language = 'en'
    __author__ = 'TonytheBookworm'
    description = 'Business Standard modified'
    publisher = 'Business Standard'
    category = ''
    oldest_article = 5
    max_articles_per_feed = 100
    no_stylesheets = True
    #extra_css = '.headline {font-size: x-large;} \n .fact { padding-top: 10pt }'
    #masthead_url = 'http://gawand.org/wp-content/uploads/2010/06/ajc-logo.gif'
    #keep_only_tags    = [
     #                    dict(name='div', attrs={'class':['blogEntryHeader','blogEntryContent']})
      #                 ,dict(attrs={'id':['cxArticleText','cxArticleBodyText']})
      #                  ]
    feeds = [
             (u'Todays Newspaper',u'http://feeds.business-standard.com/rss/paper.xml'),
             (u'Banking & finance',u'http://feeds.business-standard.com/rss/1.xml'),
             (u'Companies & Industry', u'http://feeds.business-standard.com/rss/2.xml'),
             (u'Economy & Policy'    , u'http://feeds.business-standard.com/rss/3.xml'),
             (u'Opinion and analysis', u'http://feeds.business-standard.com/rss/5_0.xml'),
             (u'Life & Leisure'      , u'http://feeds.business-standard.com/rss/6_0.xml'),
             (u'Markets & Investing' , u'http://feeds.business-standard.com/rss/12.xml'),
             (u'Management & Mktg'   , u'http://feeds.business-standard.com/rss/7_0.xml'),
             (u'Tech World',u'http://feeds.business-standard.com/rss/8_0.xml'),
            ]
    def print_version(self, url):
        split1 = url.split("/")
        self.log('ORG URL IS: ', url)
        # Offset by 2: the index is 0-based and the last element is blank
        # because the URL ends with a slash.
        id_index = len(split1) - 2
        idnum = split1[id_index]  # the article id itself
        self.log('the idnum is: ', idnum)
        print_url = 'http://www.business-standard.com/india/printpage.php?autono=' + idnum + '&tp='
        self.log('PRINT URL IS: ', print_url)
        return print_url

Thanks a ton, it's working fine.
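The index arithmetic in print_version can be checked outside calibre with a plain function. The article URL below is hypothetical (only its shape, eight pieces with a trailing slash, follows the description in the quoted post); the print-page URL format is the one from the recipe.

```python
def print_version(url):
    """Build the print-page URL from an article URL that ends with '/'."""
    parts = url.split('/')
    # The trailing slash leaves an empty last element, so the id is
    # the second-to-last piece.
    article_id = parts[len(parts) - 2]
    return ('http://www.business-standard.com/india/printpage.php?autono='
            + article_id + '&tp=')


# Hypothetical article URL, for illustration only.
sample = 'http://www.business-standard.com/india/news/sample-headline/408723/'
parts = sample.split('/')
print(len(parts), parts[6])
print(print_version(sample))
```

Here the split yields 8 pieces, the last one empty, and the id sits at index 6, which is len(parts) - 2. If a feed ever delivers a URL without the trailing slash, this offset picks the wrong piece, so it is worth logging a few real URLs first.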

bhandarisaurabh · 09-22-2010, 08:43 PM

There is already a recipe for Foreign Policy, but it covers the RSS feeds. Can anyone make a recipe for the print edition?
http://www.foreignpolicy.com/issues/current
thanks in advance

jenden · 09-22-2010, 10:05 PM

Could you please create a Kindle recipe for the French version of the Jerusalem Post?
http://fr.jpost.com/

Thanks

TonytheBookworm · 09-22-2010, 11:36 PM

Quote:

Originally Posted by bhandarisaurabh

There is already a recipe for Foreign Policy, but it covers the RSS feeds. Can anyone make a recipe for the print edition?
http://www.foreignpolicy.com/issues/current
thanks in advance

I know this is gonna sound rude but just curious why can't you try to do it now?
I know I have personally done five or more recipes for you and given you detailed tips on how to do it. I have no problem whatsoever helping you, and I think I speak for the rest of us here when I say: give it a try. Post some of your code, ask specific questions, search the built-in recipes, and search this forum, starting with my first post and working your way forward. I didn't know anything about this at all, other than that it could be done with a will. So may I suggest trying to learn how to do it, so you can join us in making calibre better by contributing your recipes. I hope you understand where I'm coming from, and I hate to sound off base.
Once again, I am here to help and wouldn't know what I know without the help of others. Yet, the only way you are gonna learn this stuff is by doing it

rayh · 09-23-2010, 05:35 AM

Is there any chance someone could produce a recipe for a Melbourne, Australia newspaper called the Herald Sun?

Thanks Ray

bhandarisaurabh · 09-23-2010, 06:39 AM

Quote:

Originally Posted by TonytheBookworm

I know this is gonna sound rude but just curious why can't you try to do it now?
I know I have personally done five or more recipes for you and given you detailed tips on how to do it. I have no problem whatsoever helping you, and I think I speak for the rest of us here when I say: give it a try. Post some of your code, ask specific questions, search the built-in recipes, and search this forum, starting with my first post and working your way forward. I didn't know anything about this at all, other than that it could be done with a will. So may I suggest trying to learn how to do it, so you can join us in making calibre better by contributing your recipes. I hope you understand where I'm coming from, and I hate to sound off base.
Once again, I am here to help and wouldn't know what I know without the help of others. Yet, the only way you are gonna learn this stuff is by doing it

Okay, I will try it. I am not a programmer, but I will try to understand. Can you give me a link where I can learn about parsing the whole page?

Starson17 · 09-23-2010, 07:45 AM

Quote:

Originally Posted by TonytheBookworm

Cool, I just learned something from you. The match_regex is great. I would have done that with make_links() like you showed me in the past, but I saw the match_regex and wondered what the heck it does. Then I saw that it looks at the page, finds those links, and follows only those links. Thanks for using that.

Note that first I turned on recursion. The match_regex is to prevent recursion from crawling all over the web to unrelated places.
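The filtering idea can be sketched in plain Python. In a recipe, as far as I know, the corresponding class attributes are recursions (how many link levels to follow) and match_regexps (a list of patterns a link must match to be followed). The pattern and URLs below are hypothetical, and the helper is an illustration of the idea, not calibre's actual implementation.

```python
import re

# Hypothetical pattern: only follow article links inside an issue.
match_regexps = [r'example\.com/issue/\d+/article/']


def is_link_wanted(url, patterns):
    """Return True when the URL matches at least one pattern."""
    return any(re.search(pattern, url) for pattern in patterns)


# Hypothetical links found on an issue page.
links_on_page = [
    'http://example.com/issue/42/article/fire-and-ice',
    'http://example.com/subscribe',
    'http://ads.example.net/banner?id=7',
]
followed = [url for url in links_on_page if is_link_wanted(url, match_regexps)]
print(followed)
```

Without a pattern like this, turning recursion on would let the downloader follow every link on the page, including ads and navigation.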

TonytheBookworm · 09-23-2010, 12:38 PM

Quote:

Originally Posted by rayh

Is there any chance someone could produce a recipe for a Melbourne, Australia newspaper called the Herald Sun?

Thanks Ray

Please go to http://www.heraldsun.com.au/help/rss and tell me which feeds you would like, and I will work on it for you.

I will do the breaking news feed for now and await your reply.

Edit: I went ahead and did the whole thing. I commented out the AFL teams, and you can pick whichever one you like.

TonytheBookworm · 09-24-2010, 01:45 AM

Starson17,
I need your help on this one if you've got a minute. I have been battling this feed, which I figured would be simple to do, but for some reason it is giving me trouble even with the basics. If I take the keep_only_tags out it works, but of course I want to use that to get rid of the ads and all the other junk.
I have tried every dang tag I can think of, filtering with Firebug. This is what I have come up with so far. Basic for sure, but I get no content when I keep only the tag that appears to be the parent. HELP

Here is what I have so far. If you can just help me with the keep_only_tags, I think I can figure out the rest, unless something screwy is going on here that I have never faced before. Thanks.

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'How To Geek'
    language = 'en'
    __author__ = 'TonytheBookworm'
    description = 'Daily Computer Tips and Tricks'
    publisher = 'Howtogeek'
    category = 'PC,tips,tricks'
    oldest_article = 2
    max_articles_per_feed = 100
    linearize_tables = True
    no_stylesheets = True
    remove_javascript = True
    keep_only_tags = [
        dict(name='div', attrs={'class':['yui-u']})
    ]
    feeds = [
        ('Tips', 'http://feeds.howtogeek.com/howtogeek')
    ]

edit:
Alright, I got it working, but I'm confused. In previous feeds I have done, I enter the feed address and it gets each link, uses it as the title, and parses part of the content listed under it as a description. In this feed, the content is all on the feed page, so it doesn't go to the actual link. In the code above I was assuming that it went to the links inside the feed one by one, and I was trying to strip the content that the link showed.
So my question to you is: what determines whether it uses the feed's main page content (the one that has all the links on it) or navigates to each link? I hope you understand what I'm asking; if not, I will try to explain myself better.
This code works because, for whatever reason, the links on the feed page are not followed. But in other basic feeds I have done nothing more than add the feed, and it follows the links.

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'How To Geek'
    language = 'en'
    __author__ = 'TonytheBookworm'
    description = 'Daily Computer Tips and Tricks'
    publisher = 'Howtogeek'
    category = 'PC,tips,tricks'
    oldest_article = 2
    max_articles_per_feed = 100
    linearize_tables = True
    no_stylesheets = True
    remove_javascript = True
    remove_tags = [
        dict(name='a', attrs={'target':['_blank']}),
        dict(name='table', attrs={'id':['articleTable']}),
        dict(name='div', attrs={'class':['feedflare']}),
    ]
    feeds = [
        ('Tips', 'http://feeds.howtogeek.com/howtogeek')
    ]
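One recipe attribute that likely bears on this question is use_embedded_content. As I understand the calibre recipe documentation, the default (None) lets calibre guess from the length of the text embedded in the feed, True always uses the embedded text, and False always downloads the page each entry links to. The sketch below is untested against this feed; the class name and title are made up, and the try/except stand-in exists only so the snippet loads outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch loads outside calibre; inside calibre the
    # real base class is used.
    class BasicNewsRecipe(object):
        pass


class HowToGeekFollowLinks(BasicNewsRecipe):
    title = 'How To Geek (follow links)'  # hypothetical title for this sketch
    oldest_article = 2
    max_articles_per_feed = 100
    no_stylesheets = True
    # None (default): calibre guesses from the length of the embedded content.
    # True: always use the text embedded in the feed.
    # False: always download the page each feed entry links to.
    use_embedded_content = False
    # Written for the linked page's markup; whether 'yui-u' is the right
    # class there is not verified in this sketch.
    keep_only_tags = [dict(name='div', attrs={'class': ['yui-u']})]
    feeds = [('Tips', 'http://feeds.howtogeek.com/howtogeek')]
```

If the guess is what sent the first version down the embedded-content path, that would also explain the empty output: the keep_only_tags filter was written for the full page, and the feed text may simply not contain that div.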
