
Go Back   MobileRead Forums > E-Book Software > Calibre > Recipes

Old 09-03-2010, 12:12 PM   #2611
kovidgoyal
creator of calibre
Posts: 45,209
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
@JvdW: nrcnext uses the parse_index function to get its list of articles, and the website has changed, so it fails. Unfortunately, as I don't read Dutch, it's hard for me to fix.
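For anyone trying to repair it: a recipe that overrides parse_index returns a list of (section title, article list) pairs, where each article is a dict. When the site's HTML changes, the code that builds this structure finds nothing and the recipe fails. A minimal sketch of the expected return shape (the section name, URL, and titles below are invented placeholders, not nrcnext's real data):

```python
# Sketch of the data structure parse_index() is expected to return.
# All field values here are placeholders.
def build_index():
    articles = [
        {'title': 'Example article',
         'url': 'http://www.nrcnext.nl/example-article',  # placeholder
         'date': '09-03-2010',
         'description': 'Short summary shown in the table of contents'},
    ]
    # A list of (feed/section title, list-of-article-dicts) tuples
    return [('Voorpagina', articles)]

index = build_index()
print(index[0][0])            # section title
print(index[0][1][0]['url'])  # first article url
```

Fixing the recipe means updating whatever selectors populate this structure to match the site's new markup.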
Old 09-03-2010, 01:50 PM   #2612
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
If I have asked this before, please forgive me, but I can't remember how.

If I have an RSS feed that shows some links, and the links look like this:
http://www.nfl.com/goto?id=09000d5d81a38fd4
but the link gets automatically changed to something like this when the article loads:
http://www.nfl.com/preseason/story/0...ffers-torn-mcl

How could I get the URL that is produced when the article loads? To get the print version, all I need to do is
Code:
    def print_version(self, url):
        print_url = url.replace('/article/', '/printable')
        print 'THE PRINTABLE URL IS: ', print_url
        return print_url
which would give me
http://www.nfl.com/preseason/story/0...ffers-torn-mcl

but instead I get:
http://www.nfl.com/goto?id=09000d5d81a38fd4

Thanks. And Starson17, I'm doing like you said: I'm making one big template, with comments and all, of how to do certain things so I can cut and paste the "tricks". Thanks for the advice.
Old 09-03-2010, 02:53 PM   #2613
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm View Post
If I have an RSS feed that shows some links ...
but that link gets automatically changed...
How the heck could I get the url that is produced when the article loads? Cause to get the print version all I need to do is
When you have an RSS feed that redirects you or sends you to an article link that's not easily figured out from the RSS link, you have two basic solutions.

The first is to skip the idea of getting the print version. Just use keep_only and remove_tags, etc. to keep what you want from the main non-print article. That's my preferred solution.
The other is to treat the link as being obfuscated.
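A third trick that sometimes works with goto-style links: fetch the goto page once and pull the real article URL out of the HTML it serves, since many pages carry it in a &lt;link rel="canonical"&gt; tag, then hand that to print_version. A hedged stdlib sketch; the sample HTML below is made up, and whether nfl.com actually emits a canonical tag would need checking:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of a <link rel="canonical"> tag, if present."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == 'link' and d.get('rel') == 'canonical':
            self.canonical = d.get('href')

def real_url(html_text, fallback):
    """Return the canonical URL found in html_text, or the fallback."""
    p = CanonicalFinder()
    p.feed(html_text)
    return p.canonical or fallback

# Made-up sample of what the redirect target's <head> might contain
sample = '<head><link rel="canonical" href="http://www.nfl.com/preseason/story/example"></head>'
print(real_url(sample, 'http://www.nfl.com/goto?id=123'))
```

In a recipe you would fetch the goto URL with the recipe's browser, run something like real_url over the page source, and build the print link from the result.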
Old 09-03-2010, 05:19 PM   #2614
TonytheBookworm
Addict
Quote:
Originally Posted by Starson17 View Post
When you have an RSS feed that redirects you or sends you to an article link that's not easily figured out from the RSS link, you have two basic solutions.

The first is to skip the idea of getting the print version. Just use keep_only and remove_tags, etc. to keep what you want from the main non-print article. That's my preferred solution.
The other is to treat the link as being obfuscated.

Thanks. I guess that is all the fun in this. Some of the feeds are hard as crap to figure out, while others are easy. I think the easy ones tend to be designed by professionals who actually take the time to follow general organizational patterns; that's my thought. In some cases, though, it could simply be a site trying to make itself impossible to parse. Anyway, thanks again. I did do that stuff in UltraEdit and I love it. What I do is keep the actual myrecipe.txt open, and when I run the batch it tells me that myrecipe.txt has been modified, so I hit yes and see the changes. I really find that to be great. I have also used the search feature to find where others did things like splits and removes and so on.
Old 09-03-2010, 07:01 PM   #2615
TonytheBookworm
Addict
If you were faced with something like this, how would you remove it?
Take a look at this link: http://www.nfl.com/gamecenter/201009...cap/full-story
Notice it has the fantasy football block in it.
Spoiler:

Code:
<div style="">
<div style="margin: 5px; font-size: 11px; float: right; padding: 10px; background-color: rgb(255, 255, 255); border: 1px solid rgb(204, 204, 204); font-family: arial; width: 255px;">
<table>
<tbody><tr>
<td>
</td>
</tr>
<tr>
<td>
<a href="http://fantasy.nfl.com/" onclick="s_objectID=&quot;http://fantasy.nfl.com/_3&quot;;return this.s_oc?this.s_oc(e):true"><img border="0" class="teamslandinggrid" src="http://static.nfl.com/static/content/catch_all/nfl_image/Fantasy_250x150_1.jpg"></a>
</td>
</tr>
<tr>
<td colspan="3">
<br><b>2010 NFL.com fantasy football games</b>
<br>» <a href="http://fantasy.nfl.com/registration/privateleaguecreate" onclick="s_objectID=&quot;http://fantasy.nfl.com/registration/privateleaguecreate_1&quot;;return this.s_oc?this.s_oc(e):true">Create a customizable league</a>
<br>» <a href="http://fantasy.nfl.com/registration/leagueDirectory?leagueType=private" onclick="s_objectID=&quot;http://fantasy.nfl.com/registration/leagueDirectory?leagueType=private_1&quot;;return this.s_oc?this.s_oc(e):true">Join a custom private league</a>
<br>» <a href="http://fantasy.nfl.com/registration/leagueDirectory" onclick="s_objectID=&quot;http://fantasy.nfl.com/registration/leagueDirectory_1&quot;;return this.s_oc?this.s_oc(e):true">Join an NFL-managed league</a>
<br>» <a href="http://fantasy.nfl.com/draftcenter/mockdrafts" onclick="s_objectID=&quot;http://fantasy.nfl.com/draftcenter/mockdrafts_1&quot;;return this.s_oc?this.s_oc(e):true">Join a 10-team mock draft</a>
<br>
<br><b>Dominate your fantasy football draft!</b>
<br>» <a href="http://www.nfl.com/fantasy/draftkit" onclick="s_objectID=&quot;http://www.nfl.com/fantasy/draftkit_1&quot;;return this.s_oc?this.s_oc(e):true">NFL.com's in-depth draft kit</a>
<br>» <a href="http://www.nfl.com/fantasy/rankings" onclick="s_objectID=&quot;http://www.nfl.com/fantasy/rankings_1&quot;;return this.s_oc?this.s_oc(e):true">2010 fantasy player rankings</a>
<br>» <a href="http://www.nfl.com/goto?id=09000d5d817fb977" onclick="s_objectID=&quot;http://www.nfl.com/goto?id=09000d5d817fb977_1&quot;;return this.s_oc?this.s_oc(e):true">Complete profiles/projections</a>
<br>» <a href="http://www.nfl.com/fantasy" onclick="s_objectID=&quot;http://www.nfl.com/fantasy_1&quot;;return this.s_oc?this.s_oc(e):true">NFL.com Fantasy home page</a>
<br>
</td>
</tr>
</tbody></table>
</div></div>


I've tried doing a
Code:
remove_tags =[dict(attrs={'style':[""]})]
I even tried
Code:
    def postprocess_html(self, soup):
        for tag in soup.findAll(attrs={'style': [' ']}):
            tag.extract()
        return soup
all with no success. Am I just picking hard stuff to figure out, or are these just common problems with someone just learning this stuff?
Old 09-03-2010, 08:03 PM   #2616
Starson17
Wizard
Quote:
Originally Posted by TonytheBookworm View Post
if you were faced with something like this how would you remove it?
How about removing all <table> tags?

If that's too much, you could search to see if the table tag has a fantasy football link in it, and extract it only if it does.

You can do search and replace, etc.
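The "extract the table only if it contains a fantasy link" idea, sketched with the stdlib so it can be run standalone; a real recipe would do the same walk with BeautifulSoup's findAll and extract, and the snippet below is a toy document, not NFL's actual markup:

```python
import xml.etree.ElementTree as ET

def strip_promo_tables(html_text, marker='fantasy.nfl.com'):
    """Remove any <table> whose links point at the marker domain."""
    root = ET.fromstring(html_text)
    # ElementTree has no parent pointers, so map children to parents first
    parents = {child: parent for parent in root.iter() for child in parent}
    for table in list(root.iter('table')):
        hrefs = [a.get('href', '') for a in table.iter('a')]
        if any(marker in h for h in hrefs):
            parents[table].remove(table)
    return ET.tostring(root, encoding='unicode')

doc = ('<div><p>Story text</p>'
       '<table><tr><td><a href="http://fantasy.nfl.com/">promo</a></td></tr></table>'
       '<table><tr><td>box score</td></tr></table></div>')
print(strip_promo_tables(doc))
```

Only the table carrying the promo link is dropped; the table with real content survives.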

I'd say they are "common problems with someone just learning this stuff?"
Old 09-03-2010, 09:19 PM   #2617
TonytheBookworm
Addict
Quote:
Originally Posted by Starson17 View Post
How about remove all <table> tags?

If that's too much, you could search to see if the table tag has a fantasy football link in it, and extract it only if it does.

You can do search and replace, etc.

I'd say they are "common problems with someone just learning this stuff?"
What might be wrong with this? Removing the whole table was a tad much, but I do notice the promo tables have the fantasy football link and the audio pass links in them. I honestly don't see why this doesn't work. I hate being a noob and asking so many questions; really I do, because I feel like I'm annoying people. Anyway, here is what I have...

Spoiler:

Code:
def preprocess_html(self, soup):
       for article in table.findAll('table') :
            if article.find(href=re.compile('https://audiopass.nfl.com/nflap/secure/registerform?icampaign=AP_article') :
                article.extract()
            else :
                if article.find(href=re.compile('http://fantasy.nfl.com/') 
                  article.extract()
            else :
                continue
        return soup


My understanding of the above is that it should find all instances of the <table> tag, then look inside each one for the https and http links specified. If it finds either of them, it should extract that table from the soup; otherwise it continues on. Then it returns the soup without those links. Yet that doesn't happen.
Old 09-04-2010, 12:10 AM   #2618
TonytheBookworm
Addict
New Recipe for Georgia Outdoor News.
The only issue with this is that some of the links do not have actual titles; the text simply states "Read More". If anyone cares to fix that, feel free. This version only includes a print_version() of the page (aka without the pretty pictures). I might update it in the future to include the pics from the non-print version.
I didn't do the entire page, only the hunting section for deer, waterfowl, and wildlife management, and then fishing for bass, trout, and fishing & lake reports. Enjoy.


P.S. When loaded on the Kindle 2 it seems to cut the text off on the right-hand side. I don't know if this is a bug, because I saw something similar posted in the bug reports for calibre. But it appears the content is within a table and the user is forced to pan. Maybe someone can help me figure this issue out. Thanks
Attached Files
File Type: rar gon.rar (1.9 KB, 253 views)

Last edited by TonytheBookworm; 09-04-2010 at 01:36 PM. Reason: Fixed Table issues. Thanks Starson17 :)
Old 09-04-2010, 06:04 AM   #2619
poloman
Enthusiast
Posts: 25
Karma: 10
Join Date: Nov 2008
Device: PRS505, Kindle 3G
Hi! I'd like to learn how to do some feeds. I've read the tutorial, and the site I'm after doesn't quite work. Are there any tips/examples for FeedBurner-based feeds?

Ideally, I'd like to create a recipe for The Daily Mash : http://feeds.feedburner.com/thedailymash

Thanks for any help you can give!
Old 09-04-2010, 10:23 AM   #2620
Starson17
Wizard
Quote:
Originally Posted by TonytheBookworm View Post
What might be wrong with this?
Code:
def preprocess_html(self, soup):
       for article in table.findAll('table') :
my understanding of the above is it should find all instances of the <table> tag and then take and look inside that for the https and http links specified. If it finds either of them it will extract it from the soup. otherwise it will continue on. then return the soup without those links yet that doesn't happen
It should be
Code:
def preprocess_html(self, soup):
       for article in soup.findAll('table') :
Otherwise, you are looking for table tags inside the undefined name "table" instead of in the soup.
Old 09-04-2010, 10:28 AM   #2621
Starson17
Wizard
Quote:
Originally Posted by TonytheBookworm View Post
it appears the content is within a table and the user is forced to pan. Maybe someone can help me figure this issue out. Thanks
I remove all tables in recipes. They tend to cause trouble.

You can use:
conversion_options = {'linearize_tables' : True}
or something like:
Code:
    def postprocess_html(self, soup, first_fetch):
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'
        return soup
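The same renaming trick in a runnable stdlib form, to show what linearizing actually does: every table/tr/td element keeps its contents but becomes a plain div, so the reader no longer has to pan. (A real recipe does this on BeautifulSoup tags as above; this toy uses ElementTree and an invented snippet.)

```python
import xml.etree.ElementTree as ET

def linearize_tables(html_text):
    """Rename table/tr/td elements to div, keeping their contents."""
    root = ET.fromstring(html_text)
    for el in root.iter():
        if el.tag in ('table', 'tr', 'td'):
            el.tag = 'div'
    return ET.tostring(root, encoding='unicode')

doc = '<body><table><tr><td>cell one</td><td>cell two</td></tr></table></body>'
print(linearize_tables(doc))
# the table structure comes back as nested divs
```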
Old 09-04-2010, 12:40 PM   #2622
TonytheBookworm
Addict
Quote:
Originally Posted by Starson17 View Post
I remove all tables in recipes. They tend to cause trouble.

You can use:
conversion_options = {'linearize_tables' : True}
or something like:
Code:
    def postprocess_html(self, soup, first_fetch):
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'
        return soup
Ahhh, so that is why that was in there! I saw that in one of the other recipes but wasn't sure why it was there. Let me see if I understand, and correct me if I'm wrong: in the postprocess it is finding all instances of the table, tr, and td tags and changing their name to div, making them div tags if you will. One last thing while on the subject: I wasn't too clear on the postprocess_html parameters. It takes three arguments. The first two I understand, but I'm confused about first_fetch, because in some recipes I noticed they use first. So are these reserved words, and if so, what do they do exactly? Thanks again. Learning so much from you!!!
Old 09-04-2010, 12:49 PM   #2623
TonytheBookworm
Addict
Quote:
Originally Posted by poloman View Post
Hi! I'd like to learn how to do some feeds - I've read the tutorial and the site I'm after doesn't quite work - are there any tips/examples for FeedBurner-based feeds?

Ideally, I'd like to create a recipe for The Daily Mash : http://feeds.feedburner.com/thedailymash

Thanks for any help you can give!
Poloman, my tip is this, and I don't mean to come across as rude by saying it: do like I'm doing and jump in head first. Even though I have programmed in C# for years, the Python scripting is different for me. Nonetheless, look at the recipes that are already provided. On a Windows-based system they are in /program files/calibre2/resources/recipes (or along that path).

First, when you get it pulling the feed, you will say, "hey, that's not how I want it to look." So then you do like I did and go, "hmmm, how do I remove the stuff?" I started doing a search in the recipes for "remove" and came across remove_tags and remove_tags_after and so on, and also keep_only. I tried those methods; if they worked, I patted myself on the back, and if they didn't, I posted segments of my code (or in some cases the whole thing, in spoiler and code tags), and the good folks on this site will generally help you out in a timely manner, given you put forth the effort. I know Starson17 has helped me big time, along with a few others.

Bottom line: yes, it is complicated to learn (heck, I'm still figuring it out), but once you start to get the basics, you develop an arsenal to attack almost any feed you are faced with.

I for one feel defeated when I work on something for hours and then someone comes along and, instead of explaining what they did, simply does it. Yes, I'm grateful when they do that, yet at the same time I feel let down, because I haven't learned anything.

So give it a try and let us know where we can help.

Here, take a look at this to give you an idea. This should work for you, but read the comments in it so you can get an understanding of how I went about it. The only thing I can't figure out on this is how to remove the style tags to get rid of the Digg links and so forth at the bottom.

Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'Daily Mash'
    language = 'en'
    __author__ = 'TonytheBookworm'
    description = 'The Daily Mash'
    publisher = 'Tony Stegall'
    category = ''
    oldest_article = 7
    conversion_options = {'linearize_tables' : True}
    max_articles_per_feed = 100
    no_stylesheets = True

    masthead_url = 'http://www.thedailymash.co.uk/images/mashlogo5.gif'

    feeds = [
              ('Daily Mash', 'http://feeds.feedburner.com/thedailymash'),
            ]

    def print_version(self, url):
        split1 = url.split("?")  # split the url on the ?
        split2 = split1[1]       # keep the second part (the query string); the list is 0-based
        print 'THE SPLIT IS :', split2  # test output so I can see the result of the split

        #-----------------------------------------------------------------------------------------------
        #- This is how the original url comes in and how it needs to be converted to get a print version
        #-----------------------------------------------------------------------------------------------

        # example of the link to convert:
        # Original link: http://www.thedailymash.co.uk/index.php?option=com_content&task=view&id=3060&Itemid=74
        # print version: http://www.thedailymash.co.uk/index2.php?option=com_content&task=view&id=3060&pop=1&page=0&Itemid=74

        # Now that I have my splits, I piece it together:
        # 1) start with a constant url of www.thedailymash.co.uk/index2.php
        # 2) append my split to the end of it
        # 3) add &page=0&pop=1 to the end
        # 4) the result is the url in print format

        print_url = 'http://www.thedailymash.co.uk/index2.php?' + split2 + '&page=0&pop=1'
        print 'print_url is: ', print_url
        return print_url
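The split-and-rebuild logic above is pure string work, so it can be checked outside calibre. A standalone version of the same transformation, using the same constant prefix and suffix as the recipe:

```python
def daily_mash_print_version(url):
    # keep everything after the '?' (the query string) ...
    query = url.split('?')[1]
    # ... and graft it onto the index2.php print endpoint
    return 'http://www.thedailymash.co.uk/index2.php?' + query + '&page=0&pop=1'

src = 'http://www.thedailymash.co.uk/index.php?option=com_content&task=view&id=3060&Itemid=74'
print(daily_mash_print_version(src))
```

The rebuilt URL appends &page=0&pop=1 at the end rather than in the middle as in the example comment, but since these are query parameters the order should not matter to the server.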

Last edited by TonytheBookworm; 09-04-2010 at 03:24 PM. Reason: added Recipe
Old 09-04-2010, 01:20 PM   #2624
TonytheBookworm
Addict
There was a typo pointed out in the West Hawaii Today online feed: the local feed was missing a comma.
Here is the updated version with the comma in it.
Attached Files
File Type: rar westhawaiitoday-update.rar (840 Bytes, 245 views)
Old 09-04-2010, 03:48 PM   #2625
Starson17
Wizard
Quote:
Originally Posted by TonytheBookworm View Post
it is finding all instances of the table, tr, and td tags and changing their name to div, making them div tags if you will.
Exactly correct.

Quote:
One last thing while on the subject. I wasn't too clear on the postprocess_html parameters. It takes 3 arguments. The first 2 I understand but I'm confused about the first_fetch cause in some recipes I noticed they use first. So are these reserved words and if so what do they do exactly? Thanks again. Learning so much from you!!!
So you want all the secrets eh? I quote: "first_fetch – True if this is the first page of an article."

You probably haven't used it much, but there is a recursions parameter that causes the recipe to follow links. The result is that links on the article page are fetched and work within the ebook. (By default it's off, so links aren't followed/fetched.)

I have a recipe of food recipes. The main food recipe on page 1 of the article may have a link to another food recipe, like a sauce or a side dish. I have recursion turned on to fetch those related recipes. First_fetch is true only on the first page.