
Go Back   MobileRead Forums > E-Book Software > Calibre > Recipes

Old 09-03-2010, 12:12 PM   #2611
kovidgoyal
creator of calibre
Posts: 45,209
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
@JvdW: nrcnext uses the parse_index function to get its list of articles, and the website has changed, so it fails. Unfortunately, as I don't read Dutch, it's hard for me to fix.
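For anyone trying to repair it: a recipe that overrides parse_index returns a list of (section title, article list) pairs, where each article is a dict. When the site's HTML changes, the code that builds this structure finds nothing and the recipe fails. A minimal sketch of the expected return shape (the section name, URL, and titles below are invented placeholders, not nrcnext's real data):

```python
# Sketch of the data structure parse_index() is expected to return.
# All field values here are placeholders.
def build_index():
    articles = [
        {'title': 'Example article',
         'url': 'http://www.nrcnext.nl/example-article',  # placeholder
         'date': '09-03-2010',
         'description': 'Short summary shown in the table of contents'},
    ]
    # A list of (feed/section title, list-of-article-dicts) tuples
    return [('Voorpagina', articles)]

index = build_index()
print(index[0][0])            # section title
print(index[0][1][0]['url'])  # first article url
```

Fixing the recipe means updating whatever selectors populate this structure to match the site's new markup.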
Old 09-03-2010, 01:50 PM   #2612
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
If I have asked this before, please forgive me, but I can't remember how.

If I have an RSS feed that shows some links, and the links look like this:
http://www.nfl.com/goto?id=09000d5d81a38fd4
but the link gets automatically changed to something like this when the article loads:
http://www.nfl.com/preseason/story/0...ffers-torn-mcl

How could I get the URL that is produced when the article loads? To get the print version, all I need to do is
Code:
    def print_version(self, url):
        print_url = url.replace('/article/', '/printable')
        print 'THE PRINTABLE URL IS: ', print_url
        return print_url
which would give me
http://www.nfl.com/preseason/story/0...ffers-torn-mcl

but instead I get:
http://www.nfl.com/goto?id=09000d5d81a38fd4

Thanks. And Starson17, I'm doing like you said: I'm making one big template, with comments and all, of how to do certain things so I can cut and paste the "tricks". Thanks for the advice.
Old 09-03-2010, 02:53 PM   #2613
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm View Post
If I have an RSS feed that shows some links ...
but that link gets automatically changed...
How the heck could I get the url that is produced when the article loads? Cause to get the print version all I need to do is
When you have an RSS feed that redirects you or sends you to an article link that's not easily figured out from the RSS link, you have two basic solutions.

The first is to skip the idea of getting the print version. Just use keep_only and remove_tags, etc. to keep what you want from the main non-print article. That's my preferred solution.
The other is to treat the link as being obfuscated.
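A third trick that sometimes works with goto-style links: fetch the goto page once and pull the real article URL out of the HTML it serves, since many pages carry it in a &lt;link rel="canonical"&gt; tag, then hand that to print_version. A hedged stdlib sketch; the sample HTML below is made up, and whether nfl.com actually emits a canonical tag would need checking:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of a <link rel="canonical"> tag, if present."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == 'link' and d.get('rel') == 'canonical':
            self.canonical = d.get('href')

def real_url(html_text, fallback):
    """Return the canonical URL found in html_text, or the fallback."""
    p = CanonicalFinder()
    p.feed(html_text)
    return p.canonical or fallback

# Made-up sample of what the redirect target's <head> might contain
sample = '<head><link rel="canonical" href="http://www.nfl.com/preseason/story/example"></head>'
print(real_url(sample, 'http://www.nfl.com/goto?id=123'))
```

In a recipe you would fetch the goto URL with the recipe's browser, run something like real_url over the page source, and build the print link from the result.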
Old 09-03-2010, 05:19 PM   #2614
TonytheBookworm
Addict
Quote:
Originally Posted by Starson17 View Post
When you have an RSS feed that redirects you or sends you to an article link that's not easily figured out from the RSS link, you have two basic solutions.

The first is to skip the idea of getting the print version. Just use keep_only and remove_tags, etc. to keep what you want from the main non-print article. That's my preferred solution.
The other is to treat the link as being obfuscated.

Thanks. I guess that is all the fun in this. Some of the feeds are hard as crap to figure out, while others are easy. I think the easy ones tend to be designed by professionals who actually take the time to follow general organizational patterns; that's my thought. In some cases, though, it could simply be a site trying to make itself impossible to parse. Anyway, thanks again. I did do that stuff in UltraEdit and I love it. What I do is keep the actual myrecipe.txt open, and when I run the batch it tells me that myrecipe.txt has been modified, so I hit yes and see the changes. I really find that to be great. I have also used the search feature to find where others did things like splits and removes and so on.
Old 09-03-2010, 07:01 PM   #2615
TonytheBookworm
Addict
If you were faced with something like this, how would you remove it?
Take a look at this link: http://www.nfl.com/gamecenter/201009...cap/full-story
Notice it has the fantasy football block in it.
Spoiler:

Code:
<div style="">
<div style="margin: 5px; font-size: 11px; float: right; padding: 10px; background-color: rgb(255, 255, 255); border: 1px solid rgb(204, 204, 204); font-family: arial; width: 255px;">
<table>
<tbody><tr>
<td>
</td>
</tr>
<tr>
<td>
<a href="http://fantasy.nfl.com/" onclick="s_objectID=&quot;http://fantasy.nfl.com/_3&quot;;return this.s_oc?this.s_oc(e):true"><img border="0" class="teamslandinggrid" src="http://static.nfl.com/static/content/catch_all/nfl_image/Fantasy_250x150_1.jpg"></a>
</td>
</tr>
<tr>
<td colspan="3">
<br><b>2010 NFL.com fantasy football games</b>
<br>» <a href="http://fantasy.nfl.com/registration/privateleaguecreate" onclick="s_objectID=&quot;http://fantasy.nfl.com/registration/privateleaguecreate_1&quot;;return this.s_oc?this.s_oc(e):true">Create a customizable league</a>
<br>» <a href="http://fantasy.nfl.com/registration/leagueDirectory?leagueType=private" onclick="s_objectID=&quot;http://fantasy.nfl.com/registration/leagueDirectory?leagueType=private_1&quot;;return this.s_oc?this.s_oc(e):true">Join a custom private league</a>
<br>» <a href="http://fantasy.nfl.com/registration/leagueDirectory" onclick="s_objectID=&quot;http://fantasy.nfl.com/registration/leagueDirectory_1&quot;;return this.s_oc?this.s_oc(e):true">Join an NFL-managed league</a>
<br>» <a href="http://fantasy.nfl.com/draftcenter/mockdrafts" onclick="s_objectID=&quot;http://fantasy.nfl.com/draftcenter/mockdrafts_1&quot;;return this.s_oc?this.s_oc(e):true">Join a 10-team mock draft</a>
<br>
<br><b>Dominate your fantasy football draft!</b>
<br>» <a href="http://www.nfl.com/fantasy/draftkit" onclick="s_objectID=&quot;http://www.nfl.com/fantasy/draftkit_1&quot;;return this.s_oc?this.s_oc(e):true">NFL.com's in-depth draft kit</a>
<br>» <a href="http://www.nfl.com/fantasy/rankings" onclick="s_objectID=&quot;http://www.nfl.com/fantasy/rankings_1&quot;;return this.s_oc?this.s_oc(e):true">2010 fantasy player rankings</a>
<br>» <a href="http://www.nfl.com/goto?id=09000d5d817fb977" onclick="s_objectID=&quot;http://www.nfl.com/goto?id=09000d5d817fb977_1&quot;;return this.s_oc?this.s_oc(e):true">Complete profiles/projections</a>
<br>» <a href="http://www.nfl.com/fantasy" onclick="s_objectID=&quot;http://www.nfl.com/fantasy_1&quot;;return this.s_oc?this.s_oc(e):true">NFL.com Fantasy home page</a>
<br>
</td>
</tr>
</tbody></table>
</div></div>


I've tried doing a
Code:
remove_tags =[dict(attrs={'style':[""]})]
I even tried
Code:
    def postprocess_html(self, soup):
        for tag in soup.findAll(attrs={'style': [' ']}):
            tag.extract()
        return soup
all with no success. Am I just picking hard stuff to figure out, or are these just common problems with someone just learning this stuff?
Old 09-03-2010, 08:03 PM   #2616
Starson17
Wizard
Quote:
Originally Posted by TonytheBookworm View Post
if you were faced with something like this how would you remove it?
How about removing all <table> tags?

If that's too much, you could search to see if the table tag has a fantasy football link in it, and extract it only if it does.

You can do search and replace, etc.
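The "extract the table only if it contains a fantasy link" idea, sketched with the stdlib so it can be run standalone; a real recipe would do the same walk with BeautifulSoup's findAll and extract, and the snippet below is a toy document, not NFL's actual markup:

```python
import xml.etree.ElementTree as ET

def strip_promo_tables(html_text, marker='fantasy.nfl.com'):
    """Remove any <table> whose links point at the marker domain."""
    root = ET.fromstring(html_text)
    # ElementTree has no parent pointers, so map children to parents first
    parents = {child: parent for parent in root.iter() for child in parent}
    for table in list(root.iter('table')):
        hrefs = [a.get('href', '') for a in table.iter('a')]
        if any(marker in h for h in hrefs):
            parents[table].remove(table)
    return ET.tostring(root, encoding='unicode')

doc = ('<div><p>Story text</p>'
       '<table><tr><td><a href="http://fantasy.nfl.com/">promo</a></td></tr></table>'
       '<table><tr><td>box score</td></tr></table></div>')
print(strip_promo_tables(doc))
```

Only the table carrying the promo link is dropped; the table with real content survives.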

I'd say they are "common problems with someone just learning this stuff?"
Old 09-03-2010, 09:19 PM   #2617
TonytheBookworm
Addict
Quote:
Originally Posted by Starson17 View Post
How about remove all <table> tags?

If that's too much, you could search to see if the table tag has a fantasy football link in it, and extract it only if it does.

You can do search and replace, etc.

I'd say they are "common problems with someone just learning this stuff?"
What might be wrong with this? Removing the whole table was a tad much, but I do notice the promo tables have the fantasy football link and the audio pass links in them. I honestly don't see why this doesn't work. I hate being a noob and asking so many questions; really I do, because I feel like I'm annoying people. Anyway, here is what I have...

Spoiler:

Code:
def preprocess_html(self, soup):
       for article in table.findAll('table') :
            if article.find(href=re.compile('https://audiopass.nfl.com/nflap/secure/registerform?icampaign=AP_article') :
                article.extract()
            else :
                if article.find(href=re.compile('http://fantasy.nfl.com/') 
                  article.extract()
            else :
                continue
        return soup


My understanding of the above is that it should find all instances of the <table> tag, then look inside each one for the https and http links specified. If it finds either of them, it should extract that table from the soup; otherwise it continues on. Then it returns the soup without those links. Yet that doesn't happen.
Old 09-04-2010, 12:10 AM   #2618
TonytheBookworm
Addict
New Recipe for Georgia Outdoor News.
The only issue with this is that some of the links do not have actual titles; the text simply states "Read More". If anyone cares to fix that, feel free. This version only includes a print_version() of the page (aka without the pretty pictures). I might update it in the future to include the pics from the non-print version.
I didn't do the entire page, only the hunting section for deer, waterfowl, and wildlife management, and then fishing for bass, trout, and fishing & lake reports. Enjoy.


P.S. When loaded on the Kindle 2 it seems to cut the text off on the right-hand side. I don't know if this is a bug, because I saw something similar posted in the bug reports for calibre. But it appears the content is within a table and the user is forced to pan. Maybe someone can help me figure this issue out. Thanks
Attached Files
File Type: rar gon.rar (1.9 KB, 253 views)

Last edited by TonytheBookworm; 09-04-2010 at 01:36 PM. Reason: Fixed Table issues. Thanks Starson17 :)
Old 09-04-2010, 06:04 AM   #2619
poloman
Enthusiast
Posts: 25
Karma: 10
Join Date: Nov 2008
Device: PRS505, Kindle 3G
Hi! I'd like to learn how to do some feeds. I've read the tutorial, and the site I'm after doesn't quite work. Are there any tips/examples for FeedBurner-based feeds?

Ideally, I'd like to create a recipe for The Daily Mash : http://feeds.feedburner.com/thedailymash

Thanks for any help you can give!
Old 09-04-2010, 10:23 AM   #2620
Starson17
Wizard
Quote:
Originally Posted by TonytheBookworm View Post
What might be wrong with this?
Code:
def preprocess_html(self, soup):
       for article in table.findAll('table') :
my understanding of the above is it should find all instances of the <table> tag and then take and look inside that for the https and http links specified. If it finds either of them it will extract it from the soup. otherwise it will continue on. then return the soup without those links yet that doesn't happen
It should be
Code:
def preprocess_html(self, soup):
       for article in soup.findAll('table') :
Otherwise, you are looking for table tags inside the undefined name "table" instead of in the soup.
Old 09-04-2010, 10:28 AM   #2621
Starson17
Wizard
Quote:
Originally Posted by TonytheBookworm View Post
it appears the content is within a table and the user is forced to pan. Maybe someone can help me figure this issue out. Thanks
I remove all tables in recipes. They tend to cause trouble.

You can use:
conversion_options = {'linearize_tables' : True}
or something like:
Code:
    def postprocess_html(self, soup, first_fetch):
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'
        return soup
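The same renaming trick in a runnable stdlib form, to show what linearizing actually does: every table/tr/td element keeps its contents but becomes a plain div, so the reader no longer has to pan. (A real recipe does this on BeautifulSoup tags as above; this toy uses ElementTree and an invented snippet.)

```python
import xml.etree.ElementTree as ET

def linearize_tables(html_text):
    """Rename table/tr/td elements to div, keeping their contents."""
    root = ET.fromstring(html_text)
    for el in root.iter():
        if el.tag in ('table', 'tr', 'td'):
            el.tag = 'div'
    return ET.tostring(root, encoding='unicode')

doc = '<body><table><tr><td>cell one</td><td>cell two</td></tr></table></body>'
print(linearize_tables(doc))
# the table structure comes back as nested divs
```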
Old 09-04-2010, 12:40 PM   #2622
TonytheBookworm
Addict
Quote:
Originally Posted by Starson17 View Post
I remove all tables in recipes. They tend to cause trouble.

You can use:
conversion_options = {'linearize_tables' : True}
or something like:
Code:
    def postprocess_html(self, soup, first_fetch):
        for t in soup.findAll(['table', 'tr', 'td']):
            t.name = 'div'
        return soup
Ahhh, so that is why that was in there! I saw that in one of the other recipes but wasn't sure why it was there. Let me see if I understand, and correct me if I'm wrong: in the postprocess it is finding all instances of the table, tr, and td tags and changing their name to div, making them div tags if you will. One last thing while on the subject: I wasn't too clear on the postprocess_html parameters. It takes three arguments. The first two I understand, but I'm confused about first_fetch, because in some recipes I noticed they use first. So are these reserved words, and if so, what do they do exactly? Thanks again. Learning so much from you!!!
Old 09-04-2010, 12:49 PM   #2623
TonytheBookworm
Addict
Quote:
Originally Posted by poloman View Post
Hi! I'd like to learn how to do some feeds - I've read the tutorial and the site I'm after doesn't quite work - are there any tips/examples for FeedBurner-based feeds?

Ideally, I'd like to create a recipe for The Daily Mash : http://feeds.feedburner.com/thedailymash

Thanks for any help you can give!
Poloman, my tip is this, and I don't mean to come across as rude by saying it: do like I'm doing and jump in head first. Even though I have programmed in C# for years, the Python scripting is different for me. Nonetheless, look at the recipes that are already provided. On a Windows-based system they are in /program files/calibre2/resources/recipes (or along that path).

First, when you get it pulling the feed, you will say, "hey, that's not how I want it to look." So then you do like I did and go, "hmmm, how do I remove the stuff?" I started doing a search in the recipes for "remove" and came across remove_tags and remove_tags_after and so on, and also keep_only. I tried those methods; if they worked, I patted myself on the back, and if they didn't, I posted segments of my code (or in some cases the whole thing, in spoiler and code tags), and the good folks on this site will generally help you out in a timely manner, given you put forth the effort. I know Starson17 has helped me big time, along with a few others.

Bottom line: yes, it is complicated to learn (heck, I'm still figuring it out), but once you start to get the basics, you develop an arsenal to attack almost any feed you are faced with.

I for one feel defeated when I work on something for hours and then someone comes along and, instead of explaining what they did, simply does it. Yes, I'm grateful when they do that, yet at the same time I feel let down, because I haven't learned anything.

So give it a try and let us know where we can help.

Here, take a look at this to give you an idea. This should work for you, but read the comments in it so you can get an understanding of how I went about it. The only thing I can't figure out on this is how to remove the style tags to get rid of the Digg links and so forth at the bottom.

Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'Daily Mash'
    language = 'en'
    __author__ = 'TonytheBookworm'
    description = 'The Daily Mash'
    publisher = 'Tony Stegall'
    category = ''
    oldest_article = 7
    conversion_options = {'linearize_tables' : True}
    max_articles_per_feed = 100
    no_stylesheets = True

    masthead_url = 'http://www.thedailymash.co.uk/images/mashlogo5.gif'

    feeds = [
              ('Daily Mash', 'http://feeds.feedburner.com/thedailymash'),
            ]

    def print_version(self, url):
        split1 = url.split("?")  # split the url on the ?
        split2 = split1[1]       # keep the second part (the query string); the list is 0-based
        print 'THE SPLIT IS :', split2  # test output so I can see the result of the split

        #-----------------------------------------------------------------------------------------------
        #- This is how the original url comes in and how it needs to be converted to get a print version
        #-----------------------------------------------------------------------------------------------

        # example of the link to convert:
        # Original link: http://www.thedailymash.co.uk/index.php?option=com_content&task=view&id=3060&Itemid=74
        # print version: http://www.thedailymash.co.uk/index2.php?option=com_content&task=view&id=3060&pop=1&page=0&Itemid=74

        # Now that I have my splits, I piece it together:
        # 1) start with a constant url of www.thedailymash.co.uk/index2.php
        # 2) append my split to the end of it
        # 3) add &page=0&pop=1 to the end
        # 4) the result is the url in print format

        print_url = 'http://www.thedailymash.co.uk/index2.php?' + split2 + '&page=0&pop=1'
        print 'print_url is: ', print_url
        return print_url
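The split-and-rebuild logic above is pure string work, so it can be checked outside calibre. A standalone version of the same transformation, using the same constant prefix and suffix as the recipe:

```python
def daily_mash_print_version(url):
    # keep everything after the '?' (the query string) ...
    query = url.split('?')[1]
    # ... and graft it onto the index2.php print endpoint
    return 'http://www.thedailymash.co.uk/index2.php?' + query + '&page=0&pop=1'

src = 'http://www.thedailymash.co.uk/index.php?option=com_content&task=view&id=3060&Itemid=74'
print(daily_mash_print_version(src))
```

The rebuilt URL appends &page=0&pop=1 at the end rather than in the middle as in the example comment, but since these are query parameters the order should not matter to the server.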

Last edited by TonytheBookworm; 09-04-2010 at 03:24 PM. Reason: added Recipe
Old 09-04-2010, 01:20 PM   #2624
TonytheBookworm
Addict
There was a typo pointed out in the West Hawaii Today online feed: the local feed was missing a comma.
Here is the updated version with the comma in it.
Attached Files
File Type: rar westhawaiitoday-update.rar (840 Bytes, 245 views)
Old 09-04-2010, 03:48 PM   #2625
Starson17
Wizard
Quote:
Originally Posted by TonytheBookworm View Post
it is finding all instances of the table, tr, and td tags and changing their name to div, making them div tags if you will.
Exactly correct.

Quote:
One last thing while on the subject. I wasn't too clear on the postprocess_html parameters. It takes 3 arguments. The first 2 I understand but I'm confused about the first_fetch cause in some recipes I noticed they use first. So are these reserved words and if so what do they do exactly? Thanks again. Learning so much from you!!!
So you want all the secrets eh? I quote: "first_fetch – True if this is the first page of an article."

You probably haven't used it much, but there is a recursions parameter that causes the recipe to follow links. The result is that links on the article page are fetched and work within the ebook. (By default it's off, so links aren't followed/fetched.)

I have a recipe of food recipes. The main food recipe on page 1 of the article may have a link to another food recipe, like a sauce or a side dish. I have recursion turned on to fetch those related recipes. First_fetch is true only on the first page.