Custom recipes (archive, read-only) - Page 13

kovidgoyal · 02-07-2009, 03:39 PM

Quote:

Originally Posted by XanthanGum

Hi,

I've had similar problems with the recipes I've tried to create for other publications. If I knew more about how to fine-tune the recipes, I'd have more luck, and I'd like to share them with everyone here.

What about a workshop for creating recipes? Those of you who understand the intricacies of how they work could team lead a workshop for us beginners. We could start out with simple examples and then gradually build on that so that we could solve the problem mentioned above.

Kovid? Dominic? Would such a workshop be possible? Is there anyone with time that could lead such a "class"?

The more recipes generated the better. I think it adds value to Kovid's fantastic program.

I would love to see such a workshop.

Tschuess (German for Bye)...

Xanthan Gum

I'm guessing the documentation is insufficient for what you need?

kilikini · 02-07-2009, 06:23 PM

Just want to say thanks for the Honolulu Advertiser and Star Bulletin recipes. They work great!

Much appreciated

tbaac · 02-07-2009, 07:28 PM

Calibre looks to be a fantastic program, Kovidgoyal. Thank you.

kiklop74: Thank you for the recipe for New Statesman. Unfortunately I'm having difficulties with it. Other (built-in) recipes seem to work, but I cannot get your Python script to run. When I click "Download" to start the download, nothing happens.

I tried pasting the contents of the .py file and I tried using "Load recipe from file". I see the code loaded into the edit box but it seems not to do anything.

Any idea what I might be doing wrong? Thank you.

Edit: Having read the "Tips for developing new recipes" section of the manual, I tried running each of the recommended commands from the command line (with the newstatesman.py filename) and it worked perfectly. So I don't quite understand why it won't work within the Calibre GUI. Hmmm.

tbaac · 02-07-2009, 08:11 PM

Okay, I'm not sure what was going wrong. I tried it from the command line and found that closing and reopening Calibre sometimes seemed to help. It works really well now, thank you.

I changed some feeds and ended up with this:

Code:

#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2009, Darko Miletic <darko.miletic at gmail.com>'
'''
newstatesman.com
'''

from calibre.web.feeds.news import BasicNewsRecipe

class NewStatesman(BasicNewsRecipe):
    title                 = 'New Statesman'
    __author__            = 'Darko Miletic'
    description           = "Britain's award-winning current affairs magazine"
    publisher             = 'New Statesman'
    category              = 'news, UK, World'
    oldest_article        = 7
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    encoding              = 'cp1252'
    remove_javascript     = True
    cover_url             = 'http://media.starbulletin.com/designimages/spacer.gif'

    html2lrf_options = [
                          '--comment'       , description
                        , '--base-font-size', '10'
                        , '--category'      , category
                        , '--publisher'     , publisher
                        ]
    
    html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"'
    
    keep_only_tags = [dict(name='div', attrs={'class':'content-main'})]

    remove_tags = [
                    dict(name=['object','link','form','ul'])
                   ,dict(name='ul', attrs={'class':'post-article'})
                   ,dict(name='div' , attrs={'class':['tag-nav-container','article-base']})
                   ,dict(name='div' , attrs={'id':['reader-comments']})                    
                  ]
                        
    feeds = [
              (u'Politics', u'http://www.newstatesman.com/feeds/topics/politics.rss'),
              (u'Arts & Culture', u'http://www.newstatesman.com/feeds/topics/arts-and-culture.rss'),
              (u'Books', u'http://www.newstatesman.com/feeds/topics/books.rss'),
              (u'Life & Society', u'http://www.newstatesman.com/feeds/topics/life-and-society.rss'),
              (u'World Affairs', u'http://www.newstatesman.com/feeds/topics/world-affairs.rss'),
              (u'Columns - Martin Bright', u'http://www.newstatesman.com/feeds/writers/martin_bright.rss'),
              (u'Columns - Kira Cochrane', u'http://www.newstatesman.com/feeds/writers/kira_cochrane.rss'),
              # NOTE: this URL is the same as the World Affairs feed above; probably a copy-paste slip
              (u'Columns - Hunter Davies', u'http://www.newstatesman.com/feeds/topics/world-affairs.rss'),
              (u'Columns - Noreena Hertz', u'http://www.newstatesman.com/feeds/writers/noreena_hertz.rss'),
              (u'Columns - Lindsey Hilsum', u'http://www.newstatesman.com/feeds/writers/lindsey_hilsum.rss'),
              (u'Columns - Darcus Howe', u'http://www.newstatesman.com/feeds/writers/darcus_howe.rss'),
              (u'Columns - Emma John', u'http://www.newstatesman.com/feeds/writers/emma_john.rss'),
              (u'Columns - Sadakat Kadri', u'http://www.newstatesman.com/feeds/writers/sadakat_kadri.rss'),
              (u'Columns - Mark Lynas', u'http://www.newstatesman.com/feeds/writers/mark_lynas.rss'),
              (u'Columns - Kevin Maguire', u'http://www.newstatesman.com/feeds/writers/kevin_maguire.rss'),
              (u'Columns - Rageh Omaar', u'http://www.newstatesman.com/feeds/writers/rageh_omaar.rss'),
              (u'Columns - John Pilger', u'http://www.newstatesman.com/feeds/writers/john_pilger.rss'),
              (u'Columns - Ziauddin Sardar', u'http://www.newstatesman.com/feeds/writers/ziauddin_sardar.rss'),
              (u'Columns - Clive Stafford-Smith', u'http://www.newstatesman.com/feeds/writers/clive_stafford_smith.rss'),
              (u'Columns - Michela Wrong', u'http://www.newstatesman.com/feeds/writers/michela_wrong.rss'),
            ]

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        mtag = '\n<meta http-equiv="Content-Language" content="en"/>\n'
        soup.head.insert(0,mtag)
        return soup

XanthanGum · 02-09-2009, 12:29 PM

Quote:

Originally Posted by kovidgoyal

I'm guessing the documentation is insufficient for what you need?

Kovid,

I did look at the FAQ and the samples provided there a while back. But I think the New York Times example was a bit too complex for me, at least at the time. I will go back, though, and study the examples in more depth. I also plan to print out more of the recipes to compare them to one another and the associated Web sites to try to figure out what each is doing.

I guess what I need to know is:

- When you guys come up with a well-working recipe for a site such as the New York Times or New Statesman, are you looking at the source HTML code from the site? How do you know what tags to remove, for example?

- How do you fetch an entire article from a news site? What code segment does that? For example, I downloaded Ars Technica today to read while at lunch. While reading the Ars Technica articles, I noticed that only a summary for each article is presented. You're told to click on a link to read the rest. I'd like to edit the recipe to see if I could get the rest of those articles. What code in Darko Miletic's New Statesman recipe forces the fetching of entire articles? Would the same code solve the Ars Technica problem or would it have to be changed in some way?

Instead of a workshop, would you or Darko (?) have time to answer such questions as mine above? I understand object-oriented programming languages like Java and C++, and know several of the older procedural languages, so I think I could grasp what I need to know to write more recipes if given some of the basics.

Thanks...

Xanthan Gum

kovidgoyal · 02-09-2009, 01:51 PM

Yes, the tags to remove are deduced from the source HTML.

The simplest way to get the full text of the articles is to use the website's "Print version", if it has one. If it does, you need to figure out how to map the URLs in the RSS feeds to the corresponding print-version URLs. Then encode that logic in the print_version method, which takes a URL and should return the URL of the print version.
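
A minimal sketch of the print_version approach described above. The site, the feed URL and the "/story/" to "/print/" mapping are all made up for illustration; the real mapping has to be worked out by comparing the feed URLs with the print-version URLs of the site in question. The fallback base class is only there so the snippet can be run outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in (assumption) so the sketch runs without calibre installed
    class BasicNewsRecipe(object):
        pass


class ExamplePrintRecipe(BasicNewsRecipe):
    title = 'Example Site (print version)'
    # Hypothetical feed, not a real site
    feeds = [(u'News', u'http://example.com/rss/news.xml')]

    def print_version(self, url):
        # Hypothetical mapping:
        #   http://example.com/story/123.html -> http://example.com/print/123.html
        return url.replace('/story/', '/print/')
```

With this in place, each article URL taken from the feed is passed through print_version before the page is downloaded.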

kiklop74 · 02-09-2009, 02:03 PM

Quote:

Originally Posted by XanthanGum

I guess what I need to know is:

- When you guys come up with a well-working recipe for a site such as the New York Times or New Statesman, are you looking at the source HTML code from the site?

Yes. The quickest way to browse the HTML is to use Firefox with the Firebug plugin.

Quote:

Originally Posted by XanthanGum

How do you know what tags to remove, for example?

That is something you pick up with time.
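
For anyone who would rather start from a list than from a browser inspector, here is a rough sketch that counts the tag/class/id combinations in a page, which can hint at what belongs in keep_only_tags and remove_tags. It uses only the Python standard library, not calibre. The SAMPLE markup is invented; it merely borrows class names from the New Statesman recipe earlier in the thread.

```python
from collections import Counter
from html.parser import HTMLParser


class TagSurvey(HTMLParser):
    """Count (tag, class, id) combinations seen in a page."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = Counter()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.seen[(tag, attrs.get('class'), attrs.get('id'))] += 1


# Invented markup standing in for a saved article page
SAMPLE = '''
<div class="content-main"><p>Article text</p>
  <div class="tag-nav-container">tags</div>
  <div id="reader-comments">comments</div>
</div>
'''

survey = TagSurvey()
survey.feed(SAMPLE)
# Sort by repr so entries with a missing class or id (None) can be ordered
for (tag, cls, ident), count in sorted(survey.seen.items(), key=repr):
    print(tag, cls, ident, count)
```

A wrapper such as the div with class "content-main" is a candidate for keep_only_tags, while navigation and comment containers inside it are candidates for remove_tags.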

Quote:

Originally Posted by XanthanGum

- How do you fetch an entire article from a news site? What code segment does that?

Setting use_embedded_content to False does this.

Code:

use_embedded_content  = False

Quote:

Originally Posted by XanthanGum

Would the same code solve the Ars Technica problem or would it have to be changed in some way?

Yes it would.

What you actually need to do is read the documentation for BasicNewsRecipe and look at the code itself, which is generally well commented.

The rest you can deduce from the multitude of existing recipes. You should start with the simpler ones. The New York Times recipe is one of the more complex ones and is not recommended for beginners.
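
As a starting point, a deliberately small recipe might look like the sketch below. The site, feed URL and the "article-body" class are placeholders, not taken from any real recipe; only the attribute names come from the New Statesman recipe posted earlier. The fallback base class is there only so the snippet can be run outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in (assumption) so the sketch runs without calibre installed
    class BasicNewsRecipe(object):
        pass


class ExampleSimpleRecipe(BasicNewsRecipe):
    title                 = 'Example Site'
    oldest_article        = 7      # days
    max_articles_per_feed = 100
    no_stylesheets        = True
    # Download the page each feed item links to instead of
    # using the summary embedded in the feed itself
    use_embedded_content  = False

    # Placeholder selectors; the real ones come from the site's HTML
    keep_only_tags = [dict(name='div', attrs={'class': 'article-body'})]
    remove_tags    = [dict(name=['object', 'form'])]

    # Hypothetical feed, not a real site
    feeds = [(u'News', u'http://example.com/rss/news.xml')]
```

Note that use_embedded_content = False only fetches the page the feed links to. Articles that a site splits across several pages need something more than this one setting.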

kiklop74 · 02-10-2009, 01:34 PM

New recipe for the Montenegrin newspaper "Pobjeda" (in Serbian)

Supports both LRF and EPUB formats.

malkie13 · 02-10-2009, 08:20 PM

Request: Fanfiction.net

No clue how I'd go about making this work.

Currently I use the online version of FLAG (Fanfiction.net Lightweight Automated Grabber) from http://flag.erayd.net/ to grab stories (multiple chapters at a go) from Fanfiction.net and then manually import them into Calibre.

https://www.mobileread.com/forums/showthread.php?t=26055 has info and downloads on the FLAG program.

What would be ideal, however, would be a custom recipe, based on FLAG that would have an input for the Story ID that could then go about fetching the whole thing (as the stories are split across multiple "chapters" across several pages). Unfortunately, I can't code my way out of a paper sack, and haven't the foggiest idea how to do this sort of thing.

kiklop74 · 02-10-2009, 09:08 PM

I noticed one minor error in the new release of calibre. The recipe "Politika Online" should also go in the Serbian language category.

kovidgoyal · 02-10-2009, 10:19 PM

Quote:

Originally Posted by kiklop74

I noticed one minor error in the new release of calibre. The recipe "Politika Online" should also go in the Serbian language category.

Fixed.

XanthanGum · 02-11-2009, 02:40 PM

Quote:

Originally Posted by kovidgoyal

Yes, the tags to remove are deduced from the source HTML.

The simplest way to get the full text of the articles is to use the website's "Print version", if it has one. If it does, you need to figure out how to map the URLs in the RSS feeds to the corresponding print-version URLs. Then encode that logic in the print_version method, which takes a URL and should return the URL of the print version.

Kovid,

I understand how that works. I remember seeing the BBC example in the FAQ or tutorial. It made sense.

But many sites, like Ars Technica, don't offer that print option; you're forced to advance to the next page to read the rest of the article (when reading with a browser).

I tried kiklop74's suggestion by inserting the line:

use_embedded_content = False

in the recipe. But...it doesn't fetch the rest of the Ars Technica articles.

Any suggestions? (Kovid, Darko)

Xanthan Gum

XanthanGum · 02-11-2009, 02:47 PM

Quote:

Originally Posted by kiklop74

Yes. The quickest way to browse the HTML is to use Firefox with the Firebug plugin.

That is something you pick up with time.

Setting use_embedded_content to False does this.

Code:

use_embedded_content  = False

Yes it would.

What you actually need to do is read the documentation for BasicNewsRecipe and look at the code itself, which is generally well commented.

The rest you can deduce from the multitude of existing recipes. You should start with the simpler ones. The New York Times recipe is one of the more complex ones and is not recommended for beginners.

kiklop74,

Thanks for responding (you and Kovid). Firefox is the browser I use most times. I use Opera for some browsing. I don't think I have the firebug plugin installed so will get that.

When you state "Yes it would.", do you mean that the one line:

Code:

use_embedded_content  = False

will do the trick in the Ars Technica recipe, or do you mean that something extra would have to be added along with that line of code?

As I posted up above in response to Kovid's remarks about the print option, using just the

Code:

use_embedded_content  = False

line made no difference in the Ars Technica recipe.

I will, for sure, look over the documentation for the BasicNewsRecipe and print out a number of the recipes for comparison.

Xanthan Gum

kovidgoyal · 02-11-2009, 03:22 PM

Quote:

Originally Posted by XanthanGum

Kovid,

I understand how that works. I remember seeing the BBC example in the FAQ or tutorial. It made sense.

But many sites, like Ars Technica, don't offer that print option; you're forced to advance to the next page to read the rest of the article (when reading with a browser).

I tried kiklop74's suggestion by inserting the line:

use_embedded_content = False

in the recipe. But...it doesn't fetch the rest of the Ars Technica articles.

Any suggestions? (Kovid, Darko)

Xanthan Gum

Look at the Newsweek recipe; it does this, i.e. it follows the "next" links.
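
One way a recipe can follow links to later pages is through the recursions and match_regexps attributes of BasicNewsRecipe. The sketch below assumes a made-up site whose continuation pages end in "/page/N"; whether the Newsweek recipe does exactly this is not shown in this thread, so treat it as an illustration of the idea, not a copy of that recipe. The fallback base class is only there so the snippet can be run outside calibre.

```python
import re

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in (assumption) so the sketch runs without calibre installed
    class BasicNewsRecipe(object):
        pass


class ExampleMultiPageRecipe(BasicNewsRecipe):
    title                = 'Example Multi-page Site'
    use_embedded_content = False
    # Follow links found on each article page, one level deep
    recursions           = 1
    # Only follow links that look like later pages of an article
    # (hypothetical URL scheme)
    match_regexps        = [r'http://example\.com/articles/\S+/page/\d+']

    # Hypothetical feed, not a real site
    feeds = [(u'News', u'http://example.com/rss/news.xml')]


# Quick check of the pattern against two sample links
pattern = re.compile(ExampleMultiPageRecipe.match_regexps[0])
print(bool(pattern.search('http://example.com/articles/some-story/page/2')))
print(bool(pattern.search('http://example.com/about')))
```

Keeping the pattern narrow matters: a loose regular expression would pull in unrelated pages linked from the article.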

kiklop74 · 02-11-2009, 06:36 PM

The original Ars Technica recipe did have a problem with article length. Here is a completely rewritten recipe that works well. Tested with both LRF and EPUB.
