Custom recipes (archive, read-only) - Page 74

Krittika Goyal · 01-07-2010, 11:19 PM

@lorenzov
Kovid created a wiki page
http://bugs.calibre-ebook.com/wiki/RecipeTips
that can be used to provide useful tips for recipes. Right now it is almost empty. I would like to help you build up this page.

wdrwc · 01-08-2010, 07:48 AM

I am trying to write a recipe for gazeta.pl. I am testing it on one of their feeds:
http://serwisy.gazeta.pl/pub/rss/fb-technologie.xml

I wrote a very simple custom recipe that should use the printable version of the articles. However, when I test the recipe with ebook-convert, the articles are not downloaded. ebook-convert reports that it cannot fetch them, but the URLs generated by print_version() open without any problem in the browser.

Here is part of the output from running ebook-convert with -vv:

Code:

Downloading
Fetching http://technologie.gazeta.pl/technologie/2029020,82008,7432357.html
Could not fetch link http://technologie.gazeta.pl/technologie/2029020,82008,7432357.html
Traceback (most recent call last):
  File "site-packages\calibre\web\fetch\simple.py", line 401, in process_links
  File "site-packages\calibre\web\fetch\simple.py", line 208, in fetch_url
FetchError: Not Found

http://technologie.gazeta.pl/technologie/2029020,82008,7432357.html saved to 
Downloading
Fetching http://technologie.gazeta.pl/technologie/2029020,82008,7432282.html
Failed to download article: Korzystasz z Windows i Adobe Readera? Szykuj się na łatanie... from http://technologie.gazeta.pl/technologie/1,82008,7432357,Korzystasz_z_Windows_i_Adobe_Readera__Szykuj_sie_na.html
Traceback (most recent call last):
  File "site-packages\calibre\utils\threadpool.py", line 95, in run
  File "site-packages\calibre\web\feeds\news.py", line 703, in fetch_article
  File "site-packages\calibre\web\feeds\news.py", line 699, in _fetch_article
Exception: Could not fetch article. Run with -vv to see the reason



2% Article download failed: u'Korzystasz z Windows i Adobe Readera? Szykuj si\u0119 na \u0142atanie...'
Could not fetch link http://technologie.gazeta.pl/technologie/2029020,82008,7432282.html
Traceback (most recent call last):
  File "site-packages\calibre\web\fetch\simple.py", line 401, in process_links
  File "site-packages\calibre\web\fetch\simple.py", line 208, in fetch_url
FetchError: Not Found

http://technologie.gazeta.pl/technologie/2029020,82008,7432282.html saved to

And here is the recipe:

Code:

#!/usr/bin/env python
'''
technologie.gazeta.pl
'''
from calibre.web.feeds.news import BasicNewsRecipe


class TechnologieGazeta(BasicNewsRecipe):
    title          = u'TechnologieGazeta'
    description    = 'Wiadomości z technologie.gazeta.pl'
    language = 'pl'
    encoding = 'iso-8859-2'
    no_stylesheets = True
    remove_javascript = True
    max_articles_per_feed = 50
    simultaneous_downloads = 1

    feeds          = [
                      ('Wiadomosci Technologie gazeta.pl', 'http://serwisy.gazeta.pl/pub/rss/fb-technologie.xml'),
                    ]

    def print_version(self, url):
        # Feed URLs end in '1,<section>,<id>,<title>.html'; split off the last path segment.
        start, sep, rest = url.rpartition('/')
        # Drop the trailing ',<title>.html' part, keeping '1,<section>,<id>'.
        numbers, sep, tytul = rest.rpartition(',')
        # Swap the leading '1,' for the print-view prefix.
        printversion = numbers.replace('1,','2029020,',1)
        print( numbers,'  ',printversion)  # debug output
        return start + '/' + printversion + '.html'

I would appreciate any help or suggestion.

Thanks,
wdrwc
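
One way to rule out the URL rewriting itself is to run the same string operations outside calibre. The sketch below copies the logic of print_version() into a standalone function (the name to_print_url is made up for this example) and feeds it the article URL from the log above. It produces the same URL that ebook-convert reports as "Not Found", which suggests the rewriting behaves as written and the problem is on the fetch side; what the server actually objects to is not established here.

```python
def to_print_url(url):
    """Standalone copy of the recipe's print_version() logic (hypothetical helper name)."""
    start, _, rest = url.rpartition('/')
    numbers, _, _title = rest.rpartition(',')
    print_numbers = numbers.replace('1,', '2029020,', 1)
    return start + '/' + print_numbers + '.html'


# Article URL taken from the "Failed to download article" line in the log.
feed_url = ('http://technologie.gazeta.pl/technologie/'
            '1,82008,7432357,Korzystasz_z_Windows_i_Adobe_Readera__Szykuj_sie_na.html')
print_url = to_print_url(feed_url)
print(print_url)
# http://technologie.gazeta.pl/technologie/2029020,82008,7432357.html
```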

cypherslock · 01-08-2010, 10:22 AM

I know I can subscribe to it via amazon, but as it is just the website content anyway, a custom recipe for Escapist Magazine would be awesome (http://www.escapistmagazine.com/).

nickredding · 01-08-2010, 01:46 PM

I'm writing a recipe to get the free parts of the Wall Street Journal. I'm getting "article download failed" for every article url, even though I can get to all of the urls in a browser. The urls all look like http://online.wsj.com/article/SB1000...n_AboveLEFTTop. Does anyone know why Calibre would be unable to download these pages?

evanmaastrigt · 01-08-2010, 03:48 PM

Quote:

Originally Posted by wdrwc

I am trying to write a recipe for gazeta.pl. I am testing it on one of their feeds:
http://serwisy.gazeta.pl/pub/rss/fb-technologie.xml

I wrote a very simple custom recipe that should use the printable version of the articles...

Their print version is hard to get at, but I think it can be done (calibre knows some nice tricks too).

But the easy strategy is to forget the print version and just use the article from the feed. Their HTML seems to be valid, so you could use the keep_only_tags and remove_tags properties to get rid of unwanted content. There is also the preprocess_html() method to refine the result even further.

If you have further questions feel free to post them.
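
As a rough illustration of that feed-only approach, here is a minimal sketch. keep_only_tags, remove_tags and preprocess_html() are the recipe hooks named above; the class name and every id or class value in the tag selectors are placeholders, not taken from the real gazeta.pl markup, so they would have to be replaced after looking at the page source. The try/except around the import is only there so the sketch can be loaded outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be imported outside calibre.
    class BasicNewsRecipe(object):
        pass


class TechnologieGazetaFeedOnly(BasicNewsRecipe):
    title = u'TechnologieGazeta (feed articles)'
    language = 'pl'
    encoding = 'iso-8859-2'
    no_stylesheets = True

    feeds = [
        (u'Wiadomosci Technologie gazeta.pl',
         'http://serwisy.gazeta.pl/pub/rss/fb-technologie.xml'),
    ]

    # ASSUMPTION: 'article' is a placeholder id for the article container.
    keep_only_tags = [dict(name='div', attrs={'id': 'article'})]

    # ASSUMPTION: placeholder class names for blocks to throw away.
    remove_tags = [
        dict(name='div', attrs={'class': 'related'}),
        dict(name='div', attrs={'class': 'comments'}),
    ]

    def preprocess_html(self, soup):
        # Receives the parsed page; as an example, strip inline style attributes.
        for tag in soup.findAll(style=True):
            del tag['style']
        return soup
```

No print_version() is defined, so the article URLs from the feed are used as they are.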

bamasteve · 01-09-2010, 12:44 AM

Lorenzo,

Thanks so much for the recipe. Very nice of you. I look forward to learning how to write my own. My new nook should arrive in a couple of weeks... they are backordered.

Krittika Goyal · 01-09-2010, 02:47 AM

Quote:

Originally Posted by nickredding

I'm writing a recipe to get the free parts of the Wall Street Journal. I'm getting "article download failed" for every article url, even though I can get to all of the urls in a browser. The urls all look like http://online.wsj.com/article/SB1000...n_AboveLEFTTop. Does anyone know why Calibre would be unable to download these pages?

If you send me your recipe I can take a look at it and see if I can figure something out.

lorenzov · 01-09-2010, 12:04 PM

try the attached one; obviously i have not included the videos and the forum posts, but as i was playing around with the fetching of various print versions of the feeds, it should do the job!

a question for the experts in the forum:

is it possible to avoid repetition of articles? sometimes in different feeds (especially from blogs) it is possible to find duplicate articles. i'm trying to figure out if it is possible to prune duplicates after the fetch process

thanks!

lorenzo

kovidgoyal · 01-09-2010, 01:02 PM

@lorenzov: Not easily. The reason I haven't implemented it is that it's usually a good idea to leave the duplicates in there, as a user might only read a single section.
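
For recipes that assemble their own index, one possibility is to filter duplicates before anything is downloaded rather than after the fetch. The sketch below assumes the list-of-sections shape that a recipe's parse_index() returns, that is, pairs of (section title, list of article dicts with 'title' and 'url'). The helper name drop_duplicate_articles and the sample URLs are invented for illustration. It keeps only the first occurrence of each URL, which is the trade-off described above: someone who reads only a later section will not find the article there.

```python
def drop_duplicate_articles(sections):
    """Keep the first occurrence of each article URL across all sections."""
    seen = set()
    result = []
    for section_title, articles in sections:
        unique = []
        for article in articles:
            url = article['url']
            if url in seen:
                continue  # already listed in an earlier section
            seen.add(url)
            unique.append(article)
        if unique:  # skip sections that end up empty
            result.append((section_title, unique))
    return result


# Made-up sample data in the parse_index() shape.
sections = [
    ('News', [{'title': 'A', 'url': 'http://example.com/a'},
              {'title': 'B', 'url': 'http://example.com/b'}]),
    ('Blogs', [{'title': 'A again', 'url': 'http://example.com/a'},
               {'title': 'C', 'url': 'http://example.com/c'}]),
]
deduped = drop_duplicate_articles(sections)
for title, articles in deduped:
    print(title, [a['title'] for a in articles])
# News ['A', 'B']
# Blogs ['C']
```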

davotibarna · 01-09-2010, 01:40 PM

New Hungarian technical news recipe:

Code:

from calibre.web.feeds.news import BasicNewsRecipe


class SGhu(BasicNewsRecipe):
    title          = u'SG.hu'
    __author__     = 'davotibarna'
    description    = 'Informatika és Tudomány'
    language = 'hu'
    oldest_article = 5
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'ISO-8859-2'

    feeds          = [(u'SG.hu', u'http://www.sg.hu/plain/rss.xml')]

    def print_version(self, url):
        # Rewrite the article path ('cikkek/...') to the printer-friendly page.
        return url.replace('cikkek/', 'printer.php?cid=')

rjack · 01-09-2010, 07:00 PM

Folks,

I'm working on a solution for Dallas Morning News...

http://www.dallasnews.com/newskiosk/...latestnews.xml

There is a lot of "extra text" above and below the main article if I just include all the newsfeeds I want.

Regards,

Robert

Krittika Goyal · 01-09-2010, 08:27 PM

@wdrwc
Kovid looked at your recipe and says the working recipe will be included in the next calibre release.

Krittika Goyal · 01-09-2010, 08:30 PM

Quote:

Originally Posted by rjack

Folks,

I'm working on a solution for Dallas Morning News...

http://www.dallasnews.com/newskiosk/...latestnews.xml

There is a lot of "extra text" above and below the main article if I just include all the newsfeeds I want.

Regards,

Robert

The recipe will be included in the next calibre release.

rjack · 01-09-2010, 09:47 PM

Krittika,

That is excellent...

1) I got a lot of good information by just pasting in all the newsfeeds I wanted...
2) I have reduced the amount of "garbage" by using the following tag, but it takes a long time to run since I really don't know what I'm doing...

remove_tags_after = [dict(id='article_tools_bottom')]

3) I'm attaching my complete script. Maybe you can use it for your Dallas Morning News testing.

Thanks,

Robert Jackson

Krittika Goyal · 01-10-2010, 12:22 AM

@rjack:
You are definitely on the right track. With a few more remove tags commands and a no_stylesheets command you should be fine. I am attaching a text file with the additional commands you need. Let me know if it works for you.
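
The attachment is not part of this archive. As a rough sketch of the kind of settings being discussed: the remove_tags_after line is the one from rjack's post, while the class name, the title and every entry in remove_tags are placeholders that would need to be replaced with the real element ids and classes from the Dallas Morning News pages. The feed list is left empty because the only feed URL in the thread is truncated.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be imported outside calibre.
    class BasicNewsRecipe(object):
        pass


class DallasNewsSketch(BasicNewsRecipe):
    title = u'Dallas Morning News (sketch)'

    # Skip the site's stylesheets entirely.
    no_stylesheets = True

    # From rjack's post: drop everything that follows this element.
    remove_tags_after = [dict(id='article_tools_bottom')]

    # ASSUMPTION: placeholder selectors for page furniture above the article.
    remove_tags = [
        dict(name='div', attrs={'id': 'header'}),
        dict(name='div', attrs={'class': 'sidebar'}),
    ]

    # Feed URLs go here as (title, url) pairs.
    feeds = []
```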



