Custom recipes (archive, read-only) - Page 129

Starson17 · 05-14-2010, 02:16 PM

Quote:

Originally Posted by mwheinz

Yeah - I've been trying to traverse the soup with this:

Code:

    def preprocess_html(self, soup):
        # Dump each top-level child of <body> between markers for inspection.
        for item in soup.body:
            print('MHEINZ: [[[')
            print(item)
            print(']]] MHEINZ\n\n')
        return soup

I usually just do this:

Code:

    def preprocess_html(self, soup):
        # Print the whole parsed document, then pass it through unchanged.
        print('The soup is: %s' % soup)
        return soup

The purpose is to just see the html and pick out what I want to remove.

Quote:

Overall, though, it looks like soup is parsing to a particular depth and then stopping - it looks like there's a vast blob of html that it is treating as a blob of text.

That's why I suggested using preprocess_regexps. You can pick any chunk of the "vast blob" out and discard it. BeautifulSoup does a great job of handling malformed HTML, but it's not perfect. Trying to discard junk based on tags presumes that the part you want to discard can be identified by tags. If it can't, you can use regexp-based methods to match the start and end of the text blob you want to remove, whether or not that blob is marked with tags.
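As a rough illustration of matching the start and end of a blob, here is a minimal sketch using only Python's re module. The page content and the "begin junk" / "end junk" markers are made up for the example; on a real site you would pick whatever literal text reliably brackets the part you want gone.

```python
import re

# Hypothetical page; the comment markers and text are made up for illustration.
html = (
    '<body><p>Article text</p>'
    '<!-- begin junk -->a vast blob <b>of html\n'
    'that the parser treats as text<!-- end junk -->'
    '<p>More article text</p></body>'
)

# Match from the start marker to the nearest end marker.
# re.DOTALL lets '.' cross newlines; '.*?' stops at the first end marker.
junk = re.compile(r'<!-- begin junk -->.*?<!-- end junk -->', re.DOTALL)

# In a recipe, a (pattern, function) pair like this would be one entry
# in the preprocess_regexps list.
rule = (junk, lambda match: '')

cleaned = rule[0].sub(rule[1], html)
print(cleaned)
```

The non-greedy `.*?` matters here: with a greedy `.*`, a page containing two such blobs would lose everything between the first start marker and the last end marker.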

sdow1 · 05-14-2010, 08:39 PM

I just wanted to jump in and thank folks for trying with the whole prospect thing. This is well above my computer language skills (which are limited to html/css), and I appreciate the effort.

Didn't realize what a can of worms I was opening though!

gambarini · 05-16-2010, 05:40 AM

new recipe:
www.libero-news.it

Italian daily newspaper

older recipes:
L'Espresso
Italian weekly news
-- better viewing, now all feeds work, and 2 new feeds.
La Repubblica
-- better viewing, now all feeds work, more efficient remove policy
Le Scienze
-- better viewing, new feed

yamadharma · 05-17-2010, 03:57 AM

When Calibre fetches Instapaper, a file is generated and transferred successfully, but it has no content. The size of the file is 0.0 MB.
I think the Instapaper API changed.

kiklop74 · 05-17-2010, 10:06 AM

Updated recipe for instapaper.com:

pablofunes · 05-17-2010, 11:37 AM

Hi Kovid & Calibre community,

I've repaired the "New York Review of Books" recipe - one of Calibre's core recipes. It was missing all the article titles because of a change in the nybooks.com HTML markup.

Where should I submit the patch?

Regards,

Pablo Funes

PS: The patch is very simple. Where it says

keep_only_tags = [dict(id='article-body')]

It should be instead,

keep_only_tags = [dict(id=['article-body','page-title'])]

Quote:

Originally Posted by kovidgoyal

Since there have been a lot of custom recipe requests of late, I'm starting a sticky where they can be aggregated. Post requests for custom recipes here. Once you have a custom recipe that works well for you (please test both the LRF and EPUB versions), let me know and I'll include it into calibre so others can benefit from it as well.

kovidgoyal · 05-17-2010, 11:47 AM

@pablofunes: Thanks, I've applied your change.

gambarini · 05-17-2010, 01:46 PM

infomotori

Italian Car and Motorcycle News

mwheinz · 05-17-2010, 05:10 PM

American Prospect Recipe

sdow1 - try this recipe. It's very simple, strips out all formatting at the moment.

Code:

import re
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1273850169(BasicNewsRecipe):
    title          = u'American Prospect'
    oldest_article = 7
    max_articles_per_feed = 100
    recursions = 0
    no_stylesheets = True
    remove_javascript = True

    # Keep only paragraphs and images; all other formatting is dropped.
    keep_only_tags = [dict(name=['p','img'])]

    # Substitutions run in order on the downloaded HTML.
    preprocess_regexps = [
        # Strip carriage returns.
        (re.compile('\r'),lambda match: ''),
        # Reduce <head> to just the <title> element.
        (re.compile(r'<head.*?<title>', re.DOTALL|re.IGNORECASE), lambda match: '<head><title>'),
        (re.compile(r'</title>.*?</head>', re.DOTALL|re.IGNORECASE), lambda match: '</title></head>'),
        # Drop everything between <body> and the article container.
        (re.compile(r'<body.*?<div class="pad_10L10R">', re.DOTALL|re.IGNORECASE), lambda match: '<body><div>'),
        # Drop everything from the first </div> to the end of the body.
        (re.compile(r'</div>.*</body>', re.DOTALL|re.IGNORECASE), lambda match: '</div></body>'),
    ]

    feeds       = [(u'Articles', u'feed://www.prospect.org/articles_rss.jsp')]
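To see what those five substitutions do to a page, here is a small standalone sketch. It assumes the rules are applied one after another in list order with ordinary re.sub semantics, which is my understanding of how preprocess_regexps is treated; the sample page and the apply_rules helper are made up for the example and are not part of calibre.

```python
import re

FLAGS = re.DOTALL | re.IGNORECASE

# The same five rules as in the recipe above.
rules = [
    (re.compile('\r'), lambda match: ''),
    (re.compile(r'<head.*?<title>', FLAGS), lambda match: '<head><title>'),
    (re.compile(r'</title>.*?</head>', FLAGS), lambda match: '</title></head>'),
    (re.compile(r'<body.*?<div class="pad_10L10R">', FLAGS),
     lambda match: '<body><div>'),
    (re.compile(r'</div>.*</body>', FLAGS), lambda match: '</div></body>'),
]

def apply_rules(text, rules):
    # Hypothetical helper: run each (pattern, function) pair in order.
    for pattern, func in rules:
        text = pattern.sub(func, text)
    return text

# Made-up page with the same general shape as the target site.
page = (
    '<html><head><meta name="x"><title>Story</title><link rel="y"></head>\r\n'
    '<body class="z"><div id="nav">menu</div>'
    '<div class="pad_10L10R"><p>First paragraph.</p></div>'
    '<div id="footer">junk</div></body></html>'
)

result = apply_rules(page, rules)
print(result)
```

Note that the last pattern is greedy, so everything from the first closing div to the end of the body is thrown away. That works when the article text sits in a single div, but nested divs inside the article would cut it short.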

sdow1 · 05-18-2010, 07:38 AM

mwheinz:

That looks like it works!

Thanks so much for the help

sdow1 · 05-18-2010, 12:47 PM

Quote:

Originally Posted by mwheinz

American Prospect Recipe

sdow1 - try this recipe. It's very simple, strips out all formatting at the moment.

Looking at this further, the only thing I'd change for now is the oldest_article limit (to 30), since TAP is a monthly magazine. I can do this myself on my copy, but just wanted to put it out there for anyone else.

mwheinz · 05-18-2010, 01:20 PM

@Sdow1 - thanks for the tip, I don't normally read AP.

@everybody Here's a bundle of 3 "political" recipes - the American Prospect, Factcheck and Politifact.

mlstein · 05-18-2010, 06:40 PM

http://www.tomdispatch.com/

I can't figure out how to get through feedburner to the google feed to the actual articles...

mwheinz · 05-18-2010, 08:25 PM

mlstein,

Try this:

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class TomDispatch(BasicNewsRecipe):
    title          = u'TomDispatch'
    __author__     = u'Michael Heinz'
    oldest_article = 21
    max_articles_per_feed = 100
    recursions = 2
    use_embedded_content = False
    no_stylesheets = True

    publication_type = 'magazine'
    masthead_url = 'http://www.tomdispatch.com/application/images/site/tomdispatch_logo_v1.gif'
    cover_url = 'http://www.tomdispatch.com/application/images/site/tomdispatch_logo_v1.gif'

    # Drop the sidebar from each article page.
    remove_tags = [
                     dict(name='div', attrs={'id':'postSideBar'}),
                  ]

    # Keep only the main article container.
    keep_only_tags = [dict(name='div', attrs={'id':'mainWide'})]

    feeds = [
              (u'Articles', u'feed://feeds.feedburner.com/tomdispatch/esUU'),
            ]

    def get_article_url(self, article):
        # Use the original article URL that FeedBurner stores with each entry.
        return article.get('feedburner_origlink', None)
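The key part is get_article_url. A minimal standalone sketch of the idea follows, assuming the parsed feed entry behaves like a dict and exposes FeedBurner's original link under the key feedburner_origlink, as the recipe above relies on. The sample entries and URLs are made up, and the fallback to the plain link is my own addition, not something the recipe does.

```python
def get_article_url(article):
    # Prefer the original URL recorded by FeedBurner; if the entry has
    # none, fall back to the entry's own link (an assumption added here).
    return article.get('feedburner_origlink') or article.get('link')

# Hypothetical entries: plain dicts standing in for parsed feed items.
proxied = {
    'link': 'http://feedproxy.example.com/~r/feed/~3/abc/',
    'feedburner_origlink': 'http://www.example.com/post/1/',
}
direct = {'link': 'http://www.example.com/post/2/'}

print(get_article_url(proxied))
print(get_article_url(direct))
```

With the fallback, the same method keeps working if the feed is ever served without going through FeedBurner.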

hito1 · 05-18-2010, 08:41 PM

I'm new here, so I'm sorry if I'm not doing this right.

I couldn't find any recipe for Proceedings or Naval History magazines; they both have a free section that requires registration:

http://www.usni.org/magazines/proceedings/index.asp

http://www.usni.org/magazines/navalhistory/index.asp

Thanks a lot.

-----------
Besides that request, I'd like to say thanks for the Economist (free) and Foreign Affairs (subscription) recipes; both worked pretty well on my Kindle.
