longform.org (My first recipe, please critique)

Barty · 11-21-2011, 08:41 PM

longform.org is an aggregation and curation site for long general-interest articles on the web. It has a proper feed, but the links point to summaries on its own site, not to the original articles. Maybe there's a simple workaround for this, but I don't know of one, so I wrote a recipe. It's my first recipe and also my first time doing anything with Python, so it's probably extremely naive.

Code:

import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString


class AdvancedUserRecipe1321856301(BasicNewsRecipe):
	title          = u'Longform.org'
	__author__     = 'barty on mobileread.com forum'
	publisher      = 'longform.org'
	category       = 'essay, long form journalism'
	max_articles_per_feed = 100
	oldest_article = 365
	auto_cleanup   = True
	feeds          = [
		(u'Editor\'s Picks', u'http://longform.org/category/editors-pick/feed'),
		(u'More articles', u'http://longform.org/feed')
		]

	def parse_index(self):
		self.cover_url = 'http://longform.org/wp-content/themes/grid_focus_april2011/images/longform_flag.jpg'
		seen_urls = set([])
		totalfeeds = []
		lfeeds = self.get_feeds()
		for feedobj in lfeeds:
			feedtitle, feedurl = feedobj
			articles = []
			soup = self.index_to_soup(feedurl)
			#for atag in soup.findAll(lambda tag: tag.name=='a' and tag.string and tag.string.lower()=='full story'):
			for item in soup.findAll('item'):
				content = item.find('content:encoded')
				if content:
					#m = re.search( r' href="(http://(?<!(long\.fm)).+?)">full story<', content.string, re.I)
					# [^"]+ keeps the match inside a single href; .+? could span an earlier link
					m = re.search( r' href="([^"]+)">full story<', content.contents[0], re.I)
					if m:
						url = m.group(1)
						# skip promotionals
						if url.startswith('http://long.fm') or url in seen_urls:
							continue
						seen_urls.add(url)
						date = item.find('pubdate').contents[0]
						date = date[:16] if date else ''
						#print url
						#print date
						# there is a description tag but it is always truncated so prefer content:encoded
						m = re.search( r'.+?<br\s*/>(.+)\[<a href="http://(www\.)?([^:/]+)', content.contents[0], re.DOTALL|re.I)
						desc = '['+ m.group(3)+'] '+m.group(1) if m else item.description.contents[0]
						#print desc
						articles.append({'title':item.title.contents[0],'url':url,
							'date':date,'description':desc})
			totalfeeds.append((feedtitle, articles))
		return totalfeeds

It mostly works. Articles from Vanity Fair get truncated for some reason. The way I pull the URL out of the content:encoded field is rather ugly (find doesn't work the way I thought/hoped it would). Lastly, when I run in test mode, today's date is added to the book title (which I like), but not when running for real.
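
One alternative to the regular expression is to treat the content:encoded string as an HTML fragment in its own right and look for the anchor whose visible text is "full story". The following is only a sketch using the standard library parser, not anything calibre-specific; the names FullStoryLinkFinder and find_full_story_url are made up for illustration, and the sample markup in the usage note is an assumption about what the feed content looks like.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class FullStoryLinkFinder(HTMLParser):
    """Collect the href of every <a> whose visible text is 'full story'."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            if ''.join(self._text).strip().lower() == 'full story':
                self.urls.append(self._href)
            self._href = None


def find_full_story_url(html):
    """Return the first 'full story' link in an HTML fragment, or None."""
    finder = FullStoryLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.urls[0] if finder.urls else None
```

In parse_index this could be called as find_full_story_url(content.contents[0]) in place of the first re.search, with the long.fm and seen_urls checks left as they are.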

kovidgoyal · 11-21-2011, 11:26 PM

You can use the feeds by implementing get_article_url() in your recipe and returning the proper URL there. You will have issues with sites that have multipage articles; presumably that is the problem with Vanity Fair.
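
A minimal sketch of that suggestion follows. It assumes the article object passed to get_article_url() behaves like a feedparser entry, that is, a dict whose 'content' value is a list of dicts with the HTML under 'value'. The helper name original_article_url is hypothetical, and the regular expression is adapted from the recipe above.

```python
import re

# Matches the link labelled "full story"; [^"]+ keeps the match inside one href.
FULL_STORY_RE = re.compile(r'href="([^"]+)"[^>]*>\s*full story\s*<', re.I)


def original_article_url(article):
    """Return the linked article's URL, falling back to the entry's own link.

    `article` is assumed to be dict-like (as feedparser entries are), with
    the HTML of content:encoded under article['content'][i]['value'].
    """
    for part in article.get('content') or []:
        m = FULL_STORY_RE.search(part.get('value') or '')
        if m:
            return m.group(1)
    return article.get('link')


# Inside the recipe class, the hook would then be a thin wrapper:
#
#     def get_article_url(self, article):
#         return original_article_url(article)
```

Note that calibre appears to consult get_article_url() only while it builds the article list from feeds itself. A recipe that also overrides parse_index() builds that list by hand, so the hook would likely not be called in that case.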

Barty · 11-22-2011, 08:01 PM

Thanks, Kovid. When I override get_article_url(), it is never called.

Regarding Vanity Fair, you're right that it's a split-page problem. They do have a print version. However, downloading the print version causes an error.

Code:

Auto cleanup of URL: u'http://www.vanityfair.com/culture/features/2009/03/godfather200903.print' failed

Full output below

Code:

[C:\Program Files (x86)\Calibre2]ebook-convert longform.recipe .epub --test  --debug-pipeline debug
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
1% Fetching feeds...
1% Got feeds from index page
1% Trying to download cover...
34% Downloading cover from http://longform.org/wp-content/themes/grid_focus_april2011/images/longform_flag.jpg
1% Generating masthead...
Synthesizing mastheadImage
1% Starting download [4 thread(s)]...
9% Article downloaded: u'The End of Borders and the Future of Books'
17% Article downloaded: u'The Sicario: A Ju\xe1rez Hit Man Speaks'
25% Article downloaded: u'The Assassination: The Reporters\u2019 Story'
WARNING: Encoding detection confidence 76%
Auto cleanup of URL: u'http://www.vanityfair.com/culture/features/2009/03/godfather200903.print' failed
34% Article downloaded: u'The Godfather Wars'
34% Feeds downloaded to c:\temp\calibre_0.8.27_tmp_eln8ut\doddxu_plumber\index.html
34% Download finished
Input debug saved to: C:\Program Files (x86)\Calibre2\debug\input
Parsing all content...
Forcing index.html into XHTML namespace
Forcing feed_0/article_0/index.html into XHTML namespace
Parsing file 'feed_0/index.html' as HTML
Forcing feed_0/index.html into XHTML namespace
Parsing file 'feed_1/index.html' as HTML
Forcing feed_1/index.html into XHTML namespace
Forcing feed_1/article_0/index.html into XHTML namespace
Found microsoft markup, cleaning...
Parsing file 'feed_0/article_1/index.html' as HTML
Forcing feed_0/article_1/index.html into XHTML namespace
Stripping comments and meta tags from feed_0/article_1/index.html
File 'feed_0/article_1/index.html' missing <head/> element
File 'feed_0/article_1/index.html' missing <body/> element
Failed to parse content in feed_0/article_1/index.html
Forcing feed_1/article_1/index.html into XHTML namespace
Referenced file 'feed_0/article_1/index.html' not in manifest
Referenced file 'feed_2/index.html' not found
Found microsoft markup, cleaning...
Parsing file 'feed_0/article_1/index.html' as HTML
Forcing feed_0/article_1/index.html into XHTML namespace
Stripping comments and meta tags from feed_0/article_1/index.html
File 'feed_0/article_1/index.html' missing <head/> element
File 'feed_0/article_1/index.html' missing <body/> element
Python function terminated unexpectedly
  list index out of range (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 132, in main
  File "site.py", line 109, in run_entry_point
  File "site-packages\calibre\ebooks\conversion\cli.py", line 287, in main
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 968, in run
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 1114, in create_oebbook
  File "site-packages\calibre\ebooks\oeb\reader.py", line 71, in __call__
  File "site-packages\calibre\ebooks\oeb\reader.py", line 611, in _all_from_opf
  File "site-packages\calibre\ebooks\oeb\reader.py", line 261, in _manifest_from_opf
  File "site-packages\calibre\ebooks\oeb\reader.py", line 185, in _manifest_add_missing
  File "site-packages\calibre\ebooks\oeb\base.py", line 1161, in fget
  File "site-packages\calibre\ebooks\oeb\base.py", line 1032, in _parse_xhtml
IndexError: list index out of range

I have another question: the feed recipe gives me only 15 or so articles even though my limit is set much higher than that. When I use my RSS reader, I can see many more articles going back many months, and I can use "load more articles" to get even more. Can I force it to get more articles?

kovidgoyal · 11-22-2011, 10:25 PM

Your RSS reader is probably an online one that caches old articles. The HTML on that print version page is broken; you will need to preprocess it so that parsing works.
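
As a rough illustration of that kind of preprocessing, the sketch below rebuilds the html/head/body skeleton when it is missing, since the log above complains about missing head and body elements. What is actually wrong with the Vanity Fair print page is not known from this thread, so this is an assumption. The helper name ensure_html_skeleton is hypothetical.

```python
import re


def ensure_html_skeleton(raw_html):
    """Wrap markup in <html>/<head>/<body> when those elements are missing.

    Documents that already contain both a <head> and a <body> are returned
    unchanged. This is a blunt repair meant for fragments, not a general
    HTML fixer.
    """
    has_head = re.search(r'<head[\s>]', raw_html, re.I) is not None
    has_body = re.search(r'<body[\s>]', raw_html, re.I) is not None
    if has_head and has_body:
        return raw_html
    # Drop stray wrapper tags so the rebuilt document has exactly one of each.
    inner = re.sub(r'</?(?:html|head|body)\b[^>]*>', '', raw_html, flags=re.I)
    return '<html><head><title></title></head><body>%s</body></html>' % inner
```

Current calibre versions expose a preprocess_raw_html(self, raw_html, url) hook on BasicNewsRecipe that receives the page source before it is parsed, which would be a natural place to call a helper like this. Whether that hook is available in 0.8.27 is not confirmed here.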

jallan44 · 01-09-2014, 01:49 PM

Apologies for reviving an old thread, but I was wondering if anyone had an updated recipe for Longform?

Thanks

aerodynamik · 01-10-2014, 03:31 PM

Not having taken a look at the recipe: not sure you know this, but you can change article delivery on Longform to go directly to your Kindle. There is a drop-down next to "Suggest a Story".

Wouldn't this be enough?
