Old 11-21-2011, 07:41 PM   #1
Barty
doofus
Posts: 2,520
Karma: 13036221
Join Date: Sep 2010
Device: Kobo Libra 2, Kindle Voyage
longform.org (My first recipe, please critique)

longform.org is a site that aggregates and curates long general-interest articles from around the web. It has a proper feed, but the feed links point to summaries on its own site, not to the original articles. Maybe there's a simple workaround for this, but I don't know of one, so I wrote a recipe. It's my first recipe and also my first time doing anything with Python, so it's probably extremely naive.

Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString


class AdvancedUserRecipe1321856301(BasicNewsRecipe):
	title          = u'Longform.org'
	__author__     = 'barty on mobileread.com forum'
	publisher      = 'longform.org'
	category       = 'essay, long form journalism'
	max_articles_per_feed = 100
	oldest_article = 365
	auto_cleanup   = True
	feeds          = [
		(u'Editor\'s Picks', u'http://longform.org/category/editors-pick/feed'),
		(u'More articles', u'http://longform.org/feed')
		]

	def parse_index(self):
		self.cover_url = 'http://longform.org/wp-content/themes/grid_focus_april2011/images/longform_flag.jpg'
		seen_urls = set()
		totalfeeds = []
		lfeeds = self.get_feeds()
		for feedobj in lfeeds:
			feedtitle, feedurl = feedobj
			articles = []
			soup = self.index_to_soup(feedurl)
			#for atag in soup.findAll(lambda tag: tag.name=='a' and tag.string and tag.string.lower()=='full story'):
			for item in soup.findAll('item'):
				content = item.find('content:encoded')
				if content:
					#m = re.search( r' href="(http://(?<!(long\.fm)).+?)">full story<', content.string, re.I)
					m = re.search( r' href="(.+?)">full story<', content.contents[0], re.I)
					if m:
						url = m.group(1)
						# skip promotionals
						if url.startswith('http://long.fm') or url in seen_urls:
							continue
						seen_urls.add(url)
						date = item.find('pubdate').contents[0]
						date = date[:16] if date else ''
						#print url
						#print date
						# there is a description tag but it is always truncated so prefer content:encoded
						m = re.search( r'.+?<br\s*/>(.+)\[<a href="http://(www\.)?([^:/]+)', content.contents[0], re.DOTALL|re.I)
						desc = '['+ m.group(3)+'] '+m.group(1) if m else item.description.contents[0]
						#print desc
						articles.append({'title':item.title.contents[0],'url':url,
							'date':date,'description':desc})
			totalfeeds.append((feedtitle, articles))
		return totalfeeds
It mostly works. Articles from vanityfair get truncated for some reason. The way I pull the URL out of the content:encoded field is rather ugly (find doesn't work the way I thought/hoped it would). Lastly, when I run in test mode, today's date is added to the book title (which I like), but not when running for real.
Old 11-21-2011, 10:26 PM   #2
kovidgoyal
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You can use the feeds by implementing get_article_url() in your recipe and returning the proper URL there. You will presumably have issues with sites that have multipage articles; that is probably the problem with Vanity Fair.
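
As a rough illustration of the get_article_url() suggestion, here is a minimal sketch. It assumes calibre passes get_article_url() a feedparser-style entry whose `content` list carries the content:encoded HTML, and it reuses the "full story" pattern from the recipe in post #1. The helper name `full_story_url` and the sample entry are hypothetical. Note also that calibre appears to call get_article_url() only when it parses the feeds itself, so a recipe that also overrides parse_index() would likely bypass it.

```python
import re

# Matches the href of the "full story" link embedded in each item's
# content:encoded HTML (same pattern as the recipe in post #1).
FULL_STORY_RE = re.compile(r' href="(.+?)">full story<', re.I)


def full_story_url(entry):
    """Return the original article URL for a feedparser-style entry.

    Falls back to the entry's own link when no "full story" link is found.
    """
    for part in entry.get('content', []):
        m = FULL_STORY_RE.search(part.get('value', ''))
        if m:
            return m.group(1)
    return entry.get('link')


# Inside a recipe this might be wired in roughly as (assumption, untested):
#
#     def get_article_url(self, article):
#         return full_story_url(article)

# Hypothetical entry, shaped like what feedparser produces.
sample = {
    'link': 'http://longform.org/some-summary',
    'content': [{'value': '<p>Summary<br />[<a href="http://example.com/story">Full Story</a>]</p>'}],
}
print(full_story_url(sample))
```

With this approach the `feeds` list can stay as it is and parse_index() can be dropped, at the cost of losing the custom description and the duplicate filtering.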
Old 11-22-2011, 07:01 PM   #3
Barty
doofus
Thanks, Kovid. When I override get_article_url(), it is never called.

Regarding vanityfair, you're right that it's a split-page problem. They do have a print version. However, downloading the print version causes an error.

Code:
Auto cleanup of URL: u'http://www.vanityfair.com/culture/features/2009/03/godfather200903.print' failed
Full output below

Code:
[C:\Program Files (x86)\Calibre2]ebook-convert longform.recipe .epub --test --debug-pipeline debug
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
1% Fetching feeds...
1% Got feeds from index page
1% Trying to download cover...
34% Downloading cover from http://longform.org/wp-content/themes/grid_focus_april2011/images/longform_flag.jpg
1% Generating masthead...
Synthesizing mastheadImage
1% Starting download [4 thread(s)]...
9% Article downloaded: u'The End of Borders and the Future of Books'
17% Article downloaded: u'The Sicario: A Ju\xe1rez Hit Man Speaks'
25% Article downloaded: u'The Assassination: The Reporters\u2019 Story'
WARNING: Encoding detection confidence 76%
Auto cleanup of URL: u'http://www.vanityfair.com/culture/features/2009/03/godfather200903.print' failed
34% Article downloaded: u'The Godfather Wars'
34% Feeds downloaded to c:\temp\calibre_0.8.27_tmp_eln8ut\doddxu_plumber\index.html
34% Download finished
Input debug saved to: C:\Program Files (x86)\Calibre2\debug\input
Parsing all content...
Forcing index.html into XHTML namespace
Forcing feed_0/article_0/index.html into XHTML namespace
Parsing file 'feed_0/index.html' as HTML
Forcing feed_0/index.html into XHTML namespace
Parsing file 'feed_1/index.html' as HTML
Forcing feed_1/index.html into XHTML namespace
Forcing feed_1/article_0/index.html into XHTML namespace
Found microsoft markup, cleaning...
Parsing file 'feed_0/article_1/index.html' as HTML
Forcing feed_0/article_1/index.html into XHTML namespace
Stripping comments and meta tags from feed_0/article_1/index.html
File 'feed_0/article_1/index.html' missing <head/> element
File 'feed_0/article_1/index.html' missing <body/> element
Failed to parse content in feed_0/article_1/index.html
Forcing feed_1/article_1/index.html into XHTML namespace
Referenced file 'feed_0/article_1/index.html' not in manifest
Referenced file 'feed_2/index.html' not found
Found microsoft markup, cleaning...
Parsing file 'feed_0/article_1/index.html' as HTML
Forcing feed_0/article_1/index.html into XHTML namespace
Stripping comments and meta tags from feed_0/article_1/index.html
File 'feed_0/article_1/index.html' missing <head/> element
File 'feed_0/article_1/index.html' missing <body/> element
Python function terminated unexpectedly
  list index out of range (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 132, in main
  File "site.py", line 109, in run_entry_point
  File "site-packages\calibre\ebooks\conversion\cli.py", line 287, in main
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 968, in run
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 1114, in create_oebbook
  File "site-packages\calibre\ebooks\oeb\reader.py", line 71, in __call__
  File "site-packages\calibre\ebooks\oeb\reader.py", line 611, in _all_from_opf
  File "site-packages\calibre\ebooks\oeb\reader.py", line 261, in _manifest_from_opf
  File "site-packages\calibre\ebooks\oeb\reader.py", line 185, in _manifest_add_missing
  File "site-packages\calibre\ebooks\oeb\base.py", line 1161, in fget
  File "site-packages\calibre\ebooks\oeb\base.py", line 1032, in _parse_xhtml
IndexError: list index out of range
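
For reference, the print-version rewrite that produced the URL in the log above could be sketched as follows. This assumes Vanity Fair's print page is simply the article URL with `.print` appended, which is inferred only from the single URL in the log; the un-suffixed example URLs are hypothetical. The sketch is written for Python 3, which current calibre uses, whereas the 0.8.27 build in the log ran Python 2.

```python
from urllib.parse import urlparse


def print_version(url):
    """Map a vanityfair.com article URL to its print page; leave others alone."""
    host = urlparse(url).hostname or ''
    if host == 'vanityfair.com' or host.endswith('.vanityfair.com'):
        if not url.endswith('.print'):
            # Assumption: the print page is the article URL plus '.print'.
            return url + '.print'
    return url


# Inside a recipe the same logic would live in the print_version hook:
#
#     def print_version(self, url):
#         ...

print(print_version('http://www.vanityfair.com/culture/features/2009/03/godfather200903'))
print(print_version('http://example.com/some/article'))
```

Fetching the print page this way does not by itself fix the parse failure shown in the traceback.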
I have another question: the feed recipe gives me only 15 or so articles even though my limit is set much higher than that. When I use my RSS reader, I can see many more articles going back many months, and I can use "load more articles" to get even more. Can I force it to get more articles?

Last edited by Barty; 11-28-2011 at 11:35 AM.
Old 11-22-2011, 09:25 PM   #4
kovidgoyal
creator of calibre
Your RSS reader is probably an online one that caches old articles. The HTML on that print version page is broken; you will need to preprocess it so that parsing works.
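
One possible shape for such preprocessing is sketched below. The thread does not say what exactly is broken on the print page; the log only complains about missing head and body elements, so this sketch assumes the page arrives as a bare fragment and wraps it in a minimal document. The helper name `ensure_html_skeleton` is hypothetical, and hooking it into a recipe through preprocess_raw_html(), as in the comment, is an untested assumption about how current calibre versions would use it.

```python
import re

HAS_BODY = re.compile(r'<body[\s>]', re.I)
STRAY_HTML_TAGS = re.compile(r'</?html(?:\s[^>]*)?>', re.I)


def ensure_html_skeleton(raw_html):
    """Wrap markup that lacks a <body> in a minimal HTML document."""
    if HAS_BODY.search(raw_html):
        # Already has a body element: leave the page untouched.
        return raw_html
    # Drop stray <html> wrappers so they are not nested inside the new one.
    fragment = STRAY_HTML_TAGS.sub('', raw_html)
    return '<html><head><title></title></head><body>%s</body></html>' % fragment


# Inside a recipe this might be applied roughly as (assumption, untested):
#
#     def preprocess_raw_html(self, raw_html, url):
#         return ensure_html_skeleton(raw_html)

print(ensure_html_skeleton('<p>The Godfather Wars</p>'))
```

If the real page is broken in some other way (unclosed tags, stray markup), the repair step would need to target that instead.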
Old 01-09-2014, 12:49 PM   #5
jallan44
Member
Posts: 15
Karma: 10
Join Date: Aug 2012
Device: none
Apologies for reviving an old thread, but I was wondering if anyone has an updated recipe for Longform?

Thanks
Old 01-10-2014, 02:31 PM   #6
aerodynamik
Enthusiast
Posts: 43
Karma: 136
Join Date: Mar 2011
Device: Kindle Paperwhite
I haven't looked at the recipe, and I'm not sure whether you know this, but you can change article delivery on longform to go directly to your Kindle. There is a drop-down next to "Suggest a Story".

Wouldn't this be enough?