#1
doofus
Posts: 2,555
Karma: 13089041
Join Date: Sep 2010
Device: Kobo Libra 2, Kindle Voyage
longform.org (My first recipe, please critique)
longform.org is an aggregator and curation site for long general-interest articles on the web. It has a proper feed, but the links point to summaries on its own site, not to the original articles. Maybe there's a simple workaround for this, but I don't know of one, so I wrote a recipe. It's my first recipe and also my first time doing anything with Python, so it's probably extremely naive.
Code:
import re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString


class AdvancedUserRecipe1321856301(BasicNewsRecipe):
    title = u'Longform.org'
    __author__ = 'barty on mobileread.com forum'
    publisher = 'longform.org'
    category = 'essay, long form journalism'
    max_articles_per_feed = 100
    oldest_article = 365
    auto_cleanup = True
    feeds = [
        (u'Editor\'s Picks', u'http://longform.org/category/editors-pick/feed'),
        (u'More articles', u'http://longform.org/feed')
    ]

    def parse_index(self):
        self.cover_url = 'http://longform.org/wp-content/themes/grid_focus_april2011/images/longform_flag.jpg'
        # URLs already collected, so an article listed in both feeds is only fetched once
        seen_urls = set()
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            articles = []
            soup = self.index_to_soup(feedurl)
            #for atag in soup.findAll(lambda tag: tag.name=='a' and tag.string and tag.string.lower()=='full story'):
            for item in soup.findAll('item'):
                content = item.find('content:encoded')
                if content:
                    #m = re.search( r' href="(http://(?<!(long\.fm)).+?)">full story<', content.string, re.I)
                    # the link to the original article is the anchor labelled "full story"
                    m = re.search(r' href="(.+?)">full story<', content.contents[0], re.I)
                    if m:
                        url = m.group(1)
                        # skip promotionals
                        if url.startswith('http://long.fm') or url in seen_urls:
                            continue
                        seen_urls.add(url)
                        date = item.find('pubdate').contents[0]
                        date = date[:16] if date else ''
                        #print url
                        #print date
                        # there is a description tag but it is always truncated so prefer content:encoded
                        m = re.search(r'.+?<br\s*/>(.+)\[<a href="http://(www\.)?([^:/]+)', content.contents[0], re.DOTALL|re.I)
                        desc = '[' + m.group(3) + '] ' + m.group(1) if m else item.description.contents[0]
                        #print desc
                        articles.append({'title': item.title.contents[0], 'url': url,
                                         'date': date, 'description': desc})
            totalfeeds.append((feedtitle, articles))
        return totalfeeds
#2
creator of calibre
Posts: 45,618
Karma: 28549044
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You can use the feeds by implementing get_article_url() in your recipe and returning the proper URL there. You will have issues with sites that have multipage articles; presumably that is the problem with Vanity Fair.
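A minimal sketch of that suggestion, assuming the recipe relies on `feeds` and does not define its own parse_index(). `FullStoryUrlMixin` and `FULL_STORY_RE` are hypothetical names, and the sketch assumes the `article` argument behaves like a feedparser entry, with the `content:encoded` body available as `article['content'][0]['value']`. The mixin carries no calibre import so the logic can be tried on its own.

```python
import re

# Assumption: the link to the original article is the anchor whose text is
# "full story", the same marker the recipe's own regex looks for.
FULL_STORY_RE = re.compile(r'href="([^"]+)"[^>]*>\s*full story\s*<', re.I)


class FullStoryUrlMixin(object):
    """Hypothetical mixin; combine it with BasicNewsRecipe in a real recipe."""

    def get_article_url(self, article):
        # Assumption: `article` is dict-like, as feedparser entries are.
        for part in article.get('content') or []:
            m = FULL_STORY_RE.search(part.get('value') or '')
            # Skip promotional long.fm links, as the original recipe does.
            if m and not m.group(1).startswith('http://long.fm'):
                return m.group(1)
        # No usable link found: fall back to the feed's own link,
        # which is the summary page on longform.org.
        return article.get('link')
```

In a recipe this would be used as `class Longform(FullStoryUrlMixin, BasicNewsRecipe)`, keeping `feeds` and dropping the custom parse_index().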
#3
doofus
Posts: 2,555
Karma: 13089041
Join Date: Sep 2010
Device: Kobo Libra 2, Kindle Voyage
Thanks, Kovid. When I override get_article_url(), it is never called.
Regarding Vanity Fair, you're right that it's a split-page problem. They do have a print version. However, downloading the print version causes an error.
Code:
Auto cleanup of URL: u'http://www.vanityfair.com/culture/features/2009/03/godfather200903.print' failed
Code:
[C:\Program Files (x86)\Calibre2]ebook-convert longform.recipe .epub --test --debug-pipeline debug
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
1% Fetching feeds...
1% Got feeds from index page
1% Trying to download cover...
34% Downloading cover from http://longform.org/wp-content/themes/grid_focus_april2011/images/longform_flag.jpg
1% Generating masthead...
Synthesizing mastheadImage
1% Starting download [4 thread(s)]...
9% Article downloaded: u'The End of Borders and the Future of Books'
17% Article downloaded: u'The Sicario: A Ju\xe1rez Hit Man Speaks'
25% Article downloaded: u'The Assassination: The Reporters\u2019 Story'
WARNING: Encoding detection confidence 76%
Auto cleanup of URL: u'http://www.vanityfair.com/culture/features/2009/03/godfather200903.print' failed
34% Article downloaded: u'The Godfather Wars'
34% Feeds downloaded to c:\temp\calibre_0.8.27_tmp_eln8ut\doddxu_plumber\index.html
34% Download finished
Input debug saved to: C:\Program Files (x86)\Calibre2\debug\input
Parsing all content...
Forcing index.html into XHTML namespace
Forcing feed_0/article_0/index.html into XHTML namespace
Parsing file 'feed_0/index.html' as HTML
Forcing feed_0/index.html into XHTML namespace
Parsing file 'feed_1/index.html' as HTML
Forcing feed_1/index.html into XHTML namespace
Forcing feed_1/article_0/index.html into XHTML namespace
Found microsoft markup, cleaning...
Parsing file 'feed_0/article_1/index.html' as HTML
Forcing feed_0/article_1/index.html into XHTML namespace
Stripping comments and meta tags from feed_0/article_1/index.html
File 'feed_0/article_1/index.html' missing <head/> element
File 'feed_0/article_1/index.html' missing <body/> element
Failed to parse content in feed_0/article_1/index.html
Forcing feed_1/article_1/index.html into XHTML namespace
Referenced file 'feed_0/article_1/index.html' not in manifest
Referenced file 'feed_2/index.html' not found
Found microsoft markup, cleaning...
Parsing file 'feed_0/article_1/index.html' as HTML
Forcing feed_0/article_1/index.html into XHTML namespace
Stripping comments and meta tags from feed_0/article_1/index.html
File 'feed_0/article_1/index.html' missing <head/> element
File 'feed_0/article_1/index.html' missing <body/> element
Python function terminated unexpectedly
list index out of range (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 132, in main
  File "site.py", line 109, in run_entry_point
  File "site-packages\calibre\ebooks\conversion\cli.py", line 287, in main
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 968, in run
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 1114, in create_oebbook
  File "site-packages\calibre\ebooks\oeb\reader.py", line 71, in __call__
  File "site-packages\calibre\ebooks\oeb\reader.py", line 611, in _all_from_opf
  File "site-packages\calibre\ebooks\oeb\reader.py", line 261, in _manifest_from_opf
  File "site-packages\calibre\ebooks\oeb\reader.py", line 185, in _manifest_add_missing
  File "site-packages\calibre\ebooks\oeb\base.py", line 1161, in fget
  File "site-packages\calibre\ebooks\oeb\base.py", line 1032, in _parse_xhtml
IndexError: list index out of range
Last edited by Barty; 11-28-2011 at 12:35 PM.
#4
creator of calibre
Posts: 45,618
Karma: 28549044
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Your RSS reader is probably an online one that caches old articles. The HTML on that print version page is broken; you will need to preprocess it so that parsing works.
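A rough sketch of what such preprocessing could look like. The rules are written in the same shape as the recipe attribute preprocess_regexps, a list of (compiled pattern, function taking the match) pairs. Which markup actually breaks the Vanity Fair print page is not stated in this thread, so the two rules below (HTML comments and Office-style `<o:p>` tags) are assumptions chosen only for illustration, and `apply_rules` is a hypothetical helper for trying the rules outside calibre.

```python
import re

# Same shape as a recipe's preprocess_regexps attribute:
# each entry is (compiled pattern, callable taking the match object).
# Assumption: these two rules are illustrative guesses, not the known fix.
preprocess_regexps = [
    (re.compile(r'<!--.*?-->', re.DOTALL), lambda m: ''),   # drop HTML comments
    (re.compile(r'</?o:p[^>]*>', re.I), lambda m: ''),      # drop Office <o:p> tags
]


def apply_rules(raw_html, rules=preprocess_regexps):
    """Hypothetical helper: apply each (pattern, function) rule in order."""
    for pattern, func in rules:
        raw_html = pattern.sub(func, raw_html)
    return raw_html
```

Running `apply_rules()` on a saved copy of the page makes it easy to check each rule before moving the list into the recipe class.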
#5
Member
Posts: 15
Karma: 10
Join Date: Aug 2012
Device: none
Apologies for reviving an old thread, but I was wondering if anyone has an updated recipe for Longform?
Thanks
#6
Enthusiast
Posts: 43
Karma: 136
Join Date: Mar 2011
Device: Kindle Paperwhite
I haven't looked at the recipe, and I'm not sure whether you know this, but you can change article delivery on Longform so that articles go directly to your Kindle. There is a drop-down next to "Suggest a Story".
Wouldn't this be enough?