11-21-2011, 07:41 PM | #1 |
doofus
Posts: 2,529
Karma: 13088847
Join Date: Sep 2010
Device: Kobo Libra 2, Kindle Voyage
|
longform.org (My first recipe, please critique)
longform.org is an aggregator/curation site for long general-interest articles on the web. It has a proper feed, but the links point to summaries on its own site, not to the original articles. Maybe there's a simple workaround for this, but I don't know of one, so I wrote a recipe. It's my first recipe and also my first time doing anything with Python, so it's probably extremely naive.
Code:
import re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class AdvancedUserRecipe1321856301(BasicNewsRecipe):
    title = u'Longform.org'
    __author__ = 'barty on mobileread.com forum'
    publisher = 'longform.org'
    category = 'essay, long form journalism'
    max_articles_per_feed = 100
    oldest_article = 365
    auto_cleanup = True
    feeds = [
        (u'Editor\'s Picks', u'http://longform.org/category/editors-pick/feed'),
        (u'More articles', u'http://longform.org/feed')
    ]

    def parse_index(self):
        self.cover_url = 'http://longform.org/wp-content/themes/grid_focus_april2011/images/longform_flag.jpg'
        seen_urls = set([])
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            articles = []
            soup = self.index_to_soup(feedurl)
            #for atag in soup.findAll(lambda tag: tag.name=='a' and tag.string and tag.string.lower()=='full story'):
            for item in soup.findAll('item'):
                content = item.find('content:encoded')
                if content:
                    #m = re.search( r' href="(http://(?<!(long\.fm)).+?)">full story<', content.string, re.I)
                    m = re.search( r' href="(.+?)">full story<', content.contents[0], re.I)
                    if m:
                        url = m.group(1)
                        # skip promotionals
                        if url.startswith('http://long.fm') or url in seen_urls:
                            continue
                        seen_urls.add(url)
                        date = item.find('pubdate').contents[0]
                        date = date[:16] if date else ''
                        #print url
                        #print date
                        # there is a description tag but it is always truncated so prefer content:encoded
                        m = re.search( r'.+?<br\s*/>(.+)\[<a href="http://(www\.)?([^:/]+)', content.contents[0], re.DOTALL|re.I)
                        desc = '['+ m.group(3)+'] '+m.group(1) if m else item.description.contents[0]
                        #print desc
                        articles.append({'title':item.title.contents[0],'url':url, 'date':date,'description':desc})
            totalfeeds.append((feedtitle, articles))
        return totalfeeds
|
11-21-2011, 10:26 PM | #2 |
creator of calibre
Posts: 44,287
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You can use the feeds by implementing get_article_url() in your recipe and returning the proper URL there. You will have issues with sites that have multipage articles; presumably that is the problem with Vanity Fair.
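A minimal sketch of what that suggestion could look like, assuming the original article link appears in each feed item's content as a "full story" anchor (as the recipe above expects). The helper name `extract_full_story_url` is hypothetical, and the wiring shown in the trailing comment assumes the `article` passed to `get_article_url()` is a feedparser-style entry; neither is confirmed by this thread.

```python
import re

# Matches one <a href="...">full story</a> anchor. The character classes keep
# the match inside a single tag, so earlier links in the same content are skipped.
FULL_STORY_RE = re.compile(r'href="([^"]+)"[^>]*>\s*full story\s*<', re.I)


def extract_full_story_url(content_html):
    """Return the URL of the 'full story' link, or None if there is none."""
    match = FULL_STORY_RE.search(content_html or '')
    if match is None:
        return None
    url = match.group(1)
    # Skip promotional links, mirroring the check in the recipe above.
    if url.startswith('http://long.fm'):
        return None
    return url


# Hypothetical wiring inside a recipe (not runnable outside calibre, and the
# entry layout is an assumption):
#
#     def get_article_url(self, article):
#         html = article.get('content', [{}])[0].get('value', '')
#         return (extract_full_story_url(html)
#                 or BasicNewsRecipe.get_article_url(self, article))
```

As far as I understand calibre's flow, `get_article_url()` is only consulted when calibre parses the `feeds` list itself, so a recipe that also overrides `parse_index()` would bypass it.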
|
11-22-2011, 07:01 PM | #3 |
doofus
Posts: 2,529
Karma: 13088847
Join Date: Sep 2010
Device: Kobo Libra 2, Kindle Voyage
|
Thanks, Kovid. When I override get_article_url(), it is never called.
Regarding Vanity Fair, you're right that it's a split-page problem. They do have a print version. However, downloading the print version causes an error.
Code:
Auto cleanup of URL: u'http://www.vanityfair.com/culture/features/2009/03/godfather200903.print' failed
Code:
[C:\Program Files (x86)\Calibre2]ebook-convert longform.recipe .epub --test --debug-pipeline debug
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
1% Fetching feeds...
1% Got feeds from index page
1% Trying to download cover...
34% Downloading cover from http://longform.org/wp-content/themes/grid_focus_april2011/images/longform_flag.jpg
1% Generating masthead...
Synthesizing mastheadImage
1% Starting download [4 thread(s)]...
9% Article downloaded: u'The End of Borders and the Future of Books'
17% Article downloaded: u'The Sicario: A Ju\xe1rez Hit Man Speaks'
25% Article downloaded: u'The Assassination: The Reporters\u2019 Story'
WARNING: Encoding detection confidence 76%
Auto cleanup of URL: u'http://www.vanityfair.com/culture/features/2009/03/godfather200903.print' failed
34% Article downloaded: u'The Godfather Wars'
34% Feeds downloaded to c:\temp\calibre_0.8.27_tmp_eln8ut\doddxu_plumber\index.html
34% Download finished
Input debug saved to: C:\Program Files (x86)\Calibre2\debug\input
Parsing all content...
Forcing index.html into XHTML namespace
Forcing feed_0/article_0/index.html into XHTML namespace
Parsing file 'feed_0/index.html' as HTML
Forcing feed_0/index.html into XHTML namespace
Parsing file 'feed_1/index.html' as HTML
Forcing feed_1/index.html into XHTML namespace
Forcing feed_1/article_0/index.html into XHTML namespace
Found microsoft markup, cleaning...
Parsing file 'feed_0/article_1/index.html' as HTML
Forcing feed_0/article_1/index.html into XHTML namespace
Stripping comments and meta tags from feed_0/article_1/index.html
File 'feed_0/article_1/index.html' missing <head/> element
File 'feed_0/article_1/index.html' missing <body/> element
Failed to parse content in feed_0/article_1/index.html
Forcing feed_1/article_1/index.html into XHTML namespace
Referenced file 'feed_0/article_1/index.html' not in manifest
Referenced file 'feed_2/index.html' not found
Found microsoft markup, cleaning...
Parsing file 'feed_0/article_1/index.html' as HTML
Forcing feed_0/article_1/index.html into XHTML namespace
Stripping comments and meta tags from feed_0/article_1/index.html
File 'feed_0/article_1/index.html' missing <head/> element
File 'feed_0/article_1/index.html' missing <body/> element
Python function terminated unexpectedly
  list index out of range (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 132, in main
  File "site.py", line 109, in run_entry_point
  File "site-packages\calibre\ebooks\conversion\cli.py", line 287, in main
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 968, in run
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 1114, in create_oebbook
  File "site-packages\calibre\ebooks\oeb\reader.py", line 71, in __call__
  File "site-packages\calibre\ebooks\oeb\reader.py", line 611, in _all_from_opf
  File "site-packages\calibre\ebooks\oeb\reader.py", line 261, in _manifest_from_opf
  File "site-packages\calibre\ebooks\oeb\reader.py", line 185, in _manifest_add_missing
  File "site-packages\calibre\ebooks\oeb\base.py", line 1161, in fget
  File "site-packages\calibre\ebooks\oeb\base.py", line 1032, in _parse_xhtml
IndexError: list index out of range
Last edited by Barty; 11-28-2011 at 11:35 AM. |
11-22-2011, 09:25 PM | #4 |
creator of calibre
Posts: 44,287
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Your RSS reader is probably an online one that caches old articles. The HTML on that print version page is broken; you will need to preprocess it so that parsing works.
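A rough sketch of one kind of preprocessing, assuming the breakage is the missing `<head/>`/`<body/>` structure reported in the log above; the actual defect on the print page is not identified in this thread. The helper `ensure_html_skeleton` is hypothetical. Current calibre versions provide a `preprocess_raw_html(self, raw_html, url)` hook on `BasicNewsRecipe` where such a helper could be called; whether that hook exists in the 0.8.27 release shown in the log is not something this thread confirms.

```python
import re


def ensure_html_skeleton(raw_html):
    """Wrap markup that lacks <html> or <body> in a minimal HTML document."""
    has_html = re.search(r'<html[\s>]', raw_html, re.I) is not None
    has_body = re.search(r'<body[\s>]', raw_html, re.I) is not None
    if has_html and has_body:
        # Structure looks complete; leave the page untouched.
        return raw_html
    # Drop any existing <head>...</head> block, then any stray structural tags,
    # so the wrapper added below is the only skeleton in the document.
    inner = re.sub(r'<head\b.*?</head>', '', raw_html, flags=re.I | re.S)
    inner = re.sub(r'</?(?:html|head|body)\b[^>]*>', '', inner, flags=re.I)
    return ('<html><head><title></title></head><body>%s</body></html>'
            % inner)


# Hypothetical wiring inside a recipe (not runnable outside calibre):
#
#     def preprocess_raw_html(self, raw_html, url):
#         return ensure_html_skeleton(raw_html)
```

This only addresses missing structure; if the page has other markup problems, a targeted fix (for example via the recipe's `preprocess_regexps` list) may be needed instead.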
|
01-09-2014, 12:49 PM | #5 |
Member
Posts: 15
Karma: 10
Join Date: Aug 2012
Device: none
|
Apologies for reviving an old thread, but I was wondering if anyone had an updated recipe for Longform?
Thanks |
01-10-2014, 02:31 PM | #6 |
Enthusiast
Posts: 43
Karma: 136
Join Date: Mar 2011
Device: Kindle Paperwhite
|
I haven't looked at the recipe, but in case you don't know: you can change article delivery on Longform so that articles go directly to your Kindle. There is a drop-down next to "Suggest a Story".
Wouldn't this be enough? |