11-12-2018, 09:10 AM | #1 |
Member
Posts: 22
Karma: 10
Join Date: Aug 2015
Device: Kobo Aura H2O
|
Reddit feed with comments
Hello, I thought I could set up a Reddit feed to get the top results for the past week for a key phrase. I used the basic news feature in calibre to fetch the feed. It picks up the original post but doesn't capture the other users' comments. Any tips on what I should change?
I've also put the RSS feed through Feedburner (http://feeds.feedburner.com/Redditco...esults-Testing), but that makes no difference. Thanks. Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1542031690(BasicNewsRecipe):
    title = 'Reddit testing'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup = False

    feeds = [
        ('Reddit testing', 'https://www.reddit.com/search.xml?q=testing&sort=top&t=week'),
    ]

Last edited by Phoebus; 11-12-2018 at 11:19 AM. |
11-12-2018, 11:00 PM | #2 |
creator of calibre
Posts: 43,748
Karma: 22446736
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Does the RSS feed actually include the comments? If not, you would need to get your recipe to scrape the actual Reddit website.
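One way to answer that question is to look at what each feed entry carries. The following is only a minimal sketch, assuming the feed is Atom-formatted XML. SAMPLE_FEED is a made-up stand-in for a downloaded feed (not real Reddit output), and list_entries is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

# Hypothetical sample standing in for a downloaded feed; not real Reddit output.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Example post</title>
    <link href="https://www.reddit.com/r/example/comments/abc123/example_post/"/>
    <content type="html">&lt;p&gt;Original post text only&lt;/p&gt;</content>
  </entry>
</feed>"""


def list_entries(feed_xml):
    """Return (title, link, embedded content) for every entry in an Atom feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.iter(ATOM + 'entry'):
        title = entry.findtext(ATOM + 'title', default='')
        link = entry.find(ATOM + 'link')
        href = link.get('href', '') if link is not None else ''
        content = entry.findtext(ATOM + 'content', default='')
        entries.append((title, href, content))
    return entries


# Print what each entry embeds, so it is easy to see whether comments are present.
for title, href, content in list_entries(SAMPLE_FEED):
    print(title, href, len(content))
```

If the embedded content holds only the original post, the comments have to come from the page each entry links to.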
|
11-13-2018, 06:22 AM | #3 |
Member
|
No, it doesn't. Thanks, I did not realise that. I wasn't sure whether calibre took the article content from the RSS itself or used the RSS as a source of links, as with this feed, http://feeds.feedburner.com/CrackedRSS/, used in this recipe.
That recipe uses feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS/')] but changing my feed entry to the same format didn't help.
Last edited by Phoebus; 11-13-2018 at 06:30 AM. |
11-13-2018, 09:16 AM | #4 |
creator of calibre
|
The field use_embedded_content in the recipe controls whether content is read from the feed or the linked page is scraped.
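As a rough illustration, the recipe from the first post could set that field as below. This is a sketch, not a tested recipe: the class name RedditLinkedPages is hypothetical, and the try/except stand-in exists only so the snippet can be loaded outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be read and imported outside calibre.
    class BasicNewsRecipe(object):
        pass


class RedditLinkedPages(BasicNewsRecipe):  # hypothetical recipe name
    title = 'Reddit testing'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup = False

    # False: ignore the article body embedded in the feed and download the
    # page each feed entry links to. True would use the feed body as the
    # article; None leaves the choice to calibre.
    use_embedded_content = False

    feeds = [
        ('Reddit testing', 'https://www.reddit.com/search.xml?q=testing&sort=top&t=week'),
    ]
```

With the field set to False, what ends up in the book depends on the linked page's HTML, so keep_only_tags and remove_tags usually need tuning next.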
|
11-13-2018, 03:58 PM | #5 |
Member
|
Thanks
|
11-16-2018, 06:43 AM | #6 |
Member
|
Thanks again for your help. Here is an Alpha version of the code. Bugs:
Usage: build your feed link by following one of these guides: https://www.reddit.com/wiki/rss or https://www.reddit.com/r/pathogendav...ss_and_reddit/ For example, I use a search feed to get results for horror stories, but the recipe should work with any search, subreddit, post, comments or user feed described in those guides. I've set it up for a weekly search, but you can change that. Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1542030622(BasicNewsRecipe):
    title = 'Reddit weekly - alpha'
    auto_cleanup = False
    __author__ = 'phoebus'
    language = 'en'
    description = "Tales from the internet"
    publisher = 'Reddit users'
    # Assumed value: the original used `category` below without defining it.
    category = 'reddit'
    oldest_article = 7  # days - change as required
    max_articles_per_feed = 50  # change as required
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False  # scrape the linked page, not the feed body
    recursions = 11
    remove_attributes = ['size', 'style']

    # see https://www.reddit.com/wiki/rss or
    # https://www.reddit.com/r/pathogendavid/comments/tv8m9/pathogendavids_guide_to_rss_and_reddit/
    feeds = [
        (u'Articles', u'INSERT YOUR RSS LINK'),
    ]

    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language,
    }

    keep_only_tags = [
        dict(name='p', attrs={'class': ['title']}),
        dict(name='span', attrs={'class': ['domain']}),
        dict(name='div', attrs={'class': ['expando']}),
        dict(name='h1', attrs={'class': ['hover redditname']}),
        dict(name='meta', attrs={'property': ['og:title']}),
        # The original had attrs={'title'}, a set rather than a dict.
        # Matching <meta name="title"> is a guess at the intent.
        dict(name='meta', attrs={'name': 'title'}),
        dict(name='div', attrs={'class': [
            'entry unvoted',
            'usertext-body may-blank-within md-container ',
            'usertext-body may-blank-within md-container',
            'md',
        ]}),
        dict(name='div', attrs={'data-test-id': ['post-content']}),
        dict(name='div', attrs={'class': ['s10usnt7-0 gxtxxZ']}),
    ]

    remove_tags = [
        dict(name='button'),
        dict(name='span', attrs={'class': ['flair', 'flair ']}),
        dict(name='div', attrs={'data-author': ['AutoModerator']}),
        dict(name='ul', attrs={'class': ['flat-list buttons']}),
        dict(name='input', attrs={'type': ['hidden']}),
        dict(name='svg'),
    ]

    def is_link_wanted(self, url, a):
        # Follow only "next" pagination links; .get() avoids a KeyError
        # on links that have no class attribute.
        return (a.get('class') == 'next' and
                a.findParent('nav', attrs={'class': 'PaginationContent'}) is not None)

    def postprocess_html(self, soup, first_fetch):
        # Drop AutoModerator comments from the downloaded page.
        for div in soup.findAll(attrs={'data-author': 'AutoModerator'}):
            div.extract()
        return soup

Last edited by Phoebus; 11-19-2018 at 04:24 AM. |
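For the search case mentioned in the usage note, the link that replaces INSERT YOUR RSS LINK can be assembled rather than typed by hand. This is only a sketch, written for Python 3. reddit_search_feed is a hypothetical helper, and the URL shape is copied from the search link in the first post rather than from the Reddit guides.

```python
from urllib.parse import urlencode


def reddit_search_feed(query, sort='top', period='week'):
    """Build a Reddit search feed URL shaped like the one in the first post."""
    params = [('q', query), ('sort', sort), ('t', period)]
    return 'https://www.reddit.com/search.xml?' + urlencode(params)


# The resulting tuple is what goes into the recipe's `feeds` list.
feeds = [
    ('Articles', reddit_search_feed('horror stories')),
]
print(feeds[0][1])
```

Subreddit, post, comment and user feeds use different URL shapes, so for those the guides linked above remain the reference.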