Reddit feed with comments

Phoebus · 11-12-2018, 10:10 AM

Hello, I thought that I could set up a Reddit feed to get the top results for the past week for a key phrase. I used the basic feature in Calibre to get the feed and the original post but it doesn't capture the other users' comments. Any tips on what I should change?

I've also put the RSS feed through Feedburner, using http://feeds.feedburner.com/Redditco...esults-Testing, but it makes no difference.

Thanks

Code:

#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1542031690(BasicNewsRecipe):
    title          = 'Reddit testing'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup   = False

    feeds          = [
        ('Reddit testing', 'https://www.reddit.com/search.xml?q=testing&sort=top&t=week'),
    ]

kovidgoyal · 11-13-2018, 12:00 AM

Does the RSS feed actually include the comments? If not, you would need to get your recipe to scrape the actual Reddit website.
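One way to check what a feed carries is to parse it and look at what each entry embeds. The sketch below is only an illustration and is written for Python 3: it assumes the feed is Atom, and SAMPLE_FEED and summarize_entries are hypothetical stand-ins, not real Reddit data or part of calibre.

```python
import xml.etree.ElementTree as ET

# Atom namespace prefix as used by ElementTree (assumes an Atom feed).
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample standing in for a downloaded feed; not real Reddit data.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>sample search results</title>
  <entry>
    <title>Example post</title>
    <link href="https://www.reddit.com/r/example/comments/abc123/example_post/"/>
    <content type="html">&lt;p&gt;Original post text only.&lt;/p&gt;</content>
  </entry>
</feed>
"""


def summarize_entries(feed_xml):
    """Return (title, link, embedded content length) for each Atom entry."""
    root = ET.fromstring(feed_xml)
    rows = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        link_el = entry.find(ATOM + "link")
        link = link_el.get("href", "") if link_el is not None else ""
        content = entry.findtext(ATOM + "content", default="")
        rows.append((title, link, len(content)))
    return rows


for title, link, size in summarize_entries(SAMPLE_FEED):
    print(title, link, size)
```

If the embedded content is short and holds only the original post, the comments have to come from the linked page instead.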

Phoebus · 11-13-2018, 07:22 AM

No, it doesn't. Thanks, I did not realise. I wasn't sure whether Calibre read the content from the RSS itself or used the RSS as a source of links, like the feed http://feeds.feedburner.com/CrackedRSS/ used in this recipe.

That recipe uses feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS/')] but changing my feed to that format didn't help.

kovidgoyal · 11-13-2018, 10:16 AM

The field use_embedded_content in the recipe controls whether content is read from the feed or the linked page is scraped.
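As a rough sketch of that setting, here is the first recipe from this thread with use_embedded_content set explicitly. The class name RedditLinkedPages is made up for illustration, and the try/except fallback is an assumption added only so the snippet can be loaded outside calibre; inside calibre the real BasicNewsRecipe is used.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback for reading or running this sketch outside calibre (assumption).
    BasicNewsRecipe = object


class RedditLinkedPages(BasicNewsRecipe):  # hypothetical class name
    title = 'Reddit testing'
    oldest_article = 7
    max_articles_per_feed = 100
    auto_cleanup = False

    # False: ignore the text embedded in the feed and download the page each
    # entry links to. True: use only the feed's embedded text. None (the
    # default): let calibre guess from the length of the embedded text.
    use_embedded_content = False

    feeds = [
        ('Reddit testing', 'https://www.reddit.com/search.xml?q=testing&sort=top&t=week'),
    ]
```

With the linked page being downloaded, which parts of it are kept is then up to settings such as keep_only_tags and remove_tags.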

Phoebus · 11-13-2018, 04:58 PM

Thanks

Phoebus · 11-16-2018, 07:43 AM

Thanks again for your help. Here is an alpha version of the code. Bugs:

A subreddit's AutoModerator rules appear at the start of each post.
In-page links to images (e.g. those to imgur or i.reddit) are not pulled in, though that may be for the best.
Some of the code is junk, as I've cannibalised it from other recipes, and may not need to be there.
The subreddit name is not displayed in the title.

Usage: you must get your links as per these guides https://www.reddit.com/wiki/rss or https://www.reddit.com/r/pathogendav...ss_and_reddit/

For example I use it as a search to get results for horror stories, but you can use it for any search, subreddit, post, comments or users as per the links above.

I've set it for a weekly search but obviously you can change this.
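For search feeds, the link can be assembled from its parts. This is a minimal Python 3 sketch modelled on the search URL from the first post; build_search_feed is a hypothetical helper, and any period value other than "week" (for example "month") is an assumption to be checked against the guides above.

```python
from urllib.parse import urlencode


def build_search_feed(query, sort="top", period="week"):
    """Build a Reddit search feed URL (hypothetical helper, not part of calibre)."""
    # A list of pairs keeps the parameters in the same order as the thread's URL.
    params = [("q", query), ("sort", sort), ("t", period)]
    return "https://www.reddit.com/search.xml?" + urlencode(params)


print(build_search_feed("testing"))
# → https://www.reddit.com/search.xml?q=testing&sort=top&t=week
```

The returned string is what goes in place of the placeholder in the feeds list of the recipe.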

Code:

#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1542030622(BasicNewsRecipe):
    title          = 'Reddit weekly - alpha'
    auto_cleanup   = False
    __author__ = 'phoebus'
    language = 'en'
    description = "Tales from the internet"
    publisher = 'Reddit users'
    oldest_article =7  # days - change as required
    max_articles_per_feed = 50 # change as required
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']


    # Replace the placeholder with your own RSS link; see
    # https://www.reddit.com/wiki/rss or
    # https://www.reddit.com/r/pathogendavid/comments/tv8m9/pathogendavids_guide_to_rss_and_reddit/
    feeds          = [
        (u'Articles', u'INSERT YOUR RSS LINK'),
    ]

    # Add 'tags': category here only if you also define a category attribute above.
    conversion_options = {
        'comment': description, 'publisher': publisher, 'language': language
    }

    # Keep only the post title, source domain, post body and comment bodies.
    keep_only_tags = [
        dict(name='p', attrs={'class': ['title']}),
        dict(name='span', attrs={'class': ['domain']}),
        dict(name='div', attrs={'class': ['expando']}),
        dict(name='h1', attrs={'class': ['hover redditname']}),
        dict(name='meta', attrs={'property': ['og:title']}),
        # Assumed intent: any <meta> tag that carries a title attribute.
        dict(name='meta', attrs={'title': True}),
        dict(name='div', attrs={'class': [
            'entry unvoted',
            'usertext-body may-blank-within md-container ',
            'usertext-body may-blank-within md-container',
            'md',
        ]}),
        dict(name='div', attrs={'data-test-id': ['post-content']}),
        dict(name='div', attrs={'class': ['s10usnt7-0 gxtxxZ']}),
    ]

    # Strip buttons, flair, AutoModerator posts, action lists, hidden inputs and icons.
    remove_tags = [
        dict(name='button'),
        dict(name='span', attrs={'class': ['flair', 'flair ']}),
        dict(name='div', attrs={'data-author': ['AutoModerator']}),
        dict(name='ul', attrs={'class': ['flat-list buttons']}),
        dict(name='input', attrs={'type': ['hidden']}),
        dict(name='svg'),
    ]


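    def is_link_wanted(self, url, a):
        # 'class' may be missing, a plain string or a list, depending on the parser.
        classes = a.get('class') or []
        if not isinstance(classes, (list, tuple)):
            classes = classes.split()
        return 'next' in classes and a.findParent('nav', attrs={'class': 'PaginationContent'}) is not None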

    def postprocess_html(self, soup, first_fetch):
        for div in soup.findAll(attrs={'data-author':'AutoModerator'}):
            div.extract()
        return soup
