Old 02-22-2019, 08:34 AM   #1
Phoebus
Stumped by infinite scroll

Hi,

I've created a recipe to scrape Reddit searches each month. However, I am only getting a few of the replies. I think this is partly because Reddit loads comments with an infinite scroll, though that may not be the right term.

I can't get the recipe to follow the 'more replies' links either.

I've searched this forum, and it looks like I should find the Ajax request the page makes, but I can't work out how to do that. Any tips?

Thanks

Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1542030622(BasicNewsRecipe):
    title          = 'Monthly Reddit scrape'
    auto_cleanup   = False
    __author__ = '2019-02-22'
    language = 'en'
    description = "Creepiest tales on the internet"
    publisher = 'Reddit users'
    category = 'horror'
    oldest_article = 40  # days
    max_articles_per_feed = 50
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']


    feeds          = [
        (u'Articles', u'http://feeds.feedburner.com/CreepiestReddit-Month'),
    ]
    
    
    conversion_options = {
        'comment': description, 'tags': category, 'publisher': publisher, 'language': language
    }

    keep_only_tags = [
        dict(name='p', attrs={'class': ['title']}),
        dict(name='span', attrs={'class': ['domain']}),
        dict(name='div', attrs={'tabindex': ['-1']}),
        dict(name='div', attrs={'data-test-id': ['post-content']}),
        dict(name='span'),
    ]

    remove_tags = [
        dict(name='button'),
        dict(name='span', attrs={'class': [
            'flair',
            'flair ',
            's6wlmco-0 jecSt',
            's7pq5uy-2 iCbvoa',
            'cu1hzx-0 iogJLn',
            's6wlmco-3 bsaIpo',
        ]}),
        dict(name='div', attrs={'data-author': ['AutoModerator']}),
        dict(name='div', attrs={'data-redditstyle': ['false']}),
        dict(name='div', attrs={'class': [
            's6wlmco-0 jecSt',
            's7pq5uy-2 iCbvoa',
            's1muqojl-0 jMnEuz',
        ]}),
        dict(name='ul', attrs={'class': ['flat-list buttons']}),
        dict(name='input', attrs={'type': ['hidden']}),
        dict(name='svg'),
        dict(name='i'),
        dict(name='img', attrs={'role': ['presentation']}),
    ]

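    def is_link_wanted(self, url, a):
        # a.get() avoids a KeyError on links that have no class attribute;
        # the value may be a string or a list depending on the BeautifulSoup version
        classes = a.get('class') or []
        if not isinstance(classes, list):
            classes = classes.split()
        return 'next' in classes and a.findParent('nav', attrs={'class': 'PaginationContent'}) is not None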

    def postprocess_html(self, soup, first_fetch):
        for div in soup.findAll(attrs={'data-author':'AutoModerator'}):
            div.extract()
        return soup
Old 02-22-2019, 10:58 PM   #2
kovidgoyal
creator of calibre
Use the developer tools in your browser to see what Ajax query the site makes to load more comments, and replicate that query in your recipe.
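
As a rough illustration of that advice (a sketch under stated assumptions, not a tested recipe): Reddit has, for example, served a JSON version of a thread when `.json` is appended to its URL. Whether that matches the request the page itself makes is something to confirm in the network tab of the developer tools. The sketch below assumes the response is a nested "Listing" in which comments have kind `t1` and unloaded replies are stubs of kind `more` carrying a list of child IDs. The function name `collect_comments` and the `SAMPLE` data are hypothetical, made up for the example.

```python
import json


def collect_comments(listing, bodies=None, more_ids=None):
    """Walk a Reddit-style comment listing (assumed shape) recursively.

    Returns (bodies, more_ids): the comment texts found, and the IDs of
    replies that were not included and need a follow-up query.
    """
    if bodies is None:
        bodies = []
    if more_ids is None:
        more_ids = []
    for child in listing.get('data', {}).get('children', []):
        kind = child.get('kind')
        data = child.get('data', {})
        if kind == 't1':  # assumed: 't1' marks a comment
            bodies.append(data.get('body', ''))
            replies = data.get('replies')
            # assumed: a comment without replies carries '' instead of a listing
            if isinstance(replies, dict):
                collect_comments(replies, bodies, more_ids)
        elif kind == 'more':  # assumed: stub for comments not yet loaded
            more_ids.extend(data.get('children', []))
    return bodies, more_ids


# Hypothetical sample data shaped like the assumed response
SAMPLE = json.loads('''
{"kind": "Listing", "data": {"children": [
  {"kind": "t1", "data": {"body": "top-level comment", "replies":
    {"kind": "Listing", "data": {"children": [
      {"kind": "t1", "data": {"body": "nested reply", "replies": ""}},
      {"kind": "more", "data": {"children": ["abc", "def"]}}
    ]}}
  }},
  {"kind": "more", "data": {"children": ["ghi"]}}
]}}
''')

bodies, more_ids = collect_comments(SAMPLE)
print(bodies)
print(more_ids)
```

In a recipe, the raw response could be fetched with `self.index_to_soup(url, raw=True)` and passed through `json.loads` before walking it like this. The IDs collected in `more_ids` are what the follow-up "load more comments" query would need, and the exact form of that query is the part to copy from the developer tools.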
All times are GMT -4.