Recipe not removing tags

Phoebus · 02-27-2019, 10:42 AM

Hi, I've created a recipe (a follow-up to an earlier post; I have since found a different feed with nicer HTML and no infinite scroll), but I cannot for the life of me remove a specific tag.

I'm trying to remove <div class="side"> and/or <div class="spacer">. I do want the tag <div class="md">, just not when it is nested within a "side" or "spacer" div.

As shown by the commented out code I have tried a few things (both using Beautiful Soup and without it) but nothing seems to work. Any suggestions?

The other problem is that some pages ask me to click a button to confirm I want to view the page. Inspecting the code, I can't see any <a> link it goes to. I've tried

return button['value'] == 'yes'

but to no avail. That is secondary to removing the tags, though.

Code:

#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class AdvancedUserRecipe1542030622(BasicNewsRecipe):
    title          = 'Strange Reddit'
    auto_cleanup   = False
    __author__ = 'Phoebus'
    language = 'en'
    description = "Strange tales"
    publisher = 'Reddit users'
    category = 'horror'
    oldest_article =40  # days
    max_articles_per_feed = 50
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']


    feeds          = [
        (u'Articles', u'http://feeds.feedburner.com/CreepiestReddit-Month'),
    ]
    
    
    conversion_options = {
        'comment': description, 'tags': category, 'publisher': publisher, 'language': language
    }

    remove_tags_before = dict(id='top-matter')

    remove_tags = [

#        dict(name='span', attrs={'class': [
#            'flair',
#            'flair ',
#            'user',
#        ]}),

        dict(name='div', attrs={'data-author': [
            'AutoModerator',
        ]}),

        dict(name='a', attrs={'class': [
            'expand',
        ]}),

        dict(name='div', attrs={'class': [
            'titlebox',
            'spacer',
#            'side',
        ]}),
        dict(id='side'),

        dict(attrs={'class': 'spacer'}),

    ]


    keep_only_tags = [

        dict(name='title'),

        dict(name='div', attrs={'class': [
            'entry unvoted',
            'md',
        ]}),
#        dict(id='md'),

    ]


  
    

    def is_link_wanted(self, url, a):
        # 'button' is not defined in this scope; only 'url' and the
        # link tag 'a' are available here.
        return button['value'] == 'yes'


    def preprocess_html(self, soup):
#        for div in soup.findAll('div', attrs={'class':'side'}):
#            div.decompose()
#        soup.find('div', id='side').decompose()
#        for div in soup.find_all("div", {'class':'spacer'}):
#            div.decompose()
        for div in soup('div', {'class':'side'}):
            div.decompose()
        return soup

 
        
#    def postprocess_html(self, soup, first_fetch):
#        for div in soup.findAll(attrs={'class':'side'}):
#            div.decompose()
#        soup.find('div', id='side').decompose()
#        for div in soup.find_all("div", {'class':'spacer'}):
#            div.decompose()
#        for div in soup('div', {'class':'side'}):
#            div.decompose()
#        return soup

kovidgoyal · 02-28-2019, 07:45 PM

Use something like this to match classes

Code:

def classes(classes):
    q = frozenset(classes.split(' '))
    return dict(attrs={
        'class': lambda x: x and frozenset(x.split()).intersection(q)})

remove_tags = [
    classes('side spacer')
]
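To see what the helper above actually matches, the callable it builds can be tried outside calibre. This is a minimal sketch, assuming the class attribute reaches the matcher as a single space-separated string (which is what the x.split() call in the helper expects). The sample class values are made up for illustration.

```python
def classes(names):
    # Same helper as in the post: match any tag whose class attribute
    # shares at least one whole class name with `names`.
    q = frozenset(names.split(' '))
    return dict(attrs={
        'class': lambda x: x and frozenset(x.split()).intersection(q)})


# The matcher is the callable stored under attrs['class'].
matcher = classes('side spacer')['attrs']['class']

# Hypothetical class attribute values; None stands for a tag with no class.
samples = ['side', 'spacer clearleft', 'md', 'sidebar', None]
results = {s: bool(matcher(s)) for s in samples}

for value, matched in results.items():
    print(repr(value), matched)
```

Because the comparison is on whole class names, a tag with several classes matches as soon as one of them is listed, while a class such as "sidebar" does not match "side".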

Phoebus · 03-01-2019, 12:38 PM

Thanks, but I couldn't get that to work, even after moving it around in the recipe. I wonder whether the HTML the script receives differs from what I see in the browser.

kovidgoyal · 03-01-2019, 07:15 PM

If you want to check the HTML the script sees, simply save it inside preprocess_raw_html, something like

Code:

def preprocess_raw_html(self, html, *a):
    # Dump the raw HTML exactly as the recipe received it
    with open('/path/to/somewhere/on/your/computer/file.html', 'wb') as f:
        f.write(html.encode('utf-8'))
    return html
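Once the page has been saved, one way to compare it with the browser view is to count which class names appear on its div tags. The following is a rough standalone sketch using only the Python 3 standard library, run outside calibre. DivClassCounter and count_div_classes are hypothetical helper names, and the sample string stands in for the contents of the saved file.

```python
from collections import Counter
from html.parser import HTMLParser


class DivClassCounter(HTMLParser):
    """Count the class names that appear on <div> start tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        for name, value in attrs:
            if name == 'class' and value:
                self.counts.update(value.split())


def count_div_classes(html):
    parser = DivClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Stand-in for the saved page; in practice, read the dumped file instead.
sample = '''
<div class="side"><div class="spacer"><div class="md">sidebar</div></div></div>
<div class="entry unvoted"><div class="md">story</div></div>
'''
counts = count_div_classes(sample)
for name in ('side', 'spacer', 'md'):
    print(name, counts[name])
```

A count of zero for "side" or "spacer" in the real dump would suggest the recipe is being served different markup from the browser. Non-zero counts point back at the recipe's keep and remove settings.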

Phoebus · 03-04-2019, 05:00 AM

Thanks, I did that. The HTML from the log looks correct, as does the pattern. Very odd. I tried experimenting with the order of keeping/removing tags to see if that made any difference, but no.
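One possibility that would fit these symptoms, if calibre applies keep_only_tags before remove_tags and before preprocess_html (an assumption not confirmed anywhere in this thread): every "md" div is collected on its own, including the one nested in the sidebar, so by the time the "side" div is looked for it is no longer there. The toy model below uses xml.etree.ElementTree rather than calibre or Beautiful Soup, and the page markup and function names are invented for illustration. It only shows how the order of the two steps changes the result.

```python
import xml.etree.ElementTree as ET

# Invented markup: one "md" div inside the sidebar, one inside the story.
PAGE = '''<body>
<div class="side"><div class="md">sidebar text</div></div>
<div class="entry unvoted"><div class="md">story text</div></div>
</body>'''

KEEP = {'md'}
REMOVE = {'side', 'spacer'}


def has_class(el, names):
    return bool(set(el.get('class', '').split()) & names)


def keep_then_remove(page):
    """Collect KEEP matches first, then filter REMOVE matches out of them."""
    body = ET.fromstring(page)
    kept = [el for el in body.iter('div') if has_class(el, KEEP)]
    # Only the collected "md" divs are left to inspect; the "side"
    # wrapper was never collected, so this filter removes nothing.
    kept = [el for el in kept if not has_class(el, REMOVE)]
    return [el.text for el in kept]


def remove_then_keep(page):
    """Drop REMOVE subtrees first, then collect KEEP matches."""
    body = ET.fromstring(page)
    for parent in list(body.iter()):
        for child in list(parent):
            if has_class(child, REMOVE):
                parent.remove(child)
    return [el.text for el in body.iter('div') if has_class(el, KEEP)]


print(keep_then_remove(PAGE))
print(remove_then_keep(PAGE))
```

If that ordering does hold in calibre, the sidebar would have to be stripped before keep_only_tags runs (for example on the raw HTML), or keep_only_tags would need to target a wrapper that does not include the sidebar.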
