Recipe not removing tags

Phoebus · 02-27-2019, 10:42 AM

Hi, I've created a recipe (a follow-up to an earlier post; I have since found a different feed with nicer HTML and no infinite scroll) but I cannot for the life of me remove a specific tag.

I'm trying to remove <div class="side"> and/or <div class="spacer">. I do want the tag <div class="md">, just not when it is nested within a "side" or "spacer" div.

As shown by the commented-out code, I have tried a few things (both with Beautiful Soup and without it) but nothing seems to work. Any suggestions?

The other problem is that some pages ask me to click a button to confirm I want to view the page. Inspecting the code, I can't see any <a> link that the button goes to. I've tried

return button['value'] == 'yes'

But to no avail. That's secondary to removing the tags, though.

Code:

#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class AdvancedUserRecipe1542030622(BasicNewsRecipe):
    title          = 'Strange Reddit'
    auto_cleanup   = False
    __author__ = 'Phoebus'
    language = 'en'
    description = "Strange tales"
    publisher = 'Reddit users'
    category = 'horror'
    oldest_article = 40  # days
    max_articles_per_feed = 50
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']


    feeds          = [
        (u'Articles', u'http://feeds.feedburner.com/CreepiestReddit-Month'),
    ]
    
    
    conversion_options = {
        'comment': description, 'tags': category, 'publisher': publisher, 'language': language
    }

    remove_tags_before = dict(id='top-matter')

    remove_tags = [
        # dict(name='span', attrs={'class': [
        #     'flair',
        #     'flair ',
        #     'user',
        # ]}),
        dict(name='div', attrs={'data-author': [
            'AutoModerator',
        ]}),
        dict(name='a', attrs={'class': [
            'expand',
        ]}),
        dict(name='div', attrs={'class': [
            'titlebox',
            'spacer',
            # 'side',
        ]}),
        dict(id='side'),
        dict(attrs={'class': 'spacer'}),
    ]


    
    keep_only_tags = [
        dict(name='title'),
        dict(name='div', attrs={'class': [
            'entry unvoted',
            'md',
        ]}),
        # dict(id='md'),
    ]


  
    

    def is_link_wanted(self, url, a):
        return button['value'] == 'yes'


    def preprocess_html(self, soup):
        # for div in soup.findAll('div', attrs={'class':'side'}):
        #     div.decompose()
        # soup.find('div', id='side').decompose()
        # for div in soup.find_all("div", {'class':'spacer'}):
        #     div.decompose()
        for div in soup('div', {'class': 'side'}):
            div.decompose()
        return soup

 
        
    # def postprocess_html(self, soup, first_fetch):
    #     for div in soup.findAll(attrs={'class':'side'}):
    #         div.decompose()
    #     soup.find('div', id='side').decompose()
    #     for div in soup.find_all("div", {'class':'spacer'}):
    #         div.decompose()
    #     for div in soup('div', {'class':'side'}):
    #         div.decompose()
    #     return soup

kovidgoyal · 02-28-2019, 07:45 PM

Use something like this to match classes:

Code:

def classes(classes):
    q = frozenset(classes.split(' '))
    return dict(attrs={
        'class': lambda x: x and frozenset(x.split()).intersection(q)})

remove_tags = [
    classes('side spacer')
]
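
To see what the helper above matches, the lambda can be called directly on a few class attribute values. This is only a sketch: the sample values are made up for illustration, and it assumes the class attribute reaches the lambda as one space-separated string, which is what the lambda itself expects.

```python
from __future__ import print_function


def classes(classes):
    # Same helper as in the post above: match any tag whose class
    # attribute shares at least one name with the given list.
    q = frozenset(classes.split(' '))
    return dict(attrs={
        'class': lambda x: x and frozenset(x.split()).intersection(q)})


# Pull the matcher function out of the dict that goes into remove_tags
matcher = classes('side spacer')['attrs']['class']

# Hypothetical class attribute values, as they might appear in the HTML
for value in ('side', 'spacer', 'md', 'side md', 'sidebar', None):
    print(repr(value), '->', bool(matcher(value)))
```

Whole class names are compared, so 'side md' matches but 'sidebar' does not, and a tag with no class attribute (None) is skipped.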

Phoebus · 03-01-2019, 12:38 PM

Thanks, I couldn't get that to work, even after moving it around. I wonder if the HTML the script sees is different from what I can see in the browser.

kovidgoyal · 03-01-2019, 07:15 PM

If you want to check the HTML the script sees, simply save it inside preprocess_raw_html, something like:

Code:

def preprocess_raw_html(self, html, *a):
    # html is the page as the recipe downloaded it, before any tag filtering
    with open('/path/to/somewhere/on/your/computer/file.html', 'wb') as f:
        f.write(html.encode('utf-8'))
    return html
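
A variant of the same idea, as a sketch only: a small standalone helper that writes the HTML into the system temp directory, so no path needs to be hard-coded. The names dump_html and recipe_debug.html are made up for this example and are not part of calibre.

```python
from __future__ import print_function

import io
import os
import tempfile


def dump_html(html, filename='recipe_debug.html'):
    """Write html to the system temp directory and return the file path."""
    # Accept bytes as well as text, in case the caller passes raw bytes
    if isinstance(html, bytes):
        html = html.decode('utf-8', 'replace')
    path = os.path.join(tempfile.gettempdir(), filename)
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(html)
    return path


# Inside the recipe class, the hook could then be (sketch):
#     def preprocess_raw_html(self, raw_html, url):
#         self.log('Saved HTML to', dump_html(raw_html))
#         return raw_html

# Standalone check with a hypothetical snippet of page HTML
path = dump_html(u'<div class="side"><div class="md">sidebar</div></div>')
print(path)
```

Opening the saved file in a browser or text editor shows whether the "side" and "spacer" divs are present in what the recipe actually downloads.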

Phoebus · 03-04-2019, 05:00 AM

Thanks, I did that and the HTML from the log looks correct as does the pattern. Very odd. Tried experimenting with the order of keeping/removing tags to see if that made any difference but no.
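
One possible explanation, not confirmed anywhere in this thread: if calibre applies keep_only_tags before remove_tags and preprocess_html, then a <div class="md"> nested inside <div class="side"> is pulled out on its own by the keep step, and by the time "side" is removed there is no "side" div left to match. The sketch below models that order with the standard library only. The page snippet and the keep_only/remove functions are simplified stand-ins for illustration, not calibre's code.

```python
from __future__ import print_function

import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified page: one "md" in the sidebar, one in the post
PAGE = (
    '<body>'
    '<div class="side"><div class="md">sidebar text</div></div>'
    '<div class="entry unvoted"><div class="md">story text</div></div>'
    '</body>'
)


def has_class(el, name):
    return name in el.get('class', '').split()


def keep_only(body, name):
    # Model of the keep step: collect every matching element,
    # nested or not, into a fresh body.
    new_body = ET.Element('body')
    for el in list(body.iter('div')):
        if has_class(el, name):
            new_body.append(el)
    return new_body


def remove(body, name):
    # Model of the remove step: drop matching elements and their children.
    for parent in list(body.iter()):
        for child in list(parent):
            if has_class(child, name):
                parent.remove(child)
    return body


# Keep first, then remove: the sidebar "md" has already been kept on its own
kept_then_removed = remove(keep_only(ET.fromstring(PAGE), 'md'), 'side')
print([el.text for el in kept_then_removed])   # ['sidebar text', 'story text']

# Remove first, then keep: the sidebar "md" is gone before the keep step runs
removed_then_kept = keep_only(remove(ET.fromstring(PAGE), 'side'), 'md')
print([el.text for el in removed_then_kept])   # ['story text']
```

If that is what is happening here, it may be worth trying the recipe without 'md' in keep_only_tags, or stripping the sidebar at an earlier stage such as preprocess_raw_html.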
