Old 02-27-2019, 10:42 AM   #1
Phoebus
Member
Posts: 22
Karma: 10
Join Date: Aug 2015
Device: Kobo Aura H2O
Recipe not removing tags

Hi, I've created a recipe (a follow-up to an earlier post; I have since found a different feed with nicer HTML and no infinite scroll), but I cannot for the life of me remove a specific tag.

I'm trying to remove <div class="side"> and/or <div class="spacer">. I do want the tag <div class="md">, just not when it is nested within a "side" or "spacer" div.

As shown by the commented out code I have tried a few things (both using Beautiful Soup and without it) but nothing seems to work. Any suggestions?

The other problem is that some pages ask for me to click a button to confirm I want to view the page. Inspecting the code I can't see any <a> link it goes to. I've tried

return button['value'] == 'yes'

But to no avail. But that's secondary to removing the tags.

Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class AdvancedUserRecipe1542030622(BasicNewsRecipe):
    title          = 'Strange Reddit'
    auto_cleanup   = False
    __author__ = 'Phoebus'
    language = 'en'
    description = "Strange tales"
    publisher = 'Reddit users'
    category = 'horror'
    oldest_article = 40  # days
    max_articles_per_feed = 50
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']


    feeds          = [
        (u'Articles', u'http://feeds.feedburner.com/CreepiestReddit-Month'),
    ]
    
    
    conversion_options = {
        'comment': description, 'tags': category, 'publisher': publisher, 'language': language
    }

    remove_tags_before = dict(id='top-matter')

    remove_tags = [
#        dict(name='span', attrs={'class': ['flair', 'flair ', 'user']}),
        dict(name='div', attrs={'data-author': ['AutoModerator']}),
        dict(name='a', attrs={'class': ['expand']}),
        dict(name='div', attrs={'class': [
            'titlebox',
            'spacer',
#            'side',
        ]}),
        dict(id='side'),
        dict(attrs={'class': 'spacer'}),
    ]

    keep_only_tags = [
        dict(name='title'),
        dict(name='div', attrs={'class': [
            'entry unvoted',
            'md',
        ]}),
#        dict(id='md'),
    ]


  
    

    def is_link_wanted(self, url, a):
        return button['value'] == 'yes'


    def preprocess_html(self, soup):
#        for div in soup.findAll('div', attrs={'class':'side'}):
#            div.decompose()
#        soup.find('div', id='side').decompose()
#        for div in soup.find_all("div", {'class':'spacer'}):
#            div.decompose()
        for div in soup('div', {'class':'side'}):
            div.decompose()
        return soup

#    def postprocess_html(self, soup, first_fetch):
#        for div in soup.findAll(attrs={'class':'side'}):
#            div.decompose()
#        soup.find('div', id='side').decompose()
#        for div in soup.find_all("div", {'class':'spacer'}):
#            div.decompose()
#        for div in soup('div', {'class':'side'}):
#            div.decompose()
#        return soup
Old 02-28-2019, 07:45 PM   #2
kovidgoyal
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Use something like this to match classes

Code:
def classes(classes):
    q = frozenset(classes.split(' '))
    return dict(attrs={
        'class': lambda x: x and frozenset(x.split()).intersection(q)})

remove_tags = [
classes('side spacer')
]
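
To see what that matcher accepts, here is a small standalone sketch that runs outside calibre. It copies the classes() helper from the post and calls the lambda directly on a few class strings. The matches() wrapper and the sample class strings are only for illustration and are not part of any recipe.

```python
def classes(names):
    # Same helper as in the post: build a spec whose 'class' test is a callable
    q = frozenset(names.split(' '))
    return dict(attrs={
        'class': lambda x: x and frozenset(x.split()).intersection(q)})


# Pull the callable out of the spec so it can be tried on plain strings
matcher = classes('side spacer')['attrs']['class']


def matches(class_attr):
    # Hypothetical wrapper: normalise the matcher's result to True/False
    return bool(matcher(class_attr))


print(matches('side'))             # exact class name
print(matches('spacer clearfix'))  # one of several classes on the tag
print(matches('md'))               # unrelated class
print(matches('sidebar'))          # whole words only, so no substring match
print(matches(None))               # tag without a class attribute
```

The point is that the test is done on whole class names, so a tag with class="spacer clearfix" is matched while class="sidebar" is not.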
Old 03-01-2019, 12:38 PM   #3
Phoebus
Thanks, I couldn't get that to work, even after moving it around in the recipe. I wonder if the HTML the script receives is different from what I can see in the browser.
Old 03-01-2019, 07:15 PM   #4
kovidgoyal
If you want to check the HTML the script sees, simply save it from inside preprocess_raw_html, something like

Code:
def preprocess_raw_html(self, html, *a):
    # Dump the raw HTML calibre downloaded so it can be inspected afterwards
    with open('/path/to/somewhere/on/your/computer/file.html', 'wb') as f:
        f.write(html.encode('utf-8'))
    return html
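
Once a page has been saved like that, one way to check whether the divs you are targeting are really in it is to count them outside calibre. This is only a sketch using Python 3's standard html.parser; ClassCounter, count_div_classes and the sample string are made-up names for illustration, and in practice you would read the saved file's contents instead of the sample.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count <div> start tags carrying any of the wanted class names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = frozenset(wanted)
        self.counts = {name: 0 for name in self.wanted}

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        # A class attribute may hold several space-separated names
        class_attr = dict(attrs).get('class') or ''
        for name in self.wanted.intersection(class_attr.split()):
            self.counts[name] += 1


def count_div_classes(html, wanted):
    parser = ClassCounter(wanted)
    parser.feed(html)
    parser.close()
    return parser.counts


# Stand-in for the contents of the saved file
sample = ('<div class="side"><div class="md">x</div></div>'
          '<div class="spacer extra"></div>')
print(count_div_classes(sample, ['side', 'spacer', 'md']))
```

If a class shows a count of zero here, the page calibre downloaded does not contain that div, and no remove_tags rule for it will have any effect.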
Old 03-04-2019, 05:00 AM   #5
Phoebus
Thanks, I did that and the saved HTML looks correct, as does the pattern. Very odd. I tried experimenting with the order of keep_only_tags and remove_tags to see if that made any difference, but no.
All times are GMT -4.