02-27-2019, 10:42 AM | #1 |
Member
Posts: 22
Karma: 10
Join Date: Aug 2015
Device: Kobo Aura H2O
|
Recipe not removing tags
Hi, I've created a recipe (a follow-up to an earlier post; I have since found a different feed with nicer HTML and no infinite scroll), but I cannot for the life of me remove a specific tag.
I'm trying to remove <div class="side"> and/or <div class="spacer">. I do want the tag <div class="md">, just not when it is nested within a "side" or "spacer" div. As the commented-out code shows, I have tried a few things (both with and without Beautiful Soup), but nothing seems to work. Any suggestions?
The other problem is that some pages ask me to click a button to confirm that I want to view the page. Inspecting the code, I can't see any <a> link that it goes to. I've tried return button['value'] == 'yes' but to no avail. That's secondary to removing the tags, though.
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup


class AdvancedUserRecipe1542030622(BasicNewsRecipe):
    title = 'Strange Reddit'
    auto_cleanup = False
    __author__ = 'Phoebus'
    language = 'en'
    description = "Strange tales"
    publisher = 'Reddit users'
    category = 'horror'
    oldest_article = 40  # days
    max_articles_per_feed = 50
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']

    feeds = [
        (u'Articles', u'http://feeds.feedburner.com/CreepiestReddit-Month'),
    ]

    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language
    }

    remove_tags_before = dict(id='top-matter')

    remove_tags = [
        # dict(name='span', attrs={'class': [
        #     'flair',
        #     'flair ',
        #     'user',
        #
        # ]}),
        dict(name='div', attrs={'data-author': [
            'AutoModerator',
        ]}),
        dict(name='a', attrs={'class': [
            'expand',
        ]}),
        dict(name='div', attrs={'class': [
            'titlebox',
            'spacer',
            # 'side',
        ]}),
        dict(id='side'),
        dict(attrs={'class':'spacer'}),
    ]

    keep_only_tags = [
        dict(name='title'),
        dict(name='div', attrs={'class': [
            'entry unvoted',
            'md',
        ]}),
        # dict(id='md'),
    ]

    def is_link_wanted(self, url, a):
        return button['value'] == 'yes'

    def preprocess_html(self, soup):
        # for div in soup.findAll('div', attrs={'class':'side'}):
        #     div.decompose()
        # soup.find('div', id='side').decompose()
        # for div in soup.find_all("div", {'class':'spacer'}):
        #     div.decompose()
        for div in soup('div', {'class':'side'}):
            div.decompose()
        return soup

    # def postprocess_html(self, soup, first_fetch):
    #     for div in soup.findAll(attrs={'class':'side'}):
    #         div.decompose()
    #     soup.find('div', id='side').decompose()
    #     for div in soup.find_all("div", {'class':'spacer'}):
    #         div.decompose()
    #     for div in soup('div', {'class':'side'}):
    #         div.decompose()
    #     return soup
|
02-28-2019, 07:45 PM | #2 |
creator of calibre
Posts: 43,852
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Use something like this to match classes
Code:
def classes(classes):
    q = frozenset(classes.split(' '))
    return dict(attrs={
        'class': lambda x: x and frozenset(x.split()).intersection(q)})

remove_tags = [
    classes('side spacer')
]
|
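To see what the classes() helper above actually matches, here is a minimal sketch that runs the predicate on plain strings, outside calibre. It assumes the matcher is handed the tag's whole class attribute as one string (or None when the tag has no class); the matches() wrapper is a hypothetical name used only for this illustration.

```python
def classes(classes):
    # Same helper as in the post: build a matcher for any of the given class names.
    q = frozenset(classes.split(' '))
    return dict(attrs={
        'class': lambda x: x and frozenset(x.split()).intersection(q)})


# Pull the predicate out of the dict so it can be called directly.
predicate = classes('side spacer')['attrs']['class']


def matches(class_attr):
    # Hypothetical wrapper: reduce the predicate's truthy/falsy result to a bool.
    return bool(predicate(class_attr))


print(matches('side'))             # True
print(matches('spacer clearfix'))  # True
print(matches('md'))               # False
print(matches('sidebar'))          # False
print(matches(None))               # False
```

The point of the set intersection is that it compares whole class tokens, so a tag with several classes still matches, while a class that merely contains "side" as a substring does not.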
03-01-2019, 12:38 PM | #3 |
Member
Posts: 22
Karma: 10
Join Date: Aug 2015
Device: Kobo Aura H2O
|
Thanks, I couldn't get that to work even moving it around. I wonder if there is a different HTML used when the script interrogates it versus what I can see.
|
03-01-2019, 07:15 PM | #4 |
creator of calibre
Posts: 43,852
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
If you want to check the HTML the script sees, simply save it inside preprocess_raw_html, something like
Code:
def preprocess_raw_html(self, html, *a):
    # Dump the markup exactly as the recipe received it, then pass it on unchanged
    with open('/path/to/somewhere/on/your/computer/file.html', 'wb') as f:
        f.write(html.encode('utf-8'))
    return html
|
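The same idea can be tried outside a recipe. The sketch below is only an illustration: dump_html, the temp-file name and the sample markup are made up for the example, and it assumes the markup arrives as text (bytes are written as-is). It saves the markup and then checks whether the class you are trying to remove is present in what was saved.

```python
import os
import tempfile


def dump_html(html, path):
    # Hypothetical helper: write the markup to `path` as UTF-8, return it unchanged.
    data = html if isinstance(html, bytes) else html.encode('utf-8')
    with open(path, 'wb') as f:
        f.write(data)
    return html


# Made-up sample standing in for a downloaded page.
sample = '<div class="side">sidebar</div><div class="md">story text</div>'
path = os.path.join(tempfile.gettempdir(), 'recipe_debug.html')

returned = dump_html(sample, path)

# Read the file back and look for the markup the recipe is supposed to strip.
with open(path, 'rb') as f:
    saved = f.read().decode('utf-8')

print('class="side"' in saved)  # True
```

If the class names in the saved file differ from what the browser's inspector shows, the remove_tags patterns need to be written against the saved version.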
03-04-2019, 05:00 AM | #5 |
Member
Posts: 22
Karma: 10
Join Date: Aug 2015
Device: Kobo Aura H2O
|
Thanks, I did that and the HTML from the log looks correct as does the pattern. Very odd. Tried experimenting with the order of keeping/removing tags to see if that made any difference but no.
|
|