#1 |
|
Member
Posts: 22
Karma: 10
Join Date: Aug 2015
Device: Kobo Aura H2O
|
Recipe not removing tags
Hi, I've created a recipe (a follow-up to an earlier post; I have since found a different feed with nicer HTML and no infinite scroll), but I cannot for the life of me remove a specific tag.
I'm trying to remove <div class="side"> and/or <div class="spacer">. I do want the <div class="md"> tag, just not when it is nested within a "side" or "spacer" div. As the commented-out code shows, I have tried a few things (both with and without Beautiful Soup), but nothing seems to work. Any suggestions?
The other problem is that some pages ask me to click a button to confirm I want to view the page. Inspecting the code, I can't see any <a> link it goes to. I've tried return button['value'] == 'yes', but to no avail. That's secondary to removing the tags, though.
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup


class AdvancedUserRecipe1542030622(BasicNewsRecipe):
    title = 'Strange Reddit'
    auto_cleanup = False
    __author__ = 'Phoebus'
    language = 'en'
    description = "Strange tales"
    publisher = 'Reddit users'
    category = 'horror'
    oldest_article = 40  # days
    max_articles_per_feed = 50
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']
    feeds = [
        (u'Articles', u'http://feeds.feedburner.com/CreepiestReddit-Month'),
    ]
    conversion_options = {
        'comment': description, 'tags': category, 'publisher': publisher, 'language': language
    }
    remove_tags_before = dict(id='top-matter')
    remove_tags = [
        # dict(name='span', attrs={'class': [
        #     'flair',
        #     'flair ',
        #     'user',
        #
        # ]}),
        dict(name='div', attrs={'data-author': [
            'AutoModerator',
        ]}),
        dict(name='a', attrs={'class': [
            'expand',
        ]}),
        dict(name='div', attrs={'class': [
            'titlebox',
            'spacer',
            # 'side',
        ]}),
        dict(id='side'),
        dict(attrs={'class': 'spacer'}),
    ]
    keep_only_tags = [
        dict(name='title'),
        dict(name='div', attrs={'class': [
            'entry unvoted',
            'md',
        ]}),
        # dict(id='md'),
    ]

    def is_link_wanted(self, url, a):
        # Attempt at the confirmation button; note that `button` is never
        # defined in this scope.
        return button['value'] == 'yes'

    def preprocess_html(self, soup):
        # for div in soup.findAll('div', attrs={'class':'side'}):
        #     div.decompose()
        # soup.find('div', id='side').decompose()
        # for div in soup.find_all("div", {'class':'spacer'}):
        #     div.decompose()
        for div in soup('div', {'class': 'side'}):
            div.decompose()
        return soup

    # def postprocess_html(self, soup, first_fetch):
    #     for div in soup.findAll(attrs={'class':'side'}):
    #         div.decompose()
    #     soup.find('div', id='side').decompose()
    #     for div in soup.find_all("div", {'class':'spacer'}):
    #         div.decompose()
    #     for div in soup('div', {'class':'side'}):
    #         div.decompose()
    #     return soup
#2 |
|
creator of calibre
Posts: 45,592
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Use something like this to match classes
Code:
def classes(classes):
    q = frozenset(classes.split(' '))
    return dict(attrs={
        'class': lambda x: x and frozenset(x.split()).intersection(q)})

remove_tags = [classes('side spacer')]
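
To see what the matcher above accepts, here is a minimal, stdlib-only sketch. It assumes the `class` attribute reaches the lambda as a single space-separated string (or `None` when the attribute is absent); the sample class strings are made up for illustration.

```python
def classes(classes):
    # Build a matcher that succeeds when an element carries ANY listed class.
    q = frozenset(classes.split(' '))
    return dict(attrs={
        'class': lambda x: x and frozenset(x.split()).intersection(q)})


matcher = classes('side spacer')['attrs']['class']

# Hypothetical class attribute values, as they might appear in the page.
for value in ('side', 'spacer clearleft', 'md', 'sidebar', None):
    print(repr(value), '->', bool(matcher(value)))
```

Because whole class tokens are compared, `sidebar` does not match `side`, and an element with several classes matches as soon as one of them is listed.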
#3 |
|
Member
Posts: 22
Karma: 10
Join Date: Aug 2015
Device: Kobo Aura H2O
|
Thanks. I couldn't get that to work, even after moving it around in the recipe. I wonder whether the script is served different HTML from what I can see.
#4 |
|
creator of calibre
Posts: 45,592
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
If you want to check the HTML the script sees, simply save it inside preprocess_raw_html, something like:
Code:
def preprocess_raw_html(self, html, *a):
    # Dump the raw page to disk so it can be inspected by hand.
    with open('/path/to/somewhere/on/your/computer/file.html', 'wb') as f:
        f.write(html.encode('utf-8'))
    return html
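
Once the file is saved, one quick way to check whether the `md` divs really sit inside `side` or `spacer` wrappers is to walk the dump with the standard library parser. This is a rough sketch, assuming the `<div>` tags in the markup are properly balanced; `DivNestingReport` and the sample markup are hypothetical and not part of calibre.

```python
from html.parser import HTMLParser


class DivNestingReport(HTMLParser):
    """Record, for each <div class="md">, the classes of its enclosing divs."""

    def __init__(self):
        super().__init__()
        self.stack = []    # class sets of the currently open <div> elements
        self.report = []   # one entry per md div: sorted ancestor classes

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        cls = frozenset((dict(attrs).get('class') or '').split())
        if 'md' in cls:
            ancestors = set().union(*self.stack)
            self.report.append(sorted(ancestors))
        self.stack.append(cls)

    def handle_endtag(self, tag):
        if tag == 'div' and self.stack:
            self.stack.pop()


# Hypothetical sample; in practice, read the file written by the recipe.
sample = '''
<div class="side"><div class="spacer"><div class="md">sidebar text</div></div></div>
<div class="entry unvoted"><div class="md">story text</div></div>
'''
parser = DivNestingReport()
parser.feed(sample)
for ancestors in parser.report:
    unwanted = {'side', 'spacer'} & set(ancestors)
    print(ancestors, '-> unwanted' if unwanted else '-> wanted')
```

If every `md` div in the real dump is reported with `side` or `spacer` among its ancestors only where expected, the page structure matches what the browser shows and the problem lies in how the rules are applied.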
#5 |
|
Member
Posts: 22
Karma: 10
Join Date: Aug 2015
Device: Kobo Aura H2O
|
Thanks, I did that, and the HTML from the log looks correct, as does the pattern. Very odd. I tried experimenting with the order of the keep/remove tag rules to see if that made any difference, but no.
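
One possible explanation, offered as an assumption rather than a confirmed diagnosis: if calibre applies `keep_only_tags` before `remove_tags` and `preprocess_html`, then every `md` div, including those nested in `side`, is lifted out first, and by the time the removal rules run there is no `side` wrapper left to match. The stdlib-only sketch below imitates that ordering on a made-up, well-formed snippet; `keep_only`, `remove`, and `texts` are hypothetical helpers, not calibre APIs.

```python
import xml.etree.ElementTree as ET

# Made-up markup: one md div inside the sidebar, one inside the post.
SAMPLE = (
    '<body>'
    '<div class="side"><div class="md">sidebar text</div></div>'
    '<div class="entry"><div class="md">story text</div></div>'
    '</body>'
)


def has_class(el, name):
    return name in (el.get('class') or '').split()


def remove(body, name):
    """Drop every div carrying class `name`, together with its children."""
    for parent in list(body.iter()):
        for child in list(parent):
            if child.tag == 'div' and has_class(child, name):
                parent.remove(child)


def keep_only(body, name):
    """Return a new body holding only the divs with class `name`."""
    new_body = ET.Element('body')
    new_body.extend([el for el in body.iter('div') if has_class(el, name)])
    return new_body


def texts(body):
    return [el.text for el in body.iter('div') if has_class(el, 'md')]


# Keep first, then remove: the side wrapper is already gone, nothing is removed.
body = keep_only(ET.fromstring(SAMPLE), 'md')
remove(body, 'side')
keep_then_remove = texts(body)

# Remove first, then keep: the nested md div disappears with its wrapper.
body = ET.fromstring(SAMPLE)
remove(body, 'side')
remove_then_keep = texts(keep_only(body, 'md'))

print(keep_then_remove)
print(remove_then_keep)
```

If that is what happens here, narrowing `keep_only_tags` so that it no longer matches the `md` divs inside the sidebar would be the thing to try.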