I am working on a recipe for the Danish IT news site version2.dk; the code is in the spoiler below. It's almost done. I have removed all the items that shouldn't be there, but a few nitpicks remain.
The articles contained some links to related articles. I removed those, but that leaves a rather large gap between two segments.
To illustrate:
Quote:
Text piece
Text piece
link
Text piece
With the link removed, it now looks like this:
Quote:
Text piece
Text piece
Text piece
How do I remove the <br /> tag together with the link, while keeping the <br /> tags between the text pieces?
Spoiler:
__license__ = 'GPL v3'
__copyright__ = '2011, Rasmus Lauritsen <rasmus at lauritsen.info>'
'''
version2.dk
'''
from calibre.web.feeds.news import BasicNewsRecipe

class version2(BasicNewsRecipe):
    title = 'Version2.dk'
    __author__ = 'Rasmus Lauritsen'
    description = 'IT News'
    publisher = 'version2.dk'
    category = 'news, IT, hardware, software, Denmark'
    oldest_article = 14
    max_articles_per_feed = 50
    no_stylesheets = True
    remove_empty_feeds = True
    use_embedded_content = False
    encoding = 'iso-8859-1'
    language = 'da'

    feeds = [
        (u'Seneste nyheder', u'http://www.version2.dk/feeds/nyheder'),
        (u'Forretningssoftware', u'http://www.version2.dk/feeds/forretningssoftware'),
        (u'Internet & styresystemer', u'http://www.version2.dk/feeds/styresystemer'),
        (u'It-arkitektur', u'http://www.version2.dk/feeds/it-arkitektur'),
        (u'It-styring & outsourcing', u'http://www.version2.dk/feeds/it-styring'),
        (u'Job & karriere', u'http://www.version2.dk/feeds/karriere'),
        (u'Mobil it & tele', u'http://www.version2.dk/feeds/tele'),
        (u'Server/storage & netværk', u'http://www.version2.dk/feeds/server-storage'),
        (u'Sikkerhed', u'http://www.version2.dk/feeds/sikkerhed'),
        (u'Softwareudvikling', u'http://www.version2.dk/feeds/softwareudvikling'),
    ]

    keep_only_tags = [dict(name='div', attrs={'class': 'article'})]

    remove_tags = [
        dict(name='p', attrs={'class': 'meta links'}),
        dict(name='div', attrs={'class': 'float-right'}),
        dict(name='span', attrs={'class': 'article-link-id'}),
    ]

    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup
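
One direction that might work, as a minimal sketch only: match the link together with the <br /> in front of it on the raw HTML, before the links are turned into plain text. It assumes that a related-article link always stands alone between two <br /> tags, which I have not verified against the actual version2.dk markup. The names RELATED_LINK and strip_related_links are made up for the illustration.

```python
import re

# Assumption: a related-article link sits alone between two <br /> tags.
# The pattern consumes the <br /> before the link and the link itself, and
# only looks ahead at the following <br /> so that one stays in place.
RELATED_LINK = re.compile(
    r'<br\s*/?>\s*<a\b[^>]*>(?:(?!</a>).)*</a>\s*(?=<br\s*/?>)',
    re.IGNORECASE | re.DOTALL,
)


def strip_related_links(html):
    """Remove each stand-alone link together with the <br /> before it."""
    return RELATED_LINK.sub('', html)


sample = ('Text piece<br />Text piece<br />'
          '<a href="/artikel/1">link</a><br />Text piece')
print(strip_related_links(sample))
# Text piece<br />Text piece<br />Text piece
```

Links that appear inside a sentence are not preceded by a <br />, so they are left alone and still get replaced by their text in preprocess_html. In the recipe, the same compiled pattern could presumably be handed to calibre through the preprocess_regexps attribute, e.g. preprocess_regexps = [(RELATED_LINK, lambda match: '')].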