05-28-2011, 10:23 PM   #1
Razzia
Device: Amazon Kindle 3
Remove <br /> together with span, and only span

I am working on a recipe for the Danish IT news site version2.dk; the code is below in the spoiler. It's almost done: I have removed all the items that shouldn't be there, but a few nitpicks remain.

The articles contained some links to related articles. I removed those, but that leaves a rather large gap between two segments.

To illustrate:
Quote:
Text piece

Text piece

link

Text piece
With the link removed, it now looks like this:
Quote:
Text piece

Text piece



Text piece
How do I remove the <br /> tags that belong to the link, but not the ones between the text pieces?
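
One direction that might work, as a rough sketch only: match the span and the <br /> tags directly after it in a single regular expression, so that breaks elsewhere are left alone. This assumes the link sits in a `<span class="article-link-id">` that is immediately followed by its `<br />` tags, which is a guess at the page markup. `LINK_WITH_BREAKS`, `strip_links_with_breaks` and the sample string are made-up names and data for illustration.

```python
import re

# ASSUMPTION: each related-article link is wrapped in
# <span class="article-link-id">...</span> and is directly followed by
# one or more <br /> tags. The real version2.dk markup may differ.
# Note: '.*?</span>' stops at the first closing tag, so this would not
# handle a span nested inside the link span.
LINK_WITH_BREAKS = re.compile(
    r'<span[^>]*class="article-link-id"[^>]*>.*?</span>(?:\s*<br\s*/?>)+',
    re.DOTALL | re.IGNORECASE,
)


def strip_links_with_breaks(html):
    """Remove each link span together with the <br /> tags right after it."""
    return LINK_WITH_BREAKS.sub('', html)


# Hypothetical sample mirroring the "Text piece / link / Text piece" layout.
sample = (
    'Text piece<br /><br />'
    'Text piece<br /><br />'
    '<span class="article-link-id"><a href="#">link</a></span><br /><br />'
    'Text piece'
)
cleaned = strip_links_with_breaks(sample)
print(cleaned)
```

In a recipe, the same compiled pattern could presumably be plugged in through calibre's `preprocess_regexps` attribute, e.g. `preprocess_regexps = [(LINK_WITH_BREAKS, lambda match: '')]`, instead of listing the span in `remove_tags`. That has not been tested against the real site.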

Spoiler:

__license__ = 'GPL v3'
__copyright__ = '2011, Rasmus Lauritsen <rasmus at lauritsen.info>'
'''
version2.dk
'''

from calibre.web.feeds.news import BasicNewsRecipe

class version2(BasicNewsRecipe):
    title = 'Version2.dk'
    __author__ = 'Rasmus Lauritsen'
    description = 'IT News'
    publisher = 'version2.dk'
    category = 'news, IT, hardware, software, Denmark'
    oldest_article = 14
    max_articles_per_feed = 50
    no_stylesheets = True
    remove_empty_feeds = True
    use_embedded_content = False
    encoding = 'iso-8859-1'
    language = 'da'

    feeds = [
        (u'Seneste nyheder' , u'http://www.version2.dk/feeds/nyheder')
        ,(u'Forretningssoftware' , u'http://www.version2.dk/feeds/forretningssoftware')
        ,(u'Internet & styresystemer' , u'http://www.version2.dk/feeds/styresystemer')
        ,(u'It-arkitektur' , u'http://www.version2.dk/feeds/it-arkitektur')
        ,(u'It-styring & outsourcing' , u'http://www.version2.dk/feeds/it-styring')
        ,(u'Job & karriere' , u'http://www.version2.dk/feeds/karriere')
        ,(u'Mobil it & tele' , u'http://www.version2.dk/feeds/tele')
        ,(u'Server/storage & netværk' , u'http://www.version2.dk/feeds/server-storage')
        ,(u'Sikkerhed' , u'http://www.version2.dk/feeds/sikkerhed')
        ,(u'Softwareudvikling' , u'http://www.version2.dk/feeds/softwareudvikling')
    ]

    keep_only_tags = [dict(name='div', attrs={'class':'article'})]
    remove_tags = [
        dict(name='p',attrs={'class':'meta links'}),
        dict(name='div',attrs={'class':'float-right'}),
        dict(name='span',attrs={'class':'article-link-id'})
    ]

    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup