Old 05-28-2011, 10:23 PM   #1
Razzia
Junior Member
Razzia began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Apr 2011
Device: Amazon Kindle 3
Remove <br /> together with span, and only span

I am working on a recipe for the Danish IT news site version2.dk; the code is in the spoiler below. It's almost done: I have removed all the items that shouldn't be there, but a few nitpicks remain.

The articles contained some links to related articles. I removed those, but that leaves a rather large gap between two segments.

To illustrate:
Quote:
Text piece

Text piece

link

Text piece
Having removed the link it now looks like this:
Quote:
Text piece

Text piece



Text piece
How do I remove the <br /> tags that go with the link, but not the ones between the text pieces?

Spoiler:

__license__ = 'GPL v3'
__copyright__ = '2011, Rasmus Lauritsen <rasmus at lauritsen.info>'
'''
version2.dk
'''

from calibre.web.feeds.news import BasicNewsRecipe

class version2(BasicNewsRecipe):
    title = 'Version2.dk'
    __author__ = 'Rasmus Lauritsen'
    description = 'IT News'
    publisher = 'version2.dk'
    category = 'news, IT, hardware, software, Denmark'
    oldest_article = 14
    max_articles_per_feed = 50
    no_stylesheets = True
    remove_empty_feeds = True
    use_embedded_content = False
    encoding = 'iso-8859-1'
    language = 'da'

    feeds = [
        (u'Seneste nyheder' , u'http://www.version2.dk/feeds/nyheder')
        ,(u'Forretningssoftware' , u'http://www.version2.dk/feeds/forretningssoftware')
        ,(u'Internet & styresystemer' , u'http://www.version2.dk/feeds/styresystemer')
        ,(u'It-arkitektur' , u'http://www.version2.dk/feeds/it-arkitektur')
        ,(u'It-styring & outsourcing' , u'http://www.version2.dk/feeds/it-styring')
        ,(u'Job & karriere' , u'http://www.version2.dk/feeds/karriere')
        ,(u'Mobil it & tele' , u'http://www.version2.dk/feeds/tele')
        ,(u'Server/storage & netværk' , u'http://www.version2.dk/feeds/server-storage')
        ,(u'Sikkerhed' , u'http://www.version2.dk/feeds/sikkerhed')
        ,(u'Softwareudvikling' , u'http://www.version2.dk/feeds/softwareudvikling')
    ]

    keep_only_tags = [dict(name='div', attrs={'class':'article'})]
    remove_tags = [
        dict(name='p',attrs={'class':'meta links'}),
        dict(name='div',attrs={'class':'float-right'}),
        dict(name='span',attrs={'class':'article-link-id'})
    ]

    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup
Old 05-29-2011, 10:41 AM   #2
Bonex
Connoisseur
Bonex began at the beginning.
 
Posts: 63
Karma: 10
Join Date: Oct 2010
Device: KDXG, Kobo Glo, Kobo Aura HD
Add the preprocess_regexps option:

Code:
  # Needs "import re" at the top of the recipe file.
  preprocess_regexps = [ (re.compile(r'</?a[^>]*>'),lambda match: ''),
                         (re.compile(r'<span[^>]*article-link-id.*?<br\s*\/?><br\s*\/?>'), lambda match: '')]

  keep_only_tags = [dict(name='div', attrs={'class':'article'})]

  remove_tags = [
   dict(name='p',attrs={'class':'meta links'}),
   dict(name='div',attrs={'class':'float-right'}),
   #dict(name='span',attrs={'class':'article-link-id'})
  ]

  feeds = [
The first one removes all <a> and </a> tags while leaving the text inside, which I think is what you wanted to do with the preprocess_html function. The second, uglier one removes every <span class="article-link-id">blabla</span> together with the two <br /> tags that follow it, which is why the span entry in remove_tags is commented out.
If you want a suggestion, you can add an extra_css option to tweak the final appearance of the article when displayed.
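The two patterns from the snippet above can be tried outside calibre before putting them in the recipe. This is only a minimal sketch: the sample HTML fragment is made up for illustration, and the real version2.dk markup may be laid out differently.

```python
import re

# The same two patterns as in the preprocess_regexps snippet above.
strip_anchor_tags = re.compile(r'</?a[^>]*>')
strip_link_span = re.compile(r'<span[^>]*article-link-id.*?<br\s*\/?><br\s*\/?>')

# Hypothetical article fragment, invented for this example only.
sample = (
    'Text piece<br /><br />'
    '<span class="article-link-id"><a href="/x">link</a></span><br /><br />'
    'Text piece'
)

# Apply the patterns in order, each match being replaced by an empty string.
cleaned = sample
for pattern in (strip_anchor_tags, strip_link_span):
    cleaned = pattern.sub('', cleaned)

print(cleaned)
```

On this fragment only the <br /> pair between the two text pieces is left. Note that `.` does not match a newline by default, so if the span and its <br /> tags sit on separate lines in the real page, the second pattern would need the `re.DOTALL` flag to match.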
Old 05-29-2011, 06:51 PM   #3
Razzia
Thank you very much!

I just added the following extra_css (which I borrowed from The New Yorker recipe), but I really can't see any difference on my Kindle.

Code:
    extra_css             = """
                                body {font-family: "Times New Roman",Times,serif}
                                .articleauthor{color: #9F9F9F; 
                                               font-family: Arial, sans-serif;
                                               font-size: small; 
                                               text-transform: uppercase}
                                .rubric,.dd,h6#credit{color: #CD0021;
                                        font-family: Arial, sans-serif;
                                        font-size: small;
                                        text-transform: uppercase}
                                .descender:first-letter{display: inline; font-size: xx-large; font-weight: bold}
                                .dd,h6#credit{color: gray}
                                .c{display: block}
                                .caption,h2#articleintro{font-style: italic}
                                .caption{font-size: small}
                            """

Last edited by Razzia; 05-29-2011 at 07:01 PM.
Old 05-30-2011, 06:55 PM   #4
Bonex
Quote:
Originally Posted by Razzia View Post
Thank you very much!

I just added the following extra_css (which I borrowed from The New Yorker recipe), but I really can't see any difference on my Kindle.
That's because that CSS defines styles for class names that differ from the ones used in the pages your recipe downloads.

Try running your recipe with ebook-convert as explained here, then open one of the pages in /debug/input with a text editor (like Notepad++) to see what the HTML looks like after your recipe has cleaned it.
You have to write CSS rules for the classes or elements as they are named there.
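Instead of reading the cleaned HTML by eye, the class names can also be listed with a few lines of Python. This is only a rough sketch using the standard library: `ClassCollector` and `collect_classes` are names made up here, and the sample markup is hypothetical. In practice you would feed in the text of one of the files from the debug input folder.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Counts every (tag, class name) pair seen in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.classes = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'class' and value:
                # A class attribute may hold several space-separated names.
                for cls in value.split():
                    self.classes[(tag, cls)] += 1


def collect_classes(html_text):
    """Return a Counter of (tag, class) pairs found in html_text."""
    parser = ClassCollector()
    parser.feed(html_text)
    parser.close()
    return parser.classes


# Hypothetical cleaned page, standing in for a file from the debug folder.
sample = (
    '<div class="article"><h1 class="title">Headline</h1>'
    '<p class="intro lead">Intro</p><p class="intro">More</p></div>'
)

found = collect_classes(sample)
for (tag, cls), count in sorted(found.items()):
    print(tag, cls, count)
```

Whatever tag and class pairs show up in the real output are the selectors that extra_css should target; selectors copied from another recipe only help if they happen to match.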
All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.