Old 03-04-2011, 06:15 PM   #1
mufc
Remove hyperlink properties from inside <i> etc

I know this small piece of code for converting all links to text:

def preprocess_html(self, soup):
    for alink in soup.findAll('a'):
        if alink.string is not None:
            tstr = alink.string
            alink.replaceWith(tstr)
    return soup

But how do you convert links to text when they sit inside tags such as 'h2', 'strong', 'i', etc.?

<h2>
<a href="http://www.filmcritic.com/reviews/in-theaters">In Theaters</a>
</h2>

Or like this:

<a href="http://www.filmcritic.com/reviews/1937/snow-white-and-the-seven-dwarfs/"><i>Snow White</i></a>
Old 03-07-2011, 01:09 AM   #2
thearr
Quote:
Originally Posted by mufc
But how do you convert links to text when they sit inside tags such as 'h2', 'strong', 'i', etc.?

<h2>
<a href="http://www.filmcritic.com/reviews/in-theaters">In Theaters</a>
</h2>
I use regular expressions. If you just want to remove the hyperlink properties, you can do it like this, for example:
Code:
preprocess_regexps = [
        (re.compile(r'<a.*?>'), lambda h1: ''),
        (re.compile(r'</a>'), lambda h2: '')]

<h2>
In Theaters
</h2>
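
To see what these two substitutions do to both snippets from the question, here is a minimal sketch that runs outside calibre with only Python's re module. The apply_regexps helper is hypothetical: it simply runs each (pattern, function) pair in order, which is my assumption about how preprocess_regexps gets applied. The stricter pattern <a\b[^>]*> is also my own variation; a plain <a.*?> would match other tags that start with "a", such as <abbr>.

```python
import re

# Strip opening and closing anchor tags, keeping whatever is inside them.
# \b stops the first pattern from matching other tags that start with "a".
strip_link_regexps = [
    (re.compile(r'<a\b[^>]*>', re.IGNORECASE), lambda m: ''),
    (re.compile(r'</a>', re.IGNORECASE), lambda m: ''),
]


def apply_regexps(html, regexps):
    """Hypothetical helper: run each (pattern, function) pair in order."""
    for pattern, func in regexps:
        html = pattern.sub(func, html)
    return html


# The two snippets from the question.
heading = ('<h2>\n'
           '<a href="http://www.filmcritic.com/reviews/in-theaters">In Theaters</a>\n'
           '</h2>')
italic = ('<a href="http://www.filmcritic.com/reviews/1937/'
          'snow-white-and-the-seven-dwarfs/"><i>Snow White</i></a>')

print(apply_regexps(heading, strip_link_regexps))
print(apply_regexps(italic, strip_link_regexps))  # <i>Snow White</i>
```

Because the regexps only delete the <a ...> and </a> tags themselves, it does not matter whether the link is inside another tag or contains one: the <i> in the second snippet is left in place.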
If you want to preserve the link as text, you can do something like this:
Code:
preprocess_regexps = [
        (re.compile(r'(<a href=")([^"]+)(">)(.*?)(</a>)'),
           lambda h: '%s (%s)' % (h.group(4), h.group(2)))]

<h2>
In Theaters (http://www.filmcritic.com/reviews/in-theaters)
</h2>
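
A quick way to check the link-preserving idea is to run it outside calibre. The sketch below is an illustration under my own assumptions, not necessarily what a recipe needs: link_to_text is a hypothetical helper, the pattern tolerates other attributes around href, and the link body is matched non-greedily so that two links on the same line are handled separately.

```python
import re

# Group 1 = the href value, group 2 = everything between <a ...> and </a>.
LINK = re.compile(r'<a\b[^>]*?\bhref="([^"]+)"[^>]*>(.*?)</a>',
                  re.IGNORECASE | re.DOTALL)


def link_to_text(html):
    """Hypothetical helper: replace each link with 'body (url)'."""
    return LINK.sub(lambda m: '%s (%s)' % (m.group(2), m.group(1)), html)


print(link_to_text(
    '<h2><a href="http://www.filmcritic.com/reviews/in-theaters">'
    'In Theaters</a></h2>'))
# <h2>In Theaters (http://www.filmcritic.com/reviews/in-theaters)</h2>

print(link_to_text(
    '<a href="http://www.filmcritic.com/reviews/1937/'
    'snow-white-and-the-seven-dwarfs/"><i>Snow White</i></a>'))
# <i>Snow White</i> (http://www.filmcritic.com/reviews/1937/snow-white-and-the-seven-dwarfs/)
```

Links whose href is written with single quotes or no quotes would not match this pattern and would be left untouched.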
Here is my recipe for the DogHouseDiaries webcomic, which uses something similar to what I wrote above:
Spoiler:
import re
from calibre.web.feeds.news import BasicNewsRecipe

class DogHouse(BasicNewsRecipe):
    title = 'DOGHOUSEDIARIES'
    description = 'The comic is simply a commentary on the love life of sandwiches by Ray, Raf & Will'
    __author__ = 'thearr'
    language = 'en'

    use_embedded_content = False
    oldest_article = 0
    max_articles_per_feed = 4
    keep_only_tags = [dict(name='div', attrs={'class':'object'}),
                      dict(name='div', attrs={'class':'entry'})]
    remove_tags = [dict(name='div', attrs={'class':'sociable'})]
    no_stylesheets = True

    preprocess_regexps = [
        # Drop the opening and closing link tags, keeping the link text
        (re.compile(r'<a.*?>'),
         lambda h1: ''),
        (re.compile(r'</a>'),
         lambda h2: ''),
        # Empty the image's title attribute and repeat its text as a grey caption
        (re.compile(r'(<img.*title=")([^"]+)(".*>)'),
         lambda m: '%s%s<p><i><font color="grey">%s</font></i></p>' % (m.group(1), m.group(3), m.group(2)))
    ]

    def parse_index(self):
        soup = self.index_to_soup('http://www.thedoghousediaries.com/?p=34')
        book = []
        # Each <option> with a non-zero value points to one comic page
        for comic in soup.findAll('option'):
            if comic['value'] != '0':
                book.append({
                    'date': '2011-02-15',
                    'url': comic['value'],
                    'title': self.tag_to_string(comic),
                    'description': '',
                    'content': ''
                })
        return [('DOGHOUSEDIARIES', book)]
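
The third rule in the recipe's preprocess_regexps appears to move each image's title text into a grey caption under the image. Here is a small standalone sketch of that idea, runnable without calibre. The title_to_caption helper and the sample markup are made up for illustration, and I used [^>]* in place of .* so that each match stays inside a single <img> tag.

```python
import re

# Group 1 = the tag up to and including title=", group 2 = the title text,
# group 3 = the rest of the tag. [^>]* keeps the match inside one <img> tag.
IMG_TITLE = re.compile(r'(<img\b[^>]*\btitle=")([^"]+)("[^>]*>)', re.IGNORECASE)


def title_to_caption(html):
    """Hypothetical helper: empty the title attribute and repeat its text below the image."""
    return IMG_TITLE.sub(
        lambda m: '%s%s<p><i><font color="grey">%s</font></i></p>'
                  % (m.group(1), m.group(3), m.group(2)),
        html)


# Made-up sample markup, not taken from the site.
sample = '<img src="comic.png" title="Hover text" alt="comic">'
print(title_to_caption(sample))
# <img src="comic.png" title="" alt="comic"><p><i><font color="grey">Hover text</font></i></p>
```

Images without a title attribute do not match and pass through unchanged.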