Old 03-07-2011, 01:09 AM   #2
thearr
Member
Posts: 22
Karma: 1756
Join Date: Jan 2011
Location: Moscow, RU
Device: Kindle3, iPhone4, iPad2
Quote:
Originally Posted by mufc
BUT how do you convert links to text that are hidden in 'h2', 'strong' 'i' etc

<h2>
<a href="http://www.filmcritic.com/reviews/in-theaters">In Theaters</a>
</h2>
I use regular expressions. If you just want to strip the hyperlink markup and keep the link text, you can do it like this, for example:
Code:
preprocess_regexps = [
        # \b stops the pattern from also matching tags like <abbr> or <article>
        (re.compile(r'<a\b[^>]*>'), lambda m: ''),
        (re.compile(r'</a>'), lambda m: '')]

<h2>
In Theaters
</h2>
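To try the idea outside calibre, here is a minimal sketch. The apply_rules helper is hypothetical: it just runs each (pattern, function) pair over the text in order, which I assume is close enough to how the recipe setting is used for a quick test. The \b form of the opening-tag pattern is also my assumption, so that tags such as <abbr> are left alone.

```python
import re

# Same shape as preprocess_regexps: (compiled pattern, replacement function).
rules = [
    (re.compile(r'<a\b[^>]*>'), lambda m: ''),  # opening <a ...> tag
    (re.compile(r'</a>'), lambda m: ''),        # closing </a> tag
]

def apply_rules(html, rules):
    """Hypothetical helper: apply each (pattern, function) pair in order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html

sample = ('<h2>\n'
          '<a href="http://www.filmcritic.com/reviews/in-theaters">In Theaters</a>\n'
          '</h2>')
print(apply_rules(sample, rules))
```

The link text stays in place and only the anchor tags around it are dropped.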
If you want to preserve the link target as text, you can do something like this:
Code:
preprocess_regexps = [
        # (.*?) is non-greedy so two links on one line are not merged into one match
        (re.compile(r'(<a href=")([^"]+)(">)(.*?)(</a>)'),
           lambda h: '%s (%s)' % (h.group(4), h.group(2)))]

<h2>
In Theaters (http://www.filmcritic.com/reviews/in-theaters)
</h2>
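A standalone sketch of the same substitution, for checking it on sample markup before putting it into a recipe. The function name keep_url_as_text is made up for this example, and the pattern is a slightly looser variant of the one above (it tolerates extra attributes after href), so treat it as an assumption rather than the exact recipe code.

```python
import re

# Group 1 is the URL, group 2 is the link text.
# re.DOTALL lets the link text span more than one line.
link_to_text = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.DOTALL)

def keep_url_as_text(html):
    """Replace <a href="URL">text</a> with 'text (URL)'."""
    return link_to_text.sub(lambda m: '%s (%s)' % (m.group(2), m.group(1)), html)

sample = ('<h2>\n'
          '<a href="http://www.filmcritic.com/reviews/in-theaters">In Theaters</a>\n'
          '</h2>')
print(keep_url_as_text(sample))
```

The non-greedy (.*?) matters when a line has several links: with a greedy (.*) the match would run from the first opening tag to the last closing tag.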
Here is my recipe for the DogHouseDiaries webcomic, which uses something similar to what I wrote above:
Spoiler:
import time, re
from calibre.web.feeds.news import BasicNewsRecipe

class DogHouse(BasicNewsRecipe):
    title = 'DOGHOUSEDIARIES'
    description = 'The comic is simply a commentary on the love life of sandwiches by Ray, Raf & Will'
    __author__ = 'thearr'
    language = 'en'

    use_embedded_content = False
    oldest_article = 0
    max_articles_per_feed = 4
    # Keep only these two containers from each page
    keep_only_tags = [dict(name='div', attrs={'class':'object'}),
                      dict(name='div', attrs={'class':'entry'})]
    # Drop the 'sociable' div
    remove_tags = [dict(name='div', attrs={'class':'sociable'})]
    no_stylesheets = True

    preprocess_regexps = [
        # Strip opening and closing anchor tags, keeping the link text
        (re.compile(r'<a\b[^>]*>'),
            lambda h1: ''),
        (re.compile(r'</a>'),
            lambda h2: ''),
        # Empty the image's title attribute and print the title below the image
        (re.compile(r'(<img.*title=")([^"]+)(".*>)'),
            lambda m: '%s%s<p><i><font color="grey">%s</font></i></p>' % (m.group(1), m.group(3), m.group(2)))
    ]

    def parse_index(self):
        soup = self.index_to_soup('http://www.thedoghousediaries.com/?p=34')
        book = []
        # Each <option> value is used as a comic URL; skip the option whose value is '0'
        for comic in soup.findAll('option'):
            if comic['value'] != '0':
                book.append({
                    'date': '2011-02-15',  # hard-coded date
                    'url': comic['value'],
                    'title': self.tag_to_string(comic),
                    'description': '',
                    'content': ''
                })
        return [('DOGHOUSEDIARIES', book)]
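The third regexp in the recipe is the least obvious one, so here is a small sketch of what it does on its own. The sample <img> tag and the title_to_caption name are made up for illustration; the pattern and the replacement string are the ones from the recipe.

```python
import re

# Group 1: everything up to and including title="
# Group 2: the title text
# Group 3: the closing quote and the rest of the tag
img_title = re.compile(r'(<img.*title=")([^"]+)(".*>)')

def title_to_caption(html):
    """Empty the title attribute and repeat its text as a grey italic caption."""
    return img_title.sub(
        lambda m: '%s%s<p><i><font color="grey">%s</font></i></p>'
                  % (m.group(1), m.group(3), m.group(2)),
        html)

sample = '<img src="comic.png" title="Hover text" alt="comic">'
print(title_to_caption(sample))
```

Both .* parts are greedy, so this assumes one <img> tag per line; with several images on a line the groups would stretch across them.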