Old 05-29-2011, 10:41 AM   #2
Bonex
Add the preprocess_regexps option:

Code:
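  # needs "import re" at the top of the recipe file
  preprocess_regexps = [
      # strip <a ...> and </a> tags but keep the link text
      (re.compile(r'</?a[^>]*>'), lambda match: ''),
      # strip everything from the article-link-id span through the next pair of <br> tags
      (re.compile(r'<span[^>]*article-link-id.*?<br\s*/?><br\s*/?>'), lambda match: ''),
  ]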

  keep_only_tags = [dict(name='div', attrs={'class':'article'})]

  remove_tags = [
   dict(name='p',attrs={'class':'meta links'}),
   dict(name='div',attrs={'class':'float-right'}),
   #dict(name='span',attrs={'class':'article-link-id'})
  ]

  feeds = [
The first regexp removes all <a> and </a> tags while leaving the text inside them, which I think is what you wanted to do with the preprocess_html function. The second, uglier one removes every <span class="article-link-id">blabla</span> that is followed by two <br /> tags.
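If you want to check what the two regexps do before running the whole recipe, here is a small standalone sketch. The sample markup is made up (it is not taken from the real site), and the loop with pattern.sub is only my stand-in for how calibre applies the rules, so treat it as an approximation:

```python
import re

# The same two patterns as in the recipe snippet.
preprocess_regexps = [
    (re.compile(r'</?a[^>]*>'), lambda match: ''),
    (re.compile(r'<span[^>]*article-link-id.*?<br\s*/?><br\s*/?>'), lambda match: ''),
]

# Made-up sample markup, only for illustration.
sample = ('<p>Read <a href="http://example.com/x">the story</a> here.</p>'
          '<span class="article-link-id">blabla</span><br /><br/>'
          '<p>Next paragraph.</p>')

# Apply each rule in order, as a stand-in for calibre's preprocessing step.
cleaned = sample
for pattern, func in preprocess_regexps:
    cleaned = pattern.sub(func, cleaned)

print(cleaned)
```

One thing to watch: without the re.DOTALL flag, the dot does not match a newline, so the second pattern only works when the span and the two <br /> tags sit on the same line of the page source.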
One more suggestion: you can add an extra_css option to tweak the final appearance of the article when it is displayed.
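For example, something along these lines could go in the recipe class next to the other options. The selectors and values are placeholders I made up, so adjust them to whatever the article markup really contains:

```python
# Hypothetical styling; the selectors and values are placeholders.
# In a real recipe this assignment goes inside the recipe class body.
extra_css = '''
    h1 { font-size: 1.4em; font-weight: bold; }
    p { margin: 0.5em 0; text-align: justify; }
'''
```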