Hi,
I am not sure whether this has been asked before; if it has, I couldn't find it. I am trying to download feeds from
http://www.sciencebasedmedicine.org/, and my recipe is as follows:
Code:
#!/usr/bin/env python
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag

class SBM(BasicNewsRecipe):
    title = 'Science Based Medicine'
    __author__ = 'Multiple Authors'
    oldest_article = 5
    max_articles_per_feed = 15
    no_stylesheets = True
    use_embedded_content = False
    encoding = 'utf-8'
    publisher = 'SBM'
    category = 'science, sbm, ebm, blog'
    language = 'en'
    lang = 'en-US'

    conversion_options = {
        'tags': category,
        'publisher': publisher,
        'language': lang,
        'pretty_print': True,
    }

    # Keep only the post body.
    keep_only_tags = [dict(name='div', attrs={'class': 'entry'})]

    feeds = [(u'Science Based Medicine', u'http://www.sciencebasedmedicine.org/?feed=rss2')]

    def preprocess_html(self, soup):
        # Declare the charset and language explicitly in the page head.
        # The attribute is 'content' (the original had the typo 'context').
        mtag = Tag(soup, 'meta', [('http-equiv', 'Content-Type'), ('content', 'text/html; charset=utf-8')])
        soup.head.insert(0, mtag)
        soup.html['lang'] = self.lang
        return self.adeify_images(soup)
I put this code together by looking at other recipes; I am by no means well versed in Python. It works for extracting the full post contents, but I want to add one more piece of information at the beginning of each post: the author of the post. The XML source of the feed has a line like this, which gives the author:
Code:
<dc:creator>Kimball Atwood</dc:creator>
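To show what I mean by reading that element, here is a rough sketch that runs outside calibre using only the standard library. It assumes the feed declares the usual Dublin Core namespace for the dc: prefix; the sample feed, its "Example post" title, and the helper name authors_by_link are made up for illustration, and this is not how calibre itself handles the feed.

```python
import xml.etree.ElementTree as ET

# Namespace URI conventionally bound to the dc: prefix in RSS 2.0 feeds
# (assumed here; check the xmlns:dc attribute of the real feed).
DC_NS = {'dc': 'http://purl.org/dc/elements/1.1/'}

# Hypothetical, trimmed-down feed; the real one carries many more fields.
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Science Based Medicine</title>
    <item>
      <title>Example post</title>
      <link>http://www.sciencebasedmedicine.org/?p=8874</link>
      <dc:creator>Kimball Atwood</dc:creator>
    </item>
  </channel>
</rss>"""


def authors_by_link(feed_xml):
    """Map each item's link to the text of its dc:creator element."""
    root = ET.fromstring(feed_xml)
    result = {}
    for item in root.iter('item'):
        link = item.findtext('link')
        creator = item.findtext('dc:creator', namespaces=DC_NS)
        # Skip items that lack either a link or an author.
        if link and creator:
            result[link.strip()] = creator.strip()
    return result


print(authors_by_link(SAMPLE_FEED))
```

A lookup table keyed by the post URL like this would, in principle, let the author be matched back to each downloaded article.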
Is it possible to add this information to the post itself? If not, how can I extract it from the post's own HTML? For example, at
http://www.sciencebasedmedicine.org/?p=8874, the markup that mentions the author starts like this:
Code:
<div class="meta">
Published by <a href=
"http://www.sciencebasedmedicine.org/?author=6" title=
"Posts by Kimball Atwood">Kimball Atwood</a> under
.....
The "div" tag does not close until after a lot of useless information (categories and so on), and I only want the author's name.
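As a rough sketch of the kind of extraction I am after, the snippet below pulls the name out of the "Posts by ..." title attribute with a regular expression, using only the standard library. It assumes every post uses that exact title wording; the names AUTHOR_RE and extract_author are hypothetical, and inside a recipe the lookup would more likely go through the soup object instead.

```python
import re

# Matches the author link's title attribute, e.g. title="Posts by Kimball Atwood".
# \s* tolerates the line break between 'title=' and the quoted value.
AUTHOR_RE = re.compile(r'title=\s*"Posts by ([^"]+)"')

# The opening of the meta block as it appears in the page source.
SAMPLE_HTML = '''<div class="meta">
Published by <a href=
"http://www.sciencebasedmedicine.org/?author=6" title=
"Posts by Kimball Atwood">Kimball Atwood</a> under
'''


def extract_author(html):
    """Return the author's name, or None if no 'Posts by' link is found."""
    match = AUTHOR_RE.search(html)
    return match.group(1).strip() if match else None


print(extract_author(SAMPLE_HTML))
```

Because it only looks at the first "Posts by" link, the categories and other clutter later in the same div are ignored.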
Any clue would be much appreciated.
BuzzKill