12-12-2010, 03:09 AM   #1
BuzzKill
Partial Feeds and Using Info from XML content

Hi,

I am not sure whether this has been asked before; if so, I couldn't find it. I am trying to download feeds from http://www.sciencebasedmedicine.org/, and my recipe is as follows:

Code:
#!/usr/bin/env  python

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag

class SBM(BasicNewsRecipe):
    title                 = 'Science Based Medicine'
    __author__            = 'Multiple Authors'
    oldest_article        = 5
    max_articles_per_feed = 15
    no_stylesheets        = True
    use_embedded_content  = False
    encoding              = 'utf-8'
    publisher             = 'SBM'
    category              = 'science, sbm, ebm, blog'
    language              = 'en'

    lang                  = 'en-US'

    conversion_options = {
                          'tags'             : category
                        , 'publisher'        : publisher
                        , 'language'         : lang
                        , 'pretty_print'     : True
                        }

    keep_only_tags = [dict(name='div', attrs={'class':'entry'})]

    feeds = [(u'Science Based Medicine', u'http://www.sciencebasedmedicine.org/?feed=rss2')]

    def preprocess_html(self, soup):
        # Declare the charset with a <meta http-equiv> tag; the attribute name is 'content'
        mtag = Tag(soup,'meta',[('http-equiv','Content-Type'),('content','text/html; charset=utf-8')])
        soup.head.insert(0,mtag)
        soup.html['lang'] = self.lang
        return self.adeify_images(soup)
I got this code by looking at other recipes; I am by no means well versed in Python. Although this works for extracting the full post contents, I want to add one more piece of information at the beginning of each post: the author of the post. The XML source of the feed has a line like this, which gives the author:

Code:
  <dc:creator>Kimball Atwood</dc:creator>
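To show what I mean, here is a rough sketch of reading that element with plain Python (the standard library only, not calibre's own feed handling). The function name `creators_by_link` and the sample feed text are made up for illustration; I am assuming the feed declares the usual Dublin Core namespace for the `dc:` prefix.

```python
import xml.etree.ElementTree as ET

# Assumption: the feed maps the "dc" prefix to the standard Dublin Core namespace.
DC_NS = {'dc': 'http://purl.org/dc/elements/1.1/'}

def creators_by_link(rss_text):
    """Map each item's <link> to the text of its <dc:creator> (hypothetical helper)."""
    root = ET.fromstring(rss_text)
    result = {}
    for item in root.iter('item'):
        link = item.findtext('link')
        creator = item.findtext('dc:creator', namespaces=DC_NS)
        # Skip items that lack either piece of information.
        if link and creator:
            result[link.strip()] = creator.strip()
    return result

# Hand-written sample in the shape of an RSS 2.0 feed, not the real feed.
sample_rss = '''<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <link>http://www.sciencebasedmedicine.org/?p=8874</link>
      <dc:creator>Kimball Atwood</dc:creator>
    </item>
  </channel>
</rss>'''

print(creators_by_link(sample_rss))
```

This only shows that the name can be read from the feed; how to hand it over to the recipe is the part I am unsure about.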
Is it possible to add this info to the post itself? If not, how can I extract it from the post's HTML? For example, at http://www.sciencebasedmedicine.org/?p=8874, the markup that mentions the author starts like this:

Code:
<div class="meta">
            Published by <a href=
            "http://www.sciencebasedmedicine.org/?author=6" title=
            "Posts by Kimball Atwood">Kimball Atwood</a> under 
.....
The "div" tag does not close until after a lot of useless info (categories, etc.), and I only want the author's name.
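The kind of extraction I have in mind could look roughly like the sketch below, written in plain Python rather than against calibre's API. The name `extract_author` is hypothetical, and I am assuming the author link always carries a title of the form "Posts by ...", which I have only checked on the one post quoted above.

```python
import re

# Assumption: the author link has title="Posts by <name>"; whitespace or
# line breaks may appear around the '=' as in the snippet above.
AUTHOR_RE = re.compile(r'title\s*=\s*"Posts by ([^"]+)"')

def extract_author(html):
    """Return the author's name from the post HTML, or None if not found."""
    match = AUTHOR_RE.search(html)
    return match.group(1).strip() if match else None

# The fragment quoted above, cut off where the original snippet stops.
sample = '''<div class="meta">
            Published by <a href=
            "http://www.sciencebasedmedicine.org/?author=6" title=
            "Posts by Kimball Atwood">Kimball Atwood</a> under
'''

print(extract_author(sample))  # prints: Kimball Atwood
```

Matching on the link's title avoids having to find where the "div" ends, but it would break if the site ever changed that title text.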

Any clue would be much appreciated.

BuzzKill