Partial Feeds and Using Info from XML content

BuzzKill · 12-12-2010, 03:09 AM

Hi,

I am not sure whether this has been asked before; if it has, I couldn't find it. I am trying to download feeds from http://www.sciencebasedmedicine.org/, and my recipe is as follows:

Code:

#!/usr/bin/env python

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag

class SBM(BasicNewsRecipe):
    title                 = 'Science Based Medicine'
    __author__            = 'Multiple Authors'
    oldest_article        = 5
    max_articles_per_feed = 15
    no_stylesheets        = True
    use_embedded_content  = False
    encoding              = 'utf-8'
    publisher             = 'SBM'
    category              = 'science, sbm, ebm, blog'
    language              = 'en'

    lang                  = 'en-US'

    conversion_options = {
                          'tags'             : category
                        , 'publisher'        : publisher
                        , 'language'         : lang
                        , 'pretty_print'     : True
                        }

    keep_only_tags = [dict(name='div', attrs={'class':'entry'})]

    feeds = [(u'Science Based Medicine', u'http://www.sciencebasedmedicine.org/?feed=rss2')]

    def preprocess_html(self, soup):
        # The meta attribute carrying the MIME type and charset is 'content', not 'context'.
        mtag = Tag(soup,'meta',[('http-equiv','Content-Type'),('content','text/html; charset=utf-8')])
        soup.head.insert(0,mtag)
        soup.html['lang'] = self.lang
        return self.adeify_images(soup)

I got this code by looking at other recipes; I am by no means well versed in Python. Although this works for extracting the full post contents, I want to add one more bit of info at the beginning of each post: the author of the post. The XML source has a line like this, which gives the author:

Code:

  <dc:creator>Kimball Atwood</dc:creator>

Is it possible to add this info from the feed to the downloaded article? If not, how can I extract it from the post's page instead? For example, at http://www.sciencebasedmedicine.org/?p=8874, the code that mentions the author starts like this:

Code:

<div class="meta">
            Published by <a href=
            "http://www.sciencebasedmedicine.org/?author=6" title=
            "Posts by Kimball Atwood">Kimball Atwood</a> under 
.....

The "div" tag does not close until after a lot of useless info (categories, etc.), and I only want the author's name.

Any clue would be much appreciated.

BuzzKill

Starson17 · 12-12-2010, 09:18 AM

Quote:

Originally Posted by BuzzKill

Although this works for extracting the full post contents, I want to add one more bit of info at the beginning of each post: the author of the post.

It's already in the post; you're removing it with your "keep_only_tags" line.

If you don't like the additional stuff in the div tag, you could keep the name by keeping only the <a> tag with the "Posts by" title using this:

Code:

    keep_only_tags = [
                      dict(name='a', attrs={'title':re.compile(r'Posts by.*', re.DOTALL|re.IGNORECASE)}), 
                      dict(name='div', attrs={'class':'entry'})
                      ]

I used a regex, so don't forget to add this at the top of the recipe:

Code:

import re
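
If the regex looks opaque, here is a rough sketch of what it does, run outside calibre with only the standard library. The title strings other than "Posts by Kimball Atwood" are made up for illustration, and I'm assuming the tag matcher applies the compiled pattern to the attribute value with search(), which is how I understand BeautifulSoup to treat a regex in attrs.

Code:

```python
import re

# Same pattern as in the keep_only_tags entry above.
pattern = re.compile(r'Posts by.*', re.DOTALL | re.IGNORECASE)

# Title attribute values to try. Only the first one comes from the page;
# the other two are hypothetical examples.
titles = [
    'Posts by Kimball Atwood',   # the author link we want to keep
    'posts by someone else',     # still matches because of re.IGNORECASE
    'Permanent Link to a post',  # no "Posts by" in it, so it is not kept
]

# Assumption: the matcher calls pattern.search() on each attribute value.
matches = [bool(pattern.search(t)) for t in titles]

for title, matched in zip(titles, matches):
    print(title, '->', 'kept' if matched else 'dropped')
```

With search(), the trailing ".*" and re.DOTALL don't change which titles match in this sketch; the pattern is really just saying "the title contains 'Posts by', in any letter case".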

BuzzKill · 12-12-2010, 09:44 AM

Starson17,

Thank you very much for the answer. That did it. I knew regular expressions could be used, but I just don't understand them yet.

Starson17 · 12-12-2010, 10:05 AM

Quote:

Originally Posted by BuzzKill

Starson17,

Thank you very much for the answer. That did it. I knew regular expressions could be used, but I just don't understand them yet.

When your recipe is done, you should submit it here. I enjoyed reading some of the posts. (I needed to see the page to understand your problem.)
