MobileRead Forums

MobileRead Forums (https://www.mobileread.com/forums/index.php)
-   Recipes (https://www.mobileread.com/forums/forumdisplay.php?f=228)
-   -   Partial Feeds and Using Info from XML content (https://www.mobileread.com/forums/showthread.php?t=110874)

BuzzKill 12-12-2010 04:09 AM

Partial Feeds and Using Info from XML content
 
Hi,

I am not sure whether this has been asked before; if so, I couldn't find it. I am trying to download feeds from http://www.sciencebasedmedicine.org/, and my recipe is as follows:

Code:

#!/usr/bin/env  python

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag

class SBM(BasicNewsRecipe):
    title                = 'Science Based Medicine'
    __author__            = 'Multiple Authors'
    oldest_article        = 5
    max_articles_per_feed = 15
    no_stylesheets        = True
    use_embedded_content  = False
    encoding              = 'utf-8'
    publisher            = 'SBM'
    category              = 'science, sbm, ebm, blog'
    language              = 'en'

    lang                  = 'en-US'

    conversion_options = {
                          'tags'            : category
                        , 'publisher'        : publisher
                        , 'language'        : lang
                        , 'pretty_print'    : True
                        }

    keep_only_tags = [dict(name='div', attrs={'class':'entry'})]

    feeds = [(u'Science Based Medicine', u'http://www.sciencebasedmedicine.org/?feed=rss2')]

    def preprocess_html(self, soup):
        # The meta attribute that carries the charset is 'content' (not 'context').
        mtag = Tag(soup,'meta',[('http-equiv','Content-Type'),('content','text/html; charset=utf-8')])
        soup.head.insert(0,mtag)
        soup.html['lang'] = self.lang
        return self.adeify_images(soup)

I got this code by looking at other recipes; I am by no means well versed in Python. Although this works for extracting the full post contents, I want to add one more bit of info at the beginning of each post: the author of the post. The XML source of the feed has a line like this, which gives the author:

Code:

  <dc:creator>Kimball Atwood</dc:creator>

Is it possible to add this info to the post itself? If not, how can I extract it from the web page instead? For example, at http://www.sciencebasedmedicine.org/?p=8874, the code that mentions the author starts like this:

Code:

<div class="meta">
            Published by <a href=
            "http://www.sciencebasedmedicine.org/?author=6" title=
            "Posts by Kimball Atwood">Kimball Atwood</a> under
.....

The "div" tag does not close until after a lot of useless info (categories, etc.), and I only want the author's name.

Any clue would be much appreciated.

BuzzKill
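
For the first half of the question (reading the author out of the feed XML), here is a minimal standalone sketch using only the Python standard library. It is not a calibre recipe and does not show how to wire the result into one. The sample feed is hypothetical and trimmed down; only the dc:creator line and the post URL come from the post above, and the helper name authors_by_link is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, which RSS 2.0 feeds commonly use for <dc:creator>.
DC_NS = {'dc': 'http://purl.org/dc/elements/1.1/'}

# Hypothetical, trimmed-down feed; the item title is a placeholder.
SAMPLE_FEED = '''<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Science Based Medicine</title>
    <item>
      <title>Example post</title>
      <link>http://www.sciencebasedmedicine.org/?p=8874</link>
      <dc:creator>Kimball Atwood</dc:creator>
    </item>
  </channel>
</rss>'''

def authors_by_link(feed_xml):
    """Map each item's <link> text to its <dc:creator> text (None if absent)."""
    root = ET.fromstring(feed_xml)
    result = {}
    for item in root.iter('item'):
        link = item.findtext('link')
        # The 'dc:' prefix is resolved through the namespaces mapping.
        result[link] = item.findtext('dc:creator', default=None, namespaces=DC_NS)
    return result

print(authors_by_link(SAMPLE_FEED))
```

The point of the sketch is the namespace handling: dc:creator is not a plain tag name, so the lookup needs the Dublin Core namespace URI declared on the rss element.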

Starson17 12-12-2010 10:18 AM

Quote:

Originally Posted by BuzzKill (Post 1267105)
Although, this works at extracting the full post contents, I want to add another bit of info at the beginning of each post: The author of the post.

It's already in the post; you're removing it with your "keep_only_tags" line.

If you don't like the additional stuff in the div tag, you could keep the name by keeping only the <a> tag with the "Posts by" title using this:

Code:

    keep_only_tags = [
                      dict(name='a', attrs={'title':re.compile(r'Posts by.*', re.DOTALL|re.IGNORECASE)}),
                      dict(name='div', attrs={'class':'entry'})
                      ]

I used a regex, so don't forget to add this at the top of the recipe:

Code:

import re
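
To see what that pattern selects, here is a rough standalone sketch that applies the same regex to title attributes using the standard library's HTMLParser rather than calibre's BeautifulSoup. The class name AuthorLinkFinder and the second link in the sample are made up for illustration; the first link is modelled on the markup quoted in the first post. It uses search() on the attribute value, which is my assumption about how the soup matcher applies a compiled pattern.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the keep_only_tags entry above.
AUTHOR_TITLE = re.compile(r'Posts by.*', re.DOTALL | re.IGNORECASE)

class AuthorLinkFinder(HTMLParser):
    """Collect the text of every <a> whose title attribute matches AUTHOR_TITLE."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._in_author_link = False

    def handle_starttag(self, tag, attrs):
        title = dict(attrs).get('title') or ''
        if tag == 'a' and AUTHOR_TITLE.search(title):
            self._in_author_link = True
            self.authors.append('')

    def handle_data(self, data):
        # Only text inside a matching <a> is kept.
        if self._in_author_link:
            self.authors[-1] += data

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_author_link = False

# Hypothetical sample; the category link is invented to show a non-match.
SAMPLE = '''<div class="meta">
Published by <a href="http://www.sciencebasedmedicine.org/?author=6"
title="Posts by Kimball Atwood">Kimball Atwood</a> under
<a href="#" title="View all posts in Some Category">Some Category</a>
</div>'''

finder = AuthorLinkFinder()
finder.feed(SAMPLE)
finder.close()
print(finder.authors)
```

Only the link whose title contains "Posts by" is collected; the category link is skipped, which is why the recipe keeps the author name without the rest of the meta div.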

BuzzKill 12-12-2010 10:44 AM

Starson17,

Thank you very much for the answer. That did it. I knew regular expressions could be used, but I just don't understand them yet.

Starson17 12-12-2010 11:05 AM

Quote:

Originally Posted by BuzzKill (Post 1267504)
Starson17,

Thank you very much for the answer. That did it. I knew regular expressions could be used, but I just don't understand them yet.

When your recipe is done, you should submit it here. I enjoyed reading some of the posts. (I needed to see the page to understand your problem.)

