Old 12-12-2010, 03:09 AM   #1
BuzzKill
Partial Feeds and Using Info from XML content

Hi,

I am not sure whether this has been asked before; if so, I couldn't find it. I am trying to download feeds from http://www.sciencebasedmedicine.org/, and my recipe is as follows:

Code:
#!/usr/bin/env  python

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag

class SBM(BasicNewsRecipe):
    title                 = 'Science Based Medicine'
    __author__            = 'Multiple Authors'
    oldest_article        = 5
    max_articles_per_feed = 15
    no_stylesheets        = True
    use_embedded_content  = False
    encoding              = 'utf-8'
    publisher             = 'SBM'
    category              = 'science, sbm, ebm, blog'
    language              = 'en'

    lang                  = 'en-US'

    conversion_options = {
                          'tags'             : category
                        , 'publisher'        : publisher
                        , 'language'         : lang
                        , 'pretty_print'     : True
                        }

    keep_only_tags = [dict(name='div', attrs={'class':'entry'})]

    feeds = [(u'Science Based Medicine', u'http://www.sciencebasedmedicine.org/?feed=rss2')]

    def preprocess_html(self, soup):
        # The meta tag's attribute is 'content' (not 'context'); it declares the page encoding
        mtag = Tag(soup,'meta',[('http-equiv','Content-Type'),('content','text/html; charset=utf-8')])
        soup.head.insert(0,mtag)
        soup.html['lang'] = self.lang
        return self.adeify_images(soup)
I put this code together by looking at other recipes; I am by no means well versed in Python. Although this works for extracting the full post contents, I want to add one more piece of information at the beginning of each post: the author. The XML source has a line like this, which gives the author of the post:

Code:
  <dc:creator>Kimball Atwood</dc:creator>
Is it possible to add this info to the post itself? If not, how can I extract it from the article page instead? For example, at http://www.sciencebasedmedicine.org/?p=8874, the markup that mentions the author starts like this:

Code:
<div class="meta">
            Published by <a href=
            "http://www.sciencebasedmedicine.org/?author=6" title=
            "Posts by Kimball Atwood">Kimball Atwood</a> under 
.....
The "div" tag does not close until after a lot of useless info (categories, etc.), and I only want the author's name.
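
As a rough illustration of the feed-side route, the <dc:creator> element can be read with a namespace-aware XML parser. This is a minimal standalone sketch using only Python's standard library, not calibre's own feed handling; the sample feed text and the `creators` helper name are made up for the example, and the namespace URI is assumed to be the usual Dublin Core one.

```python
import xml.etree.ElementTree as ET

# Dublin Core elements namespace, which the dc: prefix in RSS feeds normally maps to.
DC_NS = {'dc': 'http://purl.org/dc/elements/1.1/'}

# Hypothetical, trimmed-down feed; the real feed has many more fields.
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Science Based Medicine</title>
    <item>
      <title>Example post</title>
      <dc:creator>Kimball Atwood</dc:creator>
    </item>
  </channel>
</rss>"""

def creators(feed_xml):
    """Return (item title, dc:creator) pairs for every <item> in the feed."""
    root = ET.fromstring(feed_xml.encode('utf-8'))
    pairs = []
    for item in root.iter('item'):
        title = item.findtext('title', default='')
        # The dc: prefix is resolved through the DC_NS mapping
        author = item.findtext('dc:creator', default='', namespaces=DC_NS)
        pairs.append((title, author))
    return pairs

print(creators(SAMPLE_FEED))
```

This only shows where the author name lives in the feed; getting it into the downloaded article would still have to happen inside the recipe.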

Any clue would be much appreciated.

BuzzKill
Old 12-12-2010, 09:18 AM   #2
Starson17
Quote:
Originally Posted by BuzzKill View Post
Although, this works at extracting the full post contents, I want to add another bit of info at the beginning of each post: The author of the post.
It's already in the post; you're removing it with your "keep_only_tags" line.

If you don't like the additional stuff in the div tag, you could keep the name by keeping only the <a> tag with the "Posts by" title using this:
Code:
    keep_only_tags = [
                      dict(name='a', attrs={'title':re.compile(r'Posts by.*', re.DOTALL|re.IGNORECASE)}), 
                      dict(name='div', attrs={'class':'entry'})
                      ]
I used a regex, so don't forget to add this at the top of the recipe:
Code:
import re
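
For anyone unsure what the regex does, here is a small standalone sketch of the pattern applied to a few title strings. It assumes the tag matcher tests each title attribute value against the compiled pattern; only the first sample title comes from the page quoted in this thread, the other two are made up for illustration.

```python
import re

# Same pattern as in the keep_only_tags entry.
POSTS_BY = re.compile(r'Posts by.*', re.DOTALL | re.IGNORECASE)

# Sample title attribute values; the last two are hypothetical.
titles = ['Posts by Kimball Atwood', 'posts by someone else', 'Permanent link']

for title in titles:
    # search() succeeds if the pattern matches anywhere in the string;
    # IGNORECASE lets the lowercase 'posts by' match as well
    print(title, '->', bool(POSTS_BY.search(title)))

# Titles that would be kept by the pattern
matches = [t for t in titles if POSTS_BY.search(t)]
```

The trailing `.*` with re.DOTALL just swallows whatever follows "Posts by", so any author name matches.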
Old 12-12-2010, 09:44 AM   #3
BuzzKill
Starson17,

Thank you very much for the answer. That did it. I knew regular expressions could be used, but I just don't understand them yet.
Old 12-12-2010, 10:05 AM   #4
Starson17
Quote:
Originally Posted by BuzzKill View Post
Starson17,

Thank you very much for the answer. That did it. I knew regular expressions could be used, but I just don't understand them yet.
When your recipe is done, you should submit it here. I enjoyed reading some of the posts. (I needed to see the page to understand your problem.)
Tags
calibre, recipe, xml

