Old 03-02-2010, 02:07 PM   #1
oddeyed
 
Posts: 17
Karma: 10
Join Date: Feb 2010
Location: London, UK
Device: Kindle 3rd Generation - 3G + Wifi - Graphite
Delete News Sections

Hi everyone,

So when I do eventually take the plunge and get a reader, I want to be able to access news feeds on it.

I would use the Feedbooks self-updating, automagical Newspapers, but since my soon-to-be reader, even if it is a Kindle, won't have that functionality (I'm not in the US), I thought I'd use Calibre's news output instead, since it can include pictures.

I have been doing test builds with the Guardian recipe on my laptop, and without fail they include the Sport section, G2, and Entertainment, which I don't want, even though I have edited the recipe to exclude them.

So, does anyone know how to stop this from happening?

Thanks,
oddeyed

Below is my custom recipe:
Code:
#!/usr/bin/env python
__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
__docformat__ = 'restructuredtext en'

'''
www.guardian.co.uk
'''
from calibre import strftime
from calibre.web.feeds.news import BasicNewsRecipe

class Guardian(BasicNewsRecipe):

    title = u'The Guardian - Top Stories'
    language = 'en_GB'

    oldest_article = 2
    max_articles_per_feed = 5
    remove_javascript = True

    timefmt = ' [%a, %d %b %Y]'
    keep_only_tags = [
                      dict(name='div', attrs={'id':["content","article_header","main-article-info",]}),
                           ]
    remove_tags = [
                        dict(name='div', attrs={'class':["video-content","videos-third-column"]}),
                        dict(name='div', attrs={'id':["article-toolbox","subscribe-feeds",]}),
                        dict(name='ul', attrs={'class':["pagination"]}),
                        dict(name='ul', attrs={'id':["content-actions"]}),
                        ]
    use_embedded_content    = False

    no_stylesheets = True
    extra_css = '''
                    .article-attributes{font-size: x-small; font-family:Arial,Helvetica,sans-serif;}
                    .h1{font-size: large ;font-family:georgia,serif; font-weight:bold;}
                    .stand-first-alone{color:#666666; font-size:small; font-family:Arial,Helvetica,sans-serif;}
                    .caption{color:#666666; font-size:x-small; font-family:Arial,Helvetica,sans-serif;}
                    #article-wrapper{font-size:small; font-family:Arial,Helvetica,sans-serif;font-weight:normal;}
                    .main-article-info{font-family:Arial,Helvetica,sans-serif;}
                    #full-contents{font-size:small; font-family:Arial,Helvetica,sans-serif;font-weight:normal;}
                    #match-stats-summary{font-size:small; font-family:Arial,Helvetica,sans-serif;font-weight:normal;}
                '''

    feeds = [
        ('Top Stories', 'http://www.guardian.co.uk/theguardian/mainsection/topstories/rss'),
        ]

    def get_article_url(self, article):
          url = article.get('guid', None)
          if url is None:
              return None
          # Note: each substring test needs its own "in url"; a bare
          # 'football' is always truthy and would match every article.
          if ('/video/' in url or '/flyer/' in url or '/quiz/' in url or
                  '/gallery/' in url or 'ivebeenthere' in url or
                  'pickthescore' in url or 'audioslideshow' in url or
                  '/sport' in url or 'educationguardian' in url or
                  'football' in url or '/films' in url):
              url = None
          return url

    def preprocess_html(self, soup):

          for item in soup.findAll(style=True):
              del item['style']

          for item in soup.findAll(face=True):
              del item['face']
          for tag in soup.findAll(name=['ul','li']):
                tag.name = 'div'

          return soup

    def find_sections(self):
        soup = self.index_to_soup('http://www.guardian.co.uk/theguardian')
        # find cover pic
        img = soup.find( 'img',attrs ={'alt':'Guardian digital edition'})
        if img is not None:
            self.cover_url = img['src']
        # end find cover pic

        idx = soup.find('div', id='book-index')
        for s in idx.findAll('strong', attrs={'class':'book'}):
            a = s.find('a', href=True)
            yield (self.tag_to_string(a), a['href'])

    def find_articles(self, url):
        soup = self.index_to_soup(url)
        div = soup.find('div', attrs={'class':'book-index'})
        for ul in div.findAll('ul', attrs={'class':'trailblock'}):
            for li in ul.findAll('li'):
                a = li.find(href=True)
                if not a:
                    continue
                title = self.tag_to_string(a)
                url = a['href']
                if not title or not url:
                    continue
                desc = ''  # default, so desc is never unbound or stale
                tt = li.find('div', attrs={'class':'trailtext'})
                if tt is not None:
                    for da in tt.findAll('a'):
                        da.extract()
                    desc = self.tag_to_string(tt).strip()
                yield {
                        'title': title, 'url': url, 'description': desc,
                        'date': strftime('%a, %d %b'),
                        }

    def parse_index(self):
        try:
            feeds = []
            for title, href in self.find_sections():
                feeds.append((title, list(self.find_articles(href))))
            return feeds
        except:
            # If the index page cannot be parsed, raise NotImplementedError
            # so calibre falls back to the RSS feeds defined above.
            raise NotImplementedError
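For what it's worth, the same URL filter is less error-prone with the excluded substrings kept in one list, so a missing `in url` (the bug above, where a bare `'football'` is always truthy) can't slip through. A minimal sketch; the standalone helper name `filter_article_url` is just for illustration, and in the recipe the logic would stay inside `get_article_url(self, article)`:

```python
# Substrings identifying sections to skip (same set as in the recipe).
EXCLUDED_SUBSTRINGS = [
    '/video/', '/flyer/', '/quiz/', '/gallery/', 'ivebeenthere',
    'pickthescore', 'audioslideshow', '/sport', 'educationguardian',
    'football', '/films',
]

def filter_article_url(url):
    """Return the url unchanged, or None if it belongs to an excluded section."""
    if url is None:
        return None
    if any(s in url for s in EXCLUDED_SUBSTRINGS):
        return None
    return url
```

Returning `None` from `get_article_url` is how a recipe tells calibre to drop that article from the feed.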
Old 03-03-2010, 03:21 AM   #2
DoctorOhh
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by oddeyed
I have been doing test builds with the Guardian recipe on my laptop, and without fail they include the Sport section, G2, and Entertainment, which I don't want, even though I have edited the recipe to exclude them.

So, does anyone know how to stop this from happening?
I'm guessing that after you customized the recipe, you are still selecting the original from the English (UK) section instead of grabbing your custom recipe from the Custom Recipe area (see attached).

The file you have been editing can only be accessed through the Custom Recipe section; the original in the English (UK) section never changes. Notice that the custom recipe does not have the little G icon next to it.
Attached Thumbnails: custom_recipe-2.png (41.4 KB)

Last edited by DoctorOhh; 03-03-2010 at 03:25 AM.