Old 03-07-2011, 09:41 AM   #3
oneillpt
Quote:
Originally Posted by bthoven
With Calibre, we can easily convert newspapers with RSS feeds to e-news.

As there are many newspapers which do not provide RSS feeds on their website, is there any way to automatically generate feeds from such websites and then use Calibre to convert them to full-article e-news?
You need to override the parse_index method. The NYTimes example in the Calibre User Manual, http://calibre-ebook.com/user_manual/news.html, shows how this can be done. Grep for parse_index in the built-in recipes to find more examples.
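For orientation, parse_index must return a list of (feed title, article list) pairs, where each article is a dictionary with at least 'title' and 'url' keys. A minimal skeleton, using a placeholder feed title and simply taking every link on the index page as an article, might look like this:

def parse_index(self):
    # self.INDEX is assumed to hold the url of the page listing the articles
    soup = self.index_to_soup(self.INDEX)
    articles = []
    for post in soup.findAll('a', href=True):
        articles.append({'title': self.tag_to_string(post), 'url': post['href']})
    # one (feed title, article list) pair per section of the site
    return [('Articles', articles)]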

As a simpler example may be helpful, I have added a recipe for Babelia in El Pais, recently requested in this forum, at the end of this reply, with comments on each step immediately below to help you understand the process (note that indentation is significant in Python but is lost in these comments; see the full code at the end for the correct indentation). As the site does not return any duplicate links, I have kept the recipe simple by not checking for duplicates. See some of the built-in recipes for how duplicate checking can be carried out; a typical pattern is also sketched after the full recipe below.

I hope this helps:

(1) import the basic recipe and needed parts from BeautifulSoup

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

(2) declare your class, derived from BasicNewsRecipe, and set the variable INDEX to the url of the site page with the article links

class ElPaisBabelia(BasicNewsRecipe):

title = 'El Pais Babelia'
__author__ = 'oneillpt'
description = 'El Pais Babelia'
INDEX = 'http://www.elpais.com/suple/babelia/'
language = 'es'

(3) examining the page source for the individual article pages, we find that the article text, together with some additional matter not required, is contained in a DIV with class="estructura_2col". keep_only_tags specifies that we work with this section only, and remove_tags_before removes some links which would otherwise appear before the article. Note that we deal with article extraction here, before we deal with link extraction later by overriding parse_index

remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
keep_only_tags = [dict(name='div', attrs={'class':'estructura_2col'})]

(4) remove_tags removes the additional matter not required for the article. Add these entries after examining the generated article output and identifying the unwanted matter in the original page source

remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
dict(name='div', attrs={'id':'utilidades'}),
dict(name='div', attrs={'class':'info_relacionada'}),
dict(name='div', attrs={'class':'mod_apoyo'}),
dict(name='div', attrs={'class':'contorno_f'}),
dict(name='div', attrs={'class':'pestanias'}),
dict(name='div', attrs={'class':'otros_webs'}),
dict(name='div', attrs={'id':'pie'})
]

(5) you will probably want to remove javascript, and may want to disable loading of stylesheets. Here, disabling stylesheets does not make much difference, so I have retained the no_stylesheets line for future use if desired, but commented it out with "#"

#no_stylesheets = True
remove_javascript = True

(6) parse_index finds the article links by fetching the page given by the INDEX variable and looking for links in a DIV with class="contenedor_nuevo". No cover image is specified. All subsequent lines here are part of parse_index; see the code at the end for the correct indentation structure

def parse_index(self):
articles = []
soup = self.index_to_soup(self.INDEX)
cover = None
feeds = []
for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
section_title = self.tag_to_string(section.find('h1'))
articles = []

(7) all article links have an "href" attribute, so we restrict the search to such links

for post in section.findAll('a', href=True):
url = post['href']

(8) other links may also have an "href" attribute, but article links start with "/" and need the base url prepended

if url.startswith('/'):
url = 'http://www.elpais.es'+url
title = self.tag_to_string(post)

(9) we may still have too many links, but all article links will have a class attribute. The value of this attribute varies, so we just check for its existence, not its value. Two points to note: the variable holding it is named klass because class is a reserved word in Python, and post['class'] will raise an error if there is no class attribute, so we first convert the post soup to a string and check whether it contains "class="

if str(post).find('class=') > 0:
klass = post['class']
if klass != "":
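As an alternative, and assuming the BeautifulSoup bundled with Calibre provides Tag.get (which behaves like dict.get), the same check can be written without converting the post to a string:

klass = post.get('class', '')   # '' rather than an error when there is no class attribute
if klass != "":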

(10) you may find it useful to log output to see what is happening. This output will appear in the job details when the recipe is built within Calibre. Remember that you can also run the recipe manually from a command prompt:

ebook-convert ElPaisBabelia.recipe ELPB --test -vv

in which case you can examine the html source for the two articles which will be downloaded into the ELPB folder structure


self.log()
self.log('--> post: ', post)
self.log('--> url: ', url)
self.log('--> title: ', title)
self.log('--> class: ', klass)

(11) build the list of article links

articles.append({'title':title, 'url':url})

(12) and if any article links have been found, append the article list to the feed list, which is finally returned

if articles:
feeds.append((section_title, articles))
return feeds

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class ElPaisBabelia(BasicNewsRecipe):

    title      = 'El Pais Babelia'
    __author__ = 'oneillpt'
    description = 'El Pais Babelia'
    INDEX = 'http://www.elpais.com/suple/babelia/'
    language = 'es'

    # the article text lives in the DIV with class="estructura_2col"
    remove_tags_before = dict(name='div', attrs={'class':'estructura_2col'})
    keep_only_tags = [dict(name='div', attrs={'class':'estructura_2col'})]
    # unwanted matter within that DIV
    remove_tags = [dict(name='div', attrs={'class':'votos estirar'}),
        dict(name='div', attrs={'id':'utilidades'}),
        dict(name='div', attrs={'class':'info_relacionada'}),
        dict(name='div', attrs={'class':'mod_apoyo'}),
        dict(name='div', attrs={'class':'contorno_f'}),
        dict(name='div', attrs={'class':'pestanias'}),
        dict(name='div', attrs={'class':'otros_webs'}),
        dict(name='div', attrs={'id':'pie'})
        ]
    #no_stylesheets = True
    remove_javascript = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        cover = None    # no cover image is specified
        feeds = []
        for section in soup.findAll('div', attrs={'class':'contenedor_nuevo'}):
            section_title = self.tag_to_string(section.find('h1'))
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                # article links are relative and need the base url prepended
                if url.startswith('/'):
                    url = 'http://www.elpais.es' + url
                    title = self.tag_to_string(post)
                    # article links carry a class attribute; its value varies,
                    # so only its presence is checked
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        if klass != "":
                            self.log()
                            self.log('--> post:  ', post)
                            self.log('--> url:   ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
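If a site does return duplicate links, a typical pattern (a sketch only, not part of the recipe above) is to remember the urls already collected in a set and skip repeats inside the link loop:

seen = set()    # urls already added to the article list
for post in section.findAll('a', href=True):
    url = post['href']
    if url in seen:
        continue    # skip a link that has already been collected
    seen.add(url)
    articles.append({'title': self.tag_to_string(post), 'url': url})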