Old 09-22-2011, 01:03 PM   #6
a.peter
Okay. I see your problem.

In fact, parse_index(self) has to return a list of this form:

Code:
[
 ('feed title', [
            {'title':..., 'url':..., 'description':..., 'date':...},
            # more dictionaries as above, one per article ...
           ]
 ),
 # more (feed title, article list) tuples, one per genre ...
]
Each url has to point to an HTML page.

calibre downloads each of these pages and applies remove_tags, keep_only_tags and the other cleanup settings to it, which results in a cleaned HTML page.
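
If you are unsure whether your own parse_index returns that shape, a quick check outside calibre can help. The following is only a sketch: check_index is a made-up helper, not part of calibre, and it simply tests for the four keys shown above. Whether calibre really needs all four is an assumption this code does not verify.

```python
# Hypothetical helper, not a calibre API: checks the list-of-tuples layout
# described above and reports what does not match.
EXPECTED_KEYS = ('title', 'url', 'description', 'date')  # keys from the post

def check_index(feeds):
    """Return a list of problem descriptions; an empty list means OK."""
    if not isinstance(feeds, list):
        return ['top level must be a list']
    problems = []
    for i, feed in enumerate(feeds):
        if not (isinstance(feed, tuple) and len(feed) == 2):
            problems.append('feed %d is not a (title, articles) tuple' % i)
            continue
        feed_title, articles = feed
        if not isinstance(articles, list):
            problems.append('feed %d: articles must be a list' % i)
            continue
        for j, article in enumerate(articles):
            if not isinstance(article, dict):
                problems.append('feed %d, article %d is not a dict' % (i, j))
                continue
            for key in EXPECTED_KEYS:
                if key not in article:
                    problems.append('feed %d, article %d: missing %r' % (i, j, key))
    return problems

feeds = [('Humor', [{'title': 'Rudy y Daniel Paz',
                     'url': 'http://www.pagina12.com.ar/diario/ultimas/index.html',
                     'description': '', 'date': ''}])]
print(check_index(feeds))  # → []
```

An empty list means the structure matches; it says nothing about whether the URLs themselves are reachable.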

A working example would be:

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString
import re

class Pagina12(BasicNewsRecipe):
    title          = 'Pagina/12 - Edicion Impresa'
    __author__     = 'Pablo Marfil'
    description    = 'Diario argentino'
    # Section index of the paper; not used yet by this minimal example.
    INDEX          = 'http://www.pagina12.com.ar/diario/secciones/index.html'
    language       = 'es'
    encoding       = 'cp1252'
    # Keep only the div with id 'rudy_paz' from each downloaded page.
    keep_only_tags = [dict(name='div', attrs={'id':'rudy_paz'})]
    masthead_url   = 'http://www.pagina12.com.ar/commons/imgs/logo-home.gif'
    no_stylesheets = True

    #preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m:'')]

    def parse_index(self):
        # One feed ('Humor') with one article; 'url' points to an HTML page.
        feeds = [('Humor', [{'title':'Rudy y Daniel Paz', 'url':'http://www.pagina12.com.ar/diario/ultimas/index.html', 'description':'', 'date':''}])]
        # Log the structure instead of print/raw_input, which would pause the download.
        self.log('parse_index returns:', feeds)
        return feeds
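
With more than one article you probably do not want to write that list by hand. As a rough sketch, and assuming you have already collected (section, title, url) rows from the index page somehow, the grouping could look like this. group_articles and the example.com URLs are made up for illustration; they are neither part of calibre nor of the recipe above.

```python
# Hypothetical helper: turns flat (section, title, url) rows into the
# [(feed title, [article dict, ...]), ...] list that parse_index returns.
def group_articles(rows):
    sections = {}  # dicts keep insertion order (Python 3.7+)
    for section, title, url in rows:
        sections.setdefault(section, []).append(
            {'title': title, 'url': url, 'description': '', 'date': ''})
    # items() yields (section, articles) tuples in first-seen order.
    return list(sections.items())

rows = [
    ('Humor', 'Rudy y Daniel Paz',
     'http://www.pagina12.com.ar/diario/ultimas/index.html'),
    ('Section B', 'Placeholder article', 'http://example.com/b1.html'),  # made-up row
    ('Humor', 'Placeholder cartoon', 'http://example.com/h2.html'),      # made-up row
]
for feed_title, articles in group_articles(rows):
    print(feed_title, len(articles))
```

Inside a recipe, parse_index would build rows from the page behind INDEX and then return the result of group_articles(rows).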

Last edited by a.peter; 09-22-2011 at 01:21 PM.