20 Minutos - First steps

tolyluis · 01-24-2011, 11:06 PM

Hi again:

I have made a first attempt at turning the 20 Minutos web page into a readable ebook. The result is very good, (almost) perfect. It takes a lot of time (about 5 minutes) and space to build a HUGE file* (about 4 MB) from this well-known Spanish online newspaper. I think this work adds to the selection of Spanish newspapers available for Calibre.

But it has some LIMITATIONS: a) It doesn't get the comics (Viñetas); I don't know how to do that (yet); b) I don't include the local news feeds, since the file is HUGE as it is now.

This is my recipe:

20minutos.es - One of the most visited Spanish online newspapers

Code:

__license__   = 'GPL v3'
__author__    = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'
description   = 'Periódico gratuito en español - v0.5 - 25 Jan 2011'

'''
www.20minutos.es
'''

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1294946868(BasicNewsRecipe):

    title          = u'20 Minutos'
    publisher      = u'Grupo 20 Minutos'

    __author__            = 'Luis Hernández'
    description           = 'Periódico gratuito en español'
    cover_url     = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'

    oldest_article = 5
    max_articles_per_feed = 100

    remove_javascript = True
    no_stylesheets        = True
    use_embedded_content  = False

    encoding              = 'ISO-8859-1'
    language              = 'es'
    timefmt        = '[%a, %d %b, %Y]'

    keep_only_tags     = [dict(name='div', attrs={'id':['content']})
                                  ,dict(name='div', attrs={'class':['boxed','description','lead','article-content']})
                                  ,dict(name='span', attrs={'class':['photo-bar']})
                                  ,dict(name='ul', attrs={'class':['article-author']})                                
                                ]

    remove_tags_before = dict(name='ul' , attrs={'class':['servicios-sub']})
    remove_tags_after  = dict(name='div' , attrs={'class':['related-news','col']})

    remove_tags = [
                     dict(name='ol', attrs={'class':['navigation',]})
                    ,dict(name='span', attrs={'class':['action']})
                    ,dict(name='div', attrs={'class':['twitter comments-list hidden','related-news','col']})
                    ,dict(name='div', attrs={'id':['twitter-destacados']})
                    ,dict(name='ul', attrs={'class':['article-user-actions','stripped-list']})
                                          ]

    feeds = [
              (u'Portada'              , u'http://www.20minutos.es/rss/')
             ,(u'Nacional'             , u'http://www.20minutos.es/rss/nacional/')
             ,(u'Internacional'       , u'http://www.20minutos.es/rss/internacional/')
             ,(u'Economia'           , u'http://www.20minutos.es/rss/economia/')
             ,(u'Deportes'            , u'http://www.20minutos.es/rss/deportes/')
             ,(u'Tecnologia'          , u'http://www.20minutos.es/rss/tecnologia/')
             ,(u'Gente - TV'         , u'http://www.20minutos.es/rss/gente-television/')
             ,(u'Motor'                 , u'http://www.20minutos.es/rss/motor/')
             ,(u'Salud'                 , u'http://www.20minutos.es/rss/belleza-y-salud/')
             ,(u'Viajes'                , u'http://www.20minutos.es/rss/viajes/')
             ,(u'Vivienda'             , u'http://www.20minutos.es/rss/vivienda/')
             ,(u'Empleo'              , u'http://www.20minutos.es/rss/empleo/')
             ,(u'Cine'                  , u'http://www.20minutos.es/rss/cine/')
             ,(u'Musica'               , u'http://www.20minutos.es/rss/musica/')
             ,(u'Comunidad20'     , u'http://www.20minutos.es/rss/zona20/')
            ]

Of course, anybody can use this recipe and make it better!

*With oldest_article = 5; you can change the number of days to suit your needs.

kovidgoyal · 01-25-2011, 11:38 AM

Will be in next release

nadid · 01-26-2011, 10:20 AM

Tolyluis,

Please can you do the same for the viñetas/comics of 20 minutes? Maybe a different recipe?

Thank you.

tolyluis · 01-27-2011, 11:01 AM

Hi again.

I worked on this recipe last night and have a new version WITH comics.

CHANGELOG

v0.8

- Adjusted the code to remove some unwanted content
- Added comics (viñetas), with bugs (they may be fixed later)

Source Code:

Code:

__license__   = 'GPL v3'
__author__    = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'
description   = 'Periódico gratuito en español - v0.8 - 27 Jan 2011'

'''
www.20minutos.es
'''

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1294946868(BasicNewsRecipe):

    title          = u'20 Minutos'
    publisher      = u'Grupo 20 Minutos'

    __author__            = 'Luis Hernández'
    description           = 'Periódico gratuito en español'
    cover_url     = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'

    oldest_article = 5
    max_articles_per_feed = 100

    remove_javascript = True
    no_stylesheets        = True
    use_embedded_content  = False

    encoding              = 'ISO-8859-1'
    language              = 'es'
    timefmt        = '[%a, %d %b, %Y]'

    keep_only_tags     = [
                                   dict(name='div', attrs={'id':['content','vinetas',]})
                                  ,dict(name='div', attrs={'class':['boxed','description','lead','article-content','cuerpo estirar']})
                                  ,dict(name='span', attrs={'class':['photo-bar']})
                                  ,dict(name='ul', attrs={'class':['article-author']})                                
                                ]

    remove_tags_before = dict(name='ul' , attrs={'class':['servicios-sub']})
    remove_tags_after  = dict(name='div' , attrs={'class':['related-news','col']})

    remove_tags = [
                     dict(name='ol', attrs={'class':['navigation',]})
                    ,dict(name='span', attrs={'class':['action']})
                    ,dict(name='div', attrs={'class':['twitter comments-list hidden','related-news','col','photo-gallery','calendario','article-comment','postto estirar','otras_vinetas estirar','kment','user-actions']})
                    ,dict(name='div', attrs={'id':['twitter-destacados','eco-tabs','inner','vineta_calendario','vinetistas clearfix','otras_vinetas estirar','MIN1','main','SUP1','INT']})
                    ,dict(name='ul', attrs={'class':['article-user-actions','stripped-list']})
                    ,dict(name='ul', attrs={'id':['site-links']})
                    ,dict(name='li', attrs={'class':['puntuacion','enviar','compartir']})
                       ]

    feeds = [
              (u'Portada'              , u'http://www.20minutos.es/rss/')
             ,(u'Nacional'             , u'http://www.20minutos.es/rss/nacional/')
             ,(u'Internacional'       , u'http://www.20minutos.es/rss/internacional/')
             ,(u'Economia'           , u'http://www.20minutos.es/rss/economia/')
             ,(u'Deportes'            , u'http://www.20minutos.es/rss/deportes/')
             ,(u'Tecnologia'          , u'http://www.20minutos.es/rss/tecnologia/')
             ,(u'Gente - TV'         , u'http://www.20minutos.es/rss/gente-television/')
             ,(u'Motor'                 , u'http://www.20minutos.es/rss/motor/')
             ,(u'Salud'                 , u'http://www.20minutos.es/rss/belleza-y-salud/')
             ,(u'Viajes'                , u'http://www.20minutos.es/rss/viajes/')
             ,(u'Vivienda'             , u'http://www.20minutos.es/rss/vivienda/')
             ,(u'Empleo'              , u'http://www.20minutos.es/rss/empleo/')
             ,(u'Cine'                  , u'http://www.20minutos.es/rss/cine/')
             ,(u'Musica'               , u'http://www.20minutos.es/rss/musica/')
             ,(u'Vinetas'              , u'http://www.20minutos.es/rss/vinetas/')
             ,(u'Comunidad20'     , u'http://www.20minutos.es/rss/zona20/')
            ]

May be comics be fixed with

(I'll try to open a new thread later)

Hope you enjoy this version. I would like some feedback.

nadid · 01-28-2011, 03:17 AM

I will test it

Thanks for your work!

I will give you some feedback tonight.

tolyluis · 01-28-2011, 12:34 PM

A few small changes are necessary in the code for it to work properly in testing mode with the ebook-convert command. No changes were made to the "real" code; only some non-ASCII characters were removed.

SOURCE CODE

Code:

__license__   = 'GPL v3'
__author__    = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'


'''
www.20minutos.es
'''

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1294946868(BasicNewsRecipe):

    title          = u'20 Minutos'
    publisher      = u'Grupo 20 Minutos'

    __author__            = 'Luis Hernandez'
    description           = 'Periodico gratuito independiente'
    cover_url     = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'

    oldest_article = 5
    max_articles_per_feed = 100

    remove_javascript = True
    no_stylesheets        = True
    use_embedded_content  = False

    encoding              = 'ISO-8859-1'
    language              = 'es'
    timefmt        = '[%a, %d %b, %Y]'

    keep_only_tags     = [
                                   dict(name='div', attrs={'id':['content','vinetas',]})
                                  ,dict(name='div', attrs={'class':['boxed','description','lead','article-content','cuerpo estirar']})
                                  ,dict(name='span', attrs={'class':['photo-bar']})
                                  ,dict(name='ul', attrs={'class':['article-author']})                                
                                ]

    remove_tags_before = dict(name='ul' , attrs={'class':['servicios-sub']})
    remove_tags_after  = dict(name='div' , attrs={'class':['related-news','col']})

    remove_tags = [
                     dict(name='ol', attrs={'class':['navigation',]})
                    ,dict(name='span', attrs={'class':['action']})
                    ,dict(name='div', attrs={'class':['twitter comments-list hidden','related-news','col','photo-gallery','calendario','article-comment','postto estirar','otras_vinetas estirar','kment','user-actions']})
                    ,dict(name='div', attrs={'id':['twitter-destacados','eco-tabs','inner','vineta_calendario','vinetistas clearfix','otras_vinetas estirar','MIN1','main','SUP1','INT']})
                    ,dict(name='ul', attrs={'class':['article-user-actions','stripped-list']})
                    ,dict(name='ul', attrs={'id':['site-links']})
                    ,dict(name='li', attrs={'class':['puntuacion','enviar','compartir']})
                       ]

    feeds = [
              (u'Portada'              , u'http://www.20minutos.es/rss/')
             ,(u'Nacional'             , u'http://www.20minutos.es/rss/nacional/')
             ,(u'Internacional'       , u'http://www.20minutos.es/rss/internacional/')
             ,(u'Economia'           , u'http://www.20minutos.es/rss/economia/')
             ,(u'Deportes'            , u'http://www.20minutos.es/rss/deportes/')
             ,(u'Tecnologia'          , u'http://www.20minutos.es/rss/tecnologia/')
             ,(u'Gente - TV'         , u'http://www.20minutos.es/rss/gente-television/')
             ,(u'Motor'                 , u'http://www.20minutos.es/rss/motor/')
             ,(u'Salud'                 , u'http://www.20minutos.es/rss/belleza-y-salud/')
             ,(u'Viajes'                , u'http://www.20minutos.es/rss/viajes/')
             ,(u'Vivienda'             , u'http://www.20minutos.es/rss/vivienda/')
             ,(u'Empleo'              , u'http://www.20minutos.es/rss/empleo/')
             ,(u'Cine'                  , u'http://www.20minutos.es/rss/cine/')
             ,(u'Musica'               , u'http://www.20minutos.es/rss/musica/')
             ,(u'Vinetas'              , u'http://www.20minutos.es/rss/vinetas/')
             ,(u'Comunidad20'     , u'http://www.20minutos.es/rss/zona20/')
            ]

Sorry for duplicating posts.

tolyluis · 01-31-2011, 02:32 PM

Hi all

I have a problem with this recipe. The original web page looks like this:

And my recipe shows it like this:

All the articles have the same problem. I have located the offending code in the original source of the web page:

Quote:

I just want to remove everything from <a href to "_blank"> in all articles fetched by the recipe. How can I do this?

I tried the preprocess_regexps option, but I don't know the syntax; I get the same error over and over again:

Can anybody help me?

Thanks

(click on the images to see them bigger)

kovidgoyal · 01-31-2011, 03:10 PM

Add
import re

near the top of your recipe
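
For anyone following along, here is a minimal sketch of the preprocess_regexps syntax and of why the import matters: each entry is a (compiled pattern, replacement function) pair, so the recipe has to import re before it can call re.compile. The HTML string below is a made-up sample, not the real 20minutos.es markup, and the apply_rules helper is hypothetical; it only imitates applying each pair in order and is not calibre's own code.

```python
import re

# Same (pattern, replacement function) pair format used by preprocess_regexps.
# The pattern strips an opening <a href=... target="_blank"> tag.
preprocess_regexps = [
    (re.compile(r'<a href="http://estaticos\..*?[0-9]px;" target="_blank">',
                re.DOTALL),
     lambda m: ''),
]

# Hypothetical sample markup, invented for illustration only.
sample_html = (
    '<p>Intro</p>'
    '<a href="http://estaticos.example.invalid/foto.jpg" '
    'style="width: 300px;" target="_blank">'
    '<img src="foto.jpg"/></a>'
)


def apply_rules(html, rules):
    # Hypothetical helper: applies every (pattern, function) pair in order.
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


cleaned = apply_rules(sample_html, preprocess_regexps)
print(cleaned)
```

Note that a rule like this removes only the opening tag; the closing </a> stays in the page.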

tolyluis · 01-31-2011, 04:01 PM

Nice! Fantastico! It works! A new version of 20 Minutos is coming, stay tuned....

tolyluis · 01-31-2011, 07:10 PM

... and here it is:

CHANGELOG

- Changed oldest_article from 5 to 2; the ebook is now around 3 MB
- Added CSS styling; it looks better now
- Adjusted the code to remove some unwanted content
- Other minor changes

NOTES

This is my first time using re commands. The comics are unchanged this time; maybe in the future (once I learn a few more concepts, I may be able to fix them). Except for the comics, the recipe looks fantastic now.

SOURCE CODE

Code:

__license__   = 'GPL v3'
__author__    = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'
__version__     = 'v0.85'
__date__        = '31 January 2011'

'''
www.20minutos.es
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1294946868(BasicNewsRecipe):

    title          = u'20 Minutos'
    publisher      = u'Grupo 20 Minutos'

    __author__            = 'Luis Hernandez'
    description           = 'Free spanish newspaper'
    cover_url     = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'

    oldest_article = 2
    max_articles_per_feed = 100

    remove_javascript = True
    no_stylesheets        = True
    use_embedded_content  = False

    encoding              = 'ISO-8859-1'
    language              = 'es_ES'
    timefmt        = '[%a, %d %b, %Y]'
    remove_empty_feeds    = True

    keep_only_tags     = [
                                   dict(name='div', attrs={'id':['content','vinetas',]})
                                  ,dict(name='div', attrs={'class':['boxed','description','lead','article-content','cuerpo estirar']})
                                  ,dict(name='span', attrs={'class':['photo-bar']})
                                  ,dict(name='ul', attrs={'class':['article-author']})
                                ]

    remove_tags_before = dict(name='ul' , attrs={'class':['servicios-sub']})
    remove_tags_after  = dict(name='div' , attrs={'class':['related-news','col']})

    remove_tags = [
                     dict(name='ol', attrs={'class':['navigation',]})
                    ,dict(name='span', attrs={'class':['action']})
                    ,dict(name='div', attrs={'class':['twitter comments-list hidden','related-news','col','photo-gallery','photo-gallery side-art-block','calendario','article-comment','postto estirar','otras_vinetas estirar','kment','user-actions']})
                    ,dict(name='div', attrs={'id':['twitter-destacados','eco-tabs','inner','vineta_calendario','vinetistas clearfix','otras_vinetas estirar','MIN1','main','SUP1','INT']})
                    ,dict(name='ul', attrs={'class':['article-user-actions','stripped-list']})
                    ,dict(name='ul', attrs={'id':['site-links']})
                    ,dict(name='li', attrs={'class':['puntuacion','enviar','compartir']})
                       ]

    extra_css             = """
                               p{text-align: justify; font-size: 100%}
                               body{ text-align: left; font-size:100% }
                               h3{font-family: sans-serif; font-size:150%; font-weight:bold; text-align: justify; }
                                 """					   
					   
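    # Strip the opening <a href=... target="_blank"> tag around images.
    # [0-9] matches the same single digit that the original [0-999] class did.
    preprocess_regexps = [(re.compile(r'<a href="http://estaticos\..*?[0-9]px;" target="_blank">', re.DOTALL), lambda m: '')]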

    feeds = [
              (u'Portada'              , u'http://www.20minutos.es/rss/')
             ,(u'Nacional'             , u'http://www.20minutos.es/rss/nacional/')
             ,(u'Internacional'       , u'http://www.20minutos.es/rss/internacional/')
             ,(u'Economia'           , u'http://www.20minutos.es/rss/economia/')
             ,(u'Deportes'            , u'http://www.20minutos.es/rss/deportes/')
             ,(u'Tecnologia'          , u'http://www.20minutos.es/rss/tecnologia/')
             ,(u'Gente - TV'         , u'http://www.20minutos.es/rss/gente-television/')
             ,(u'Motor'                 , u'http://www.20minutos.es/rss/motor/')
             ,(u'Salud'                 , u'http://www.20minutos.es/rss/belleza-y-salud/')
             ,(u'Viajes'                , u'http://www.20minutos.es/rss/viajes/')
             ,(u'Vivienda'             , u'http://www.20minutos.es/rss/vivienda/')
             ,(u'Empleo'              , u'http://www.20minutos.es/rss/empleo/')
             ,(u'Cine'                  , u'http://www.20minutos.es/rss/cine/')
             ,(u'Musica'               , u'http://www.20minutos.es/rss/musica/')
             ,(u'Vinetas'          , u'http://www.20minutos.es/rss/vinetas/')
             ,(u'Comunidad20'     , u'http://www.20minutos.es/rss/zona20/')
            ]

I would like some feedback from users.

If the language is a problem, just PM me.

miwie · 02-01-2011, 03:14 AM

Quote:

Originally Posted by tolyluis

I would like some feedback from users.

If the language is a problem, just PM me.

First look is really good, suggestions:

Try to omit text-align: justify or change it to text-align: left in extra_css.
IMHO this looks much better on mobile reading devices

The cover page (logo of the periodical) does not look good. Try to find something different. Maybe you can find the title page of the daily edition.

tolyluis · 02-01-2011, 09:38 AM

Quote:

Originally Posted by miwie

First look is really good, suggestions:

Try to omit text-align: justify or change it to text-align: left in extra_css.
IMHO this looks much better on mobile reading devices

The cover page (logo of the periodical) does not look good. Try to find something different. Maybe you can find the title page of the daily edition.

Hi, thanks for your suggestions, they are appreciated.

First, sorry, but I don't like text-align: left; I prefer justified text. IMHO it looks better on my Kindle. If you want text-align: left, just personalize it!

I will look into the second suggestion in the future (along with the comics).

Thanks for your feedback!

miwie · 02-01-2011, 10:02 AM

Quote:

Originally Posted by tolyluis

... if you want text-align: left, just personalize it!

Unfortunately there is no easy way to personalize such settings in recipes, other than changing the code, which gets lost on update.

tolyluis · 02-01-2011, 11:20 AM

Quote:

Originally Posted by miwie

Unfortunately there is no easy way to personalize such settings in recipes, other than changing the code, which gets lost on update.

You need to use a personalized (custom) recipe for that. Updates don't affect custom recipes, and my plan is to post new code for my recipes on this forum, so just adapt it to your needs and tastes.

Just change the title (e.g. set it to "mi 20 minutos"), press Add/update recipe and voilà! You have a new personalized recipe that updates won't affect.
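
A minimal sketch of such a personalized copy, for readers who want left-aligned text. Only the title and extra_css differ from the recipe posted above. The class name Mi20Minutos and the one-entry feeds list are placeholders for illustration, and the try/except fallback is only there so the snippet can be run outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be run outside calibre.
    class BasicNewsRecipe(object):
        pass


class Mi20Minutos(BasicNewsRecipe):  # hypothetical class name
    # A new title, as suggested above, to keep this copy separate
    # from the built-in recipe.
    title = u'Mi 20 Minutos'
    language = 'es'
    oldest_article = 2
    no_stylesheets = True

    # Left-aligned text instead of justified.
    extra_css = """
        p { text-align: left; font-size: 100% }
        body { text-align: left; font-size: 100% }
        h3 { font-family: sans-serif; font-size: 150%; font-weight: bold; text-align: left; }
    """

    # Shortened for the example; paste the full feeds list from the recipe.
    feeds = [
        (u'Portada', u'http://www.20minutos.es/rss/'),
    ]
```

The remaining settings (keep_only_tags, remove_tags, preprocess_regexps and so on) would be copied over unchanged from the posted recipe.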
