Old 01-24-2011, 11:06 PM   #1
tolyluis
20 Minutos - First steps

Hi again:

I have made a first attempt at turning the 20 Minutos website into a readable e-book. The result is very good, almost perfect. Building it takes a while (about 5 minutes) and produces a HUGE file* (about 4 MB) for this well-known Spanish online newspaper. I think it adds to the selection of Spanish newspapers available for Calibre. It has some LIMITATIONS, though: a) it doesn't fetch the comics (Viñetas); I don't know how to do that (yet); b) it doesn't include the local news feeds, since the file is already HUGE as it is.

This is my recipe:

20minutos.es - One of the most visited Spanish online newspapers

Code:
__license__   = 'GPL v3'
__author__    = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'
description   = 'Periódico gratuito en español - v0.5 - 25 Jan 2011'

'''
www.20minutos.es
'''

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1294946868(BasicNewsRecipe):

    title          = u'20 Minutos'
    publisher      = u'Grupo 20 Minutos'

    __author__            = 'Luis Hernández'
    description           = 'Periódico gratuito en español'
    cover_url     = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'

    oldest_article = 5
    max_articles_per_feed = 100

    remove_javascript = True
    no_stylesheets        = True
    use_embedded_content  = False

    encoding              = 'ISO-8859-1'
    language              = 'es'
    timefmt        = '[%a, %d %b, %Y]'

    keep_only_tags     = [dict(name='div', attrs={'id':['content']})
                                  ,dict(name='div', attrs={'class':['boxed','description','lead','article-content']})
                                  ,dict(name='span', attrs={'class':['photo-bar']})
                                  ,dict(name='ul', attrs={'class':['article-author']})                                
                                ]

    remove_tags_before = dict(name='ul' , attrs={'class':['servicios-sub']})
    remove_tags_after  = dict(name='div' , attrs={'class':['related-news','col']})

    remove_tags = [
                     dict(name='ol', attrs={'class':['navigation',]})
                    ,dict(name='span', attrs={'class':['action']})
                    ,dict(name='div', attrs={'class':['twitter comments-list hidden','related-news','col']})
                    ,dict(name='div', attrs={'id':['twitter-destacados']})
                    ,dict(name='ul', attrs={'class':['article-user-actions','stripped-list']})
                                          ]

    feeds = [
              (u'Portada'              , u'http://www.20minutos.es/rss/')
             ,(u'Nacional'             , u'http://www.20minutos.es/rss/nacional/')
             ,(u'Internacional'       , u'http://www.20minutos.es/rss/internacional/')
             ,(u'Economia'           , u'http://www.20minutos.es/rss/economia/')
             ,(u'Deportes'            , u'http://www.20minutos.es/rss/deportes/')
             ,(u'Tecnologia'          , u'http://www.20minutos.es/rss/tecnologia/')
             ,(u'Gente - TV'         , u'http://www.20minutos.es/rss/gente-television/')
             ,(u'Motor'                 , u'http://www.20minutos.es/rss/motor/')
             ,(u'Salud'                 , u'http://www.20minutos.es/rss/belleza-y-salud/')
             ,(u'Viajes'                , u'http://www.20minutos.es/rss/viajes/')
             ,(u'Vivienda'             , u'http://www.20minutos.es/rss/vivienda/')
             ,(u'Empleo'              , u'http://www.20minutos.es/rss/empleo/')
             ,(u'Cine'                  , u'http://www.20minutos.es/rss/cine/')
             ,(u'Musica'               , u'http://www.20minutos.es/rss/musica/')
             ,(u'Comunidad20'     , u'http://www.20minutos.es/rss/zona20/')
            ]
Of course, anybody can use this recipe and improve it!

*With oldest_article = 5; you can change the number of days to suit your needs.
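
Besides lowering oldest_article, another way to keep the file small might be to keep only the sections you actually read. A minimal sketch in plain Python, assuming a shortened copy of the feeds list from the recipe; select_feeds and the chosen section names are hypothetical helpers for illustration, not part of calibre:

```python
# Shortened copy of the feeds list from the recipe above.
feeds = [
    (u'Portada', u'http://www.20minutos.es/rss/'),
    (u'Nacional', u'http://www.20minutos.es/rss/nacional/'),
    (u'Deportes', u'http://www.20minutos.es/rss/deportes/'),
    (u'Tecnologia', u'http://www.20minutos.es/rss/tecnologia/'),
]


def select_feeds(all_feeds, wanted):
    """Keep only the (title, url) pairs whose title is in `wanted`.

    Hypothetical helper: it preserves the original order of the feeds.
    """
    return [(title, url) for title, url in all_feeds if title in wanted]


# Example: keep just the front page and the technology section.
my_feeds = select_feeds(feeds, {u'Portada', u'Tecnologia'})
for title, url in my_feeds:
    print(title, url)
```

The filtered list could then be pasted into a personal copy of the recipe as its feeds attribute.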
Old 01-25-2011, 11:38 AM   #2
kovidgoyal
creator of calibre
Will be in next release
Old 01-26-2011, 10:20 AM   #3
nadid
Tolyluis,

Could you please do the same for the viñetas (comics) of 20 Minutos? Maybe as a separate recipe?

Thank you.
Old 01-27-2011, 11:01 AM   #4
tolyluis
20 Minutos (v0.8)

Hi again.

I worked on this recipe last night, and I have a new version WITH comics.

CHANGELOG

v0.8

- Adjusted the code to remove some unwanted content
- Added comics (viñetas), with some bugs (which may be fixed later)

Source Code:

Code:
__license__   = 'GPL v3'
__author__    = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'
description   = 'Periódico gratuito en español - v0.8 - 27 Jan 2011'

'''
www.20minutos.es
'''

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1294946868(BasicNewsRecipe):

    title          = u'20 Minutos'
    publisher      = u'Grupo 20 Minutos'

    __author__            = 'Luis Hernández'
    description           = 'Periódico gratuito en español'
    cover_url     = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'

    oldest_article = 5
    max_articles_per_feed = 100

    remove_javascript = True
    no_stylesheets        = True
    use_embedded_content  = False

    encoding              = 'ISO-8859-1'
    language              = 'es'
    timefmt        = '[%a, %d %b, %Y]'

    keep_only_tags     = [
                                   dict(name='div', attrs={'id':['content','vinetas',]})
                                  ,dict(name='div', attrs={'class':['boxed','description','lead','article-content','cuerpo estirar']})
                                  ,dict(name='span', attrs={'class':['photo-bar']})
                                  ,dict(name='ul', attrs={'class':['article-author']})                                
                                ]

    remove_tags_before = dict(name='ul' , attrs={'class':['servicios-sub']})
    remove_tags_after  = dict(name='div' , attrs={'class':['related-news','col']})

    remove_tags = [
                     dict(name='ol', attrs={'class':['navigation',]})
                    ,dict(name='span', attrs={'class':['action']})
                    ,dict(name='div', attrs={'class':['twitter comments-list hidden','related-news','col','photo-gallery','calendario','article-comment','postto estirar','otras_vinetas estirar','kment','user-actions']})
                    ,dict(name='div', attrs={'id':['twitter-destacados','eco-tabs','inner','vineta_calendario','vinetistas clearfix','otras_vinetas estirar','MIN1','main','SUP1','INT']})
                    ,dict(name='ul', attrs={'class':['article-user-actions','stripped-list']})
                    ,dict(name='ul', attrs={'id':['site-links']})
                    ,dict(name='li', attrs={'class':['puntuacion','enviar','compartir']})
                       ]

    feeds = [
              (u'Portada'              , u'http://www.20minutos.es/rss/')
             ,(u'Nacional'             , u'http://www.20minutos.es/rss/nacional/')
             ,(u'Internacional'       , u'http://www.20minutos.es/rss/internacional/')
             ,(u'Economia'           , u'http://www.20minutos.es/rss/economia/')
             ,(u'Deportes'            , u'http://www.20minutos.es/rss/deportes/')
             ,(u'Tecnologia'          , u'http://www.20minutos.es/rss/tecnologia/')
             ,(u'Gente - TV'         , u'http://www.20minutos.es/rss/gente-television/')
             ,(u'Motor'                 , u'http://www.20minutos.es/rss/motor/')
             ,(u'Salud'                 , u'http://www.20minutos.es/rss/belleza-y-salud/')
             ,(u'Viajes'                , u'http://www.20minutos.es/rss/viajes/')
             ,(u'Vivienda'             , u'http://www.20minutos.es/rss/vivienda/')
             ,(u'Empleo'              , u'http://www.20minutos.es/rss/empleo/')
             ,(u'Cine'                  , u'http://www.20minutos.es/rss/cine/')
             ,(u'Musica'               , u'http://www.20minutos.es/rss/musica/')
             ,(u'Vinetas'              , u'http://www.20minutos.es/rss/vinetas/')
             ,(u'Comunidad20'     , u'http://www.20minutos.es/rss/zona20/')
            ]
Maybe the comics can be fixed; I'll try to open a new thread about that later.

I hope you enjoy this version. I would like some feedback.
Old 01-28-2011, 03:17 AM   #5
nadid
This afternoon

I will test it. Thanks for your work!

I will give you some feedback tonight.
Old 01-28-2011, 12:34 PM   #6
tolyluis
20 Minutos (v0.8 ct)

A few small changes to the code were necessary for it to work properly in testing mode with the ebook-convert command. No changes were made to the "real" code; only some non-ASCII characters were removed.

SOURCE CODE

Code:
__license__   = 'GPL v3'
__author__    = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'


'''
www.20minutos.es
'''

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1294946868(BasicNewsRecipe):

    title          = u'20 Minutos'
    publisher      = u'Grupo 20 Minutos'

    __author__            = 'Luis Hernandez'
    description           = 'Periodico gratuito independiente'
    cover_url     = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'

    oldest_article = 5
    max_articles_per_feed = 100

    remove_javascript = True
    no_stylesheets        = True
    use_embedded_content  = False

    encoding              = 'ISO-8859-1'
    language              = 'es'
    timefmt        = '[%a, %d %b, %Y]'

    keep_only_tags     = [
                                   dict(name='div', attrs={'id':['content','vinetas',]})
                                  ,dict(name='div', attrs={'class':['boxed','description','lead','article-content','cuerpo estirar']})
                                  ,dict(name='span', attrs={'class':['photo-bar']})
                                  ,dict(name='ul', attrs={'class':['article-author']})                                
                                ]

    remove_tags_before = dict(name='ul' , attrs={'class':['servicios-sub']})
    remove_tags_after  = dict(name='div' , attrs={'class':['related-news','col']})

    remove_tags = [
                     dict(name='ol', attrs={'class':['navigation',]})
                    ,dict(name='span', attrs={'class':['action']})
                    ,dict(name='div', attrs={'class':['twitter comments-list hidden','related-news','col','photo-gallery','calendario','article-comment','postto estirar','otras_vinetas estirar','kment','user-actions']})
                    ,dict(name='div', attrs={'id':['twitter-destacados','eco-tabs','inner','vineta_calendario','vinetistas clearfix','otras_vinetas estirar','MIN1','main','SUP1','INT']})
                    ,dict(name='ul', attrs={'class':['article-user-actions','stripped-list']})
                    ,dict(name='ul', attrs={'id':['site-links']})
                    ,dict(name='li', attrs={'class':['puntuacion','enviar','compartir']})
                       ]

    feeds = [
              (u'Portada'              , u'http://www.20minutos.es/rss/')
             ,(u'Nacional'             , u'http://www.20minutos.es/rss/nacional/')
             ,(u'Internacional'       , u'http://www.20minutos.es/rss/internacional/')
             ,(u'Economia'           , u'http://www.20minutos.es/rss/economia/')
             ,(u'Deportes'            , u'http://www.20minutos.es/rss/deportes/')
             ,(u'Tecnologia'          , u'http://www.20minutos.es/rss/tecnologia/')
             ,(u'Gente - TV'         , u'http://www.20minutos.es/rss/gente-television/')
             ,(u'Motor'                 , u'http://www.20minutos.es/rss/motor/')
             ,(u'Salud'                 , u'http://www.20minutos.es/rss/belleza-y-salud/')
             ,(u'Viajes'                , u'http://www.20minutos.es/rss/viajes/')
             ,(u'Vivienda'             , u'http://www.20minutos.es/rss/vivienda/')
             ,(u'Empleo'              , u'http://www.20minutos.es/rss/empleo/')
             ,(u'Cine'                  , u'http://www.20minutos.es/rss/cine/')
             ,(u'Musica'               , u'http://www.20minutos.es/rss/musica/')
             ,(u'Vinetas'              , u'http://www.20minutos.es/rss/vinetas/')
             ,(u'Comunidad20'     , u'http://www.20minutos.es/rss/zona20/')
            ]
Sorry for duplicating posts.
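
This post mentions removing non-ASCII characters from the recipe before testing. A minimal sketch of how such characters could be located, assuming the recipe source is available as a string; find_non_ascii is a hypothetical helper written for illustration and is not part of calibre:

```python
def find_non_ascii(source):
    """Return (line_number, character) pairs for every non-ASCII character."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for char in line:
            if ord(char) > 127:
                hits.append((line_number, char))
    return hits


# Two lines taken from the earlier version of the recipe, with the
# accented characters written as escapes so this sketch itself stays ASCII.
sample = (
    u"__author__ = 'Luis Hern\u00e1ndez'\n"
    u"description = 'Peri\u00f3dico gratuito en espa\u00f1ol'\n"
)

for line_number, char in find_non_ascii(sample):
    print(line_number, repr(char))
```

Each reported line can then be edited by hand, as was done for the version posted here.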
Old 01-31-2011, 02:32 PM   #7
tolyluis
Help me to improve this recipe

Hi all

I have a problem with this recipe. The original web page looks like this:

[image: screenshot of the original web page]

And my recipe renders it like this:

[image: screenshot of the recipe output]

All the articles have the same problem. I have located the code responsible for this disaster in the original source of the web page:

Quote:
<a href="http://estaticos.20minutos.es/img2/recortes/2011/01/30/7626-944-550.jpg" title="<p><strong>Novak Djokovic</strong></p>
<p>El tenista serbio Novak Djokovic celebrando su triunfo en Australia. (BARBARA WALTON / EFE)</p>" class="article-photo photo _620x282 imagebox" rel="imagenes" style="height: 282px;" target="_blank">
I just want to remove everything from <a href to "_blank"> in all the articles in the recipe. How can I do this?

I tried the preprocess_regexps option, but I don't know the syntax, and I get the same error over and over again:

[image: screenshot of the error message]

Can anybody help me?

Thanks
Old 01-31-2011, 03:10 PM   #8
kovidgoyal
Add
import re

near the top of your recipe
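
preprocess_regexps is a list of (compiled pattern, function) pairs, so the recipe has to import the re module before it can call re.compile. A minimal sketch of how such a pair behaves when run outside calibre: apply_rules is a hypothetical stand-in for what calibre does with the list, the pattern is a simplified one chosen for illustration, and the sample tag is shortened from the one quoted earlier in the thread.

```python
import re  # without this line, re.compile below raises NameError

# One (pattern, replacement function) pair, in the shape a recipe expects.
preprocess_regexps = [
    (re.compile(r'<a href="http://estaticos\..*?" target="_blank">', re.DOTALL),
     lambda match: ''),
]


def apply_rules(html, rules):
    """Hypothetical stand-in: apply each pair to the page source in order."""
    for pattern, replacement in rules:
        html = pattern.sub(replacement, html)
    return html


# Shortened version of the anchor tag quoted above; the title attribute
# spans two lines, which is why re.DOTALL is needed.
sample = (
    '<a href="http://estaticos.20minutos.es/img2/recortes/2011/01/30/7626-944-550.jpg" '
    'title="<p><strong>Novak Djokovic</strong></p>\n'
    '<p>El tenista serbio Novak Djokovic</p>" class="article-photo" '
    'style="height: 282px;" target="_blank">'
    '<img src="foto.jpg"/></a>'
)

cleaned = apply_rules(sample, preprocess_regexps)
print(cleaned)
```

Only the opening tag is removed here, so the closing </a> stays in the output; the sketch leaves it as is.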
Old 01-31-2011, 04:01 PM   #9
tolyluis
Nice! Fantastic! It works! A new version of 20 Minutos is coming, stay tuned...
Old 01-31-2011, 07:10 PM   #10
tolyluis
20 Minutos (v0.85)

... and here it is:

CHANGELOG

- Changed oldest_article from 5 to 2; the e-book is now around 3 MB
- Added CSS styling; it looks better now
- Adjusted the code to remove some unwanted content
- Other minor changes

NOTES

This is my first time using the re module. The comics are unchanged this time; maybe in the future (once I learn a few more concepts, I may be able to fix them). Except for the comics, the recipe looks fantastic now.

SOURCE CODE

Code:
__license__   = 'GPL v3'
__author__    = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'
__version__     = 'v0.85'
__date__        = '31 January 2011'

'''
www.20minutos.es
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1294946868(BasicNewsRecipe):

    title          = u'20 Minutos'
    publisher      = u'Grupo 20 Minutos'

    __author__            = 'Luis Hernandez'
    description           = 'Free spanish newspaper'
    cover_url     = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'

    oldest_article = 2
    max_articles_per_feed = 100

    remove_javascript = True
    no_stylesheets        = True
    use_embedded_content  = False

    encoding              = 'ISO-8859-1'
    language              = 'es_ES'
    timefmt        = '[%a, %d %b, %Y]'
    remove_empty_feeds    = True

    keep_only_tags     = [
                                   dict(name='div', attrs={'id':['content','vinetas',]})
                                  ,dict(name='div', attrs={'class':['boxed','description','lead','article-content','cuerpo estirar']})
                                  ,dict(name='span', attrs={'class':['photo-bar']})
                                  ,dict(name='ul', attrs={'class':['article-author']})
                                ]

    remove_tags_before = dict(name='ul' , attrs={'class':['servicios-sub']})
    remove_tags_after  = dict(name='div' , attrs={'class':['related-news','col']})

    remove_tags = [
                     dict(name='ol', attrs={'class':['navigation',]})
                    ,dict(name='span', attrs={'class':['action']})
                    ,dict(name='div', attrs={'class':['twitter comments-list hidden','related-news','col','photo-gallery','photo-gallery side-art-block','calendario','article-comment','postto estirar','otras_vinetas estirar','kment','user-actions']})
                    ,dict(name='div', attrs={'id':['twitter-destacados','eco-tabs','inner','vineta_calendario','vinetistas clearfix','otras_vinetas estirar','MIN1','main','SUP1','INT']})
                    ,dict(name='ul', attrs={'class':['article-user-actions','stripped-list']})
                    ,dict(name='ul', attrs={'id':['site-links']})
                    ,dict(name='li', attrs={'class':['puntuacion','enviar','compartir']})
                       ]

    extra_css             = """
                               p{text-align: justify; font-size: 100%}
                               body{ text-align: left; font-size:100% }
                               h3{font-family: sans-serif; font-size:150%; font-weight:bold; text-align: justify; }
                                 """					   
					   
    preprocess_regexps = [(re.compile(r'<a href="http://estaticos.*?[0-999]px;" target="_blank">', re.DOTALL), lambda m: '')]

    feeds = [
              (u'Portada'              , u'http://www.20minutos.es/rss/')
             ,(u'Nacional'             , u'http://www.20minutos.es/rss/nacional/')
             ,(u'Internacional'       , u'http://www.20minutos.es/rss/internacional/')
             ,(u'Economia'           , u'http://www.20minutos.es/rss/economia/')
             ,(u'Deportes'            , u'http://www.20minutos.es/rss/deportes/')
             ,(u'Tecnologia'          , u'http://www.20minutos.es/rss/tecnologia/')
             ,(u'Gente - TV'         , u'http://www.20minutos.es/rss/gente-television/')
             ,(u'Motor'                 , u'http://www.20minutos.es/rss/motor/')
             ,(u'Salud'                 , u'http://www.20minutos.es/rss/belleza-y-salud/')
             ,(u'Viajes'                , u'http://www.20minutos.es/rss/viajes/')
             ,(u'Vivienda'             , u'http://www.20minutos.es/rss/vivienda/')
             ,(u'Empleo'              , u'http://www.20minutos.es/rss/empleo/')
             ,(u'Cine'                  , u'http://www.20minutos.es/rss/cine/')
             ,(u'Musica'               , u'http://www.20minutos.es/rss/musica/')
             ,(u'Vinetas'          , u'http://www.20minutos.es/rss/vinetas/')
             ,(u'Comunidad20'     , u'http://www.20minutos.es/rss/zona20/')
            ]
I would like some feedback from users. If the language is a problem, just PM me.
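
A note on the character class [0-999] in a pattern like the one posted above: inside square brackets it reads as the range 0-9 followed by two extra literal 9s, so it matches a single digit rather than a number up to 999. The pattern still works because the lazy .*? before it absorbs the other digits. A small sketch comparing it with \d+, which states the intent more directly; the sample tag is a shortened, assumed version of the one quoted earlier:

```python
import re

one_digit = re.compile(r'[0-999]')  # equivalent to [0-9]
digits = re.compile(r'\d+')         # one or more digits

matches_single = bool(one_digit.fullmatch('2'))
matches_three = bool(one_digit.fullmatch('282'))
digits_match_three = bool(digits.fullmatch('282'))

# The pattern as posted, and a variant with an escaped dot and \d+.
original = re.compile(
    r'<a href="http://estaticos.*?[0-999]px;" target="_blank">', re.DOTALL)
explicit = re.compile(
    r'<a href="http://estaticos\..*?\d+px;" target="_blank">', re.DOTALL)

# Shortened sample tag (assumption, for illustration only).
tag = ('<a href="http://estaticos.20minutos.es/img2/foto.jpg" '
       'style="height: 282px;" target="_blank">')

# Both patterns remove the whole tag from this sample.
same_result = original.sub('', tag) == explicit.sub('', tag) == ''
print(matches_single, matches_three, digits_match_three, same_result)
```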

Last edited by tolyluis; 01-31-2011 at 07:34 PM.
Old 02-01-2011, 03:14 AM   #11
miwie
Quote:
Originally Posted by tolyluis
I would like some feedback from users. If the language is a problem, just PM me.

The first look is really good. Some suggestions:

Try omitting text-align: justify, or changing it to text-align: left, in extra_css.
IMHO this looks much better on mobile reading devices.

The cover page (the logo of the periodical) does not look good. Try to find something different. Maybe you can find the front page of the daily edition.
Old 02-01-2011, 09:38 AM   #12
tolyluis
Quote:
Originally Posted by miwie
The first look is really good. Some suggestions:

Try omitting text-align: justify, or changing it to text-align: left, in extra_css.
IMHO this looks much better on mobile reading devices.

The cover page (the logo of the periodical) does not look good. Try to find something different. Maybe you can find the front page of the daily edition.

Hi, thanks for your suggestions; they are appreciated.

First, sorry, but I don't like text-align: left; I prefer justified text. IMHO it looks better on my Kindle. If you want text-align: left, just customize it!

I will look into the second suggestion in the future (along with the comics).

Thanks for your feedback!
Old 02-01-2011, 10:02 AM   #13
miwie
Quote:
Originally Posted by tolyluis
... if you want text-align: left, just customize it!

Unfortunately, there is no easy way to personalize such settings in recipes other than changing the code, and those changes get lost on update.
Old 02-01-2011, 11:20 AM   #14
tolyluis
Quote:
Originally Posted by miwie
Unfortunately, there is no easy way to personalize such settings in recipes other than changing the code, and those changes get lost on update.

You should use a custom recipe for that. Updates don't affect custom recipes, and my plan is to post new code for my recipes on this forum, so you can just adapt it to your needs and tastes.

Just change the title (set it to "mi 20 minutos", for example), press Add/update recipe, and voilà: a new personalized recipe that updates don't affect.
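
A minimal sketch of what such a personalized copy might look like, assuming only the title, the text alignment and one feed are of interest. The class name Mi20Minutos and the trimmed-down attribute list are illustrative, and the fallback base class exists only so the sketch can also be run outside calibre:

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be run outside calibre (assumption).
    class BasicNewsRecipe(object):
        pass


class Mi20Minutos(BasicNewsRecipe):
    # New title, as suggested above, so the copy is kept as its own recipe.
    title = u'mi 20 minutos'
    language = 'es'
    oldest_article = 2
    no_stylesheets = True

    # Same rules as the v0.85 extra_css, with left-aligned text
    # instead of justified text.
    extra_css = """
        p{text-align: left; font-size: 100%}
        body{text-align: left; font-size: 100%}
        h3{font-family: sans-serif; font-size: 150%; font-weight: bold; text-align: left;}
    """

    # Only the front page here; the full feed list can be copied
    # from the recipe posted above.
    feeds = [
        (u'Portada', u'http://www.20minutos.es/rss/'),
    ]
```

The remaining settings (keep_only_tags, remove_tags, preprocess_regexps and so on) would be copied over unchanged from the posted recipe.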

Last edited by tolyluis; 02-01-2011 at 06:14 PM.