#1
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
20 Minutos - First steps
Hi again:
I have made a first attempt at turning the 20 Minutos website into a readable ebook. The result is very good, almost perfect. Building it takes a while (about 5 minutes) and produces a HUGE file* (about 4 MB) for this well-known Spanish online newspaper. I think this recipe widens the selection of Spanish newspapers available in calibre.
It has some LIMITATIONS, though:
a) It doesn't fetch the comics (viñetas); I don't know how to do that (yet).
b) It leaves out the local news feeds, because the file is already HUGE as it is.
This is my recipe: 20minutos.es - One of the most visited Spanish online newspapers Code:
__license__ = 'GPL v3'
__author__ = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'
description = 'Periódico gratuito en español - v0.5 - 25 Jan 2011'
'''
www.20minutos.es
'''
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1294946868(BasicNewsRecipe):
    title = u'20 Minutos'
    publisher = u'Grupo 20 Minutos'
    __author__ = 'Luis Hernández'
    description = 'Periódico gratuito en español'
    cover_url = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'
    oldest_article = 5  # days
    max_articles_per_feed = 100
    remove_javascript = True
    no_stylesheets = True
    use_embedded_content = False
    encoding = 'ISO-8859-1'
    language = 'es'
    timefmt = '[%a, %d %b, %Y]'

    # Keep only the containers that hold the article itself
    keep_only_tags = [
        dict(name='div', attrs={'id':['content']})
        ,dict(name='div', attrs={'class':['boxed','description','lead','article-content']})
        ,dict(name='span', attrs={'class':['photo-bar']})
        ,dict(name='ul', attrs={'class':['article-author']})
    ]
    remove_tags_before = dict(name='ul' , attrs={'class':['servicios-sub']})
    remove_tags_after = dict(name='div' , attrs={'class':['related-news','col']})

    # Drop navigation, sharing widgets and related-news boxes
    remove_tags = [
        dict(name='ol', attrs={'class':['navigation',]})
        ,dict(name='span', attrs={'class':['action']})
        ,dict(name='div', attrs={'class':['twitter comments-list hidden','related-news','col']})
        ,dict(name='div', attrs={'id':['twitter-destacados']})
        ,dict(name='ul', attrs={'class':['article-user-actions','stripped-list']})
    ]

    feeds = [
        (u'Portada' , u'http://www.20minutos.es/rss/')
        ,(u'Nacional' , u'http://www.20minutos.es/rss/nacional/')
        ,(u'Internacional' , u'http://www.20minutos.es/rss/internacional/')
        ,(u'Economia' , u'http://www.20minutos.es/rss/economia/')
        ,(u'Deportes' , u'http://www.20minutos.es/rss/deportes/')
        ,(u'Tecnologia' , u'http://www.20minutos.es/rss/tecnologia/')
        ,(u'Gente - TV' , u'http://www.20minutos.es/rss/gente-television/')
        ,(u'Motor' , u'http://www.20minutos.es/rss/motor/')
        ,(u'Salud' , u'http://www.20minutos.es/rss/belleza-y-salud/')
        ,(u'Viajes' , u'http://www.20minutos.es/rss/viajes/')
        ,(u'Vivienda' , u'http://www.20minutos.es/rss/vivienda/')
        ,(u'Empleo' , u'http://www.20minutos.es/rss/empleo/')
        ,(u'Cine' , u'http://www.20minutos.es/rss/cine/')
        ,(u'Musica' , u'http://www.20minutos.es/rss/musica/')
        ,(u'Comunidad20' , u'http://www.20minutos.es/rss/zona20/')
    ]
*With oldest_article = 5; you can change the number of days to suit your needs.
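To make the footnote concrete, here is a minimal sketch of the idea behind an age cutoff such as oldest_article. It is only an illustration and not calibre's actual implementation; the helper name is_recent and the sample dates are assumptions invented for this example.

```python
from datetime import datetime, timedelta

def is_recent(published, now, oldest_article=5):
    """Return True if an article published at `published` is at most
    `oldest_article` days old at time `now` (hypothetical helper)."""
    return now - published <= timedelta(days=oldest_article)

# Sample dates, made up for the illustration
now = datetime(2011, 1, 25, 12, 0)
print(is_recent(datetime(2011, 1, 22, 9, 0), now))                    # True: about 3 days old
print(is_recent(datetime(2011, 1, 18, 9, 0), now))                    # False: about 7 days old
print(is_recent(datetime(2011, 1, 22, 9, 0), now, oldest_article=2))  # False: cutoff lowered to 2 days
```

A smaller cutoff keeps fewer articles, so it should also shorten the download and shrink the resulting file.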
#2
creator of calibre
Posts: 45,609
Karma: 28549044
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Will be in next release
#3
Junior Member
Posts: 8
Karma: 10
Join Date: Jan 2011
Device: Kindle 3
Tolyluis,
Could you please do the same for the viñetas (comics) of 20 Minutos? Maybe as a separate recipe? Thank you.
#4
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
20 Minutos (v0.8)
Hi again.
I worked on this recipe last night and I have a new version WITH comics.
CHANGELOG v0.8
- Adjusted the code to remove some unwanted content
- Added comics (viñetas), with bugs (they may get fixed)
Source Code: Code:
__license__ = 'GPL v3'
__author__ = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'
description = 'Periódico gratuito en español - v0.8 - 27 Jan 2011'
'''
www.20minutos.es
'''
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1294946868(BasicNewsRecipe):
    title = u'20 Minutos'
    publisher = u'Grupo 20 Minutos'
    __author__ = 'Luis Hernández'
    description = 'Periódico gratuito en español'
    cover_url = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'
    oldest_article = 5  # days
    max_articles_per_feed = 100
    remove_javascript = True
    no_stylesheets = True
    use_embedded_content = False
    encoding = 'ISO-8859-1'
    language = 'es'
    timefmt = '[%a, %d %b, %Y]'

    # 'vinetas' and 'cuerpo estirar' are new: they hold the comics
    keep_only_tags = [
        dict(name='div', attrs={'id':['content','vinetas',]})
        ,dict(name='div', attrs={'class':['boxed','description','lead','article-content','cuerpo estirar']})
        ,dict(name='span', attrs={'class':['photo-bar']})
        ,dict(name='ul', attrs={'class':['article-author']})
    ]
    remove_tags_before = dict(name='ul' , attrs={'class':['servicios-sub']})
    remove_tags_after = dict(name='div' , attrs={'class':['related-news','col']})

    # Drop navigation, sharing widgets, galleries and comic-page extras
    remove_tags = [
        dict(name='ol', attrs={'class':['navigation',]})
        ,dict(name='span', attrs={'class':['action']})
        ,dict(name='div', attrs={'class':['twitter comments-list hidden','related-news','col','photo-gallery','calendario','article-comment','postto estirar','otras_vinetas estirar','kment','user-actions']})
        ,dict(name='div', attrs={'id':['twitter-destacados','eco-tabs','inner','vineta_calendario','vinetistas clearfix','otras_vinetas estirar','MIN1','main','SUP1','INT']})
        ,dict(name='ul', attrs={'class':['article-user-actions','stripped-list']})
        ,dict(name='ul', attrs={'id':['site-links']})
        ,dict(name='li', attrs={'class':['puntuacion','enviar','compartir']})
    ]

    feeds = [
        (u'Portada' , u'http://www.20minutos.es/rss/')
        ,(u'Nacional' , u'http://www.20minutos.es/rss/nacional/')
        ,(u'Internacional' , u'http://www.20minutos.es/rss/internacional/')
        ,(u'Economia' , u'http://www.20minutos.es/rss/economia/')
        ,(u'Deportes' , u'http://www.20minutos.es/rss/deportes/')
        ,(u'Tecnologia' , u'http://www.20minutos.es/rss/tecnologia/')
        ,(u'Gente - TV' , u'http://www.20minutos.es/rss/gente-television/')
        ,(u'Motor' , u'http://www.20minutos.es/rss/motor/')
        ,(u'Salud' , u'http://www.20minutos.es/rss/belleza-y-salud/')
        ,(u'Viajes' , u'http://www.20minutos.es/rss/viajes/')
        ,(u'Vivienda' , u'http://www.20minutos.es/rss/vivienda/')
        ,(u'Empleo' , u'http://www.20minutos.es/rss/empleo/')
        ,(u'Cine' , u'http://www.20minutos.es/rss/cine/')
        ,(u'Musica' , u'http://www.20minutos.es/rss/musica/')
        ,(u'Vinetas' , u'http://www.20minutos.es/rss/vinetas/')
        ,(u'Comunidad20' , u'http://www.20minutos.es/rss/zona20/')
    ]
(I'll try to open a new thread later.) I hope you enjoy this version. I would like some feedback.
#5
Junior Member
Posts: 8
Karma: 10
Join Date: Jan 2011
Device: Kindle 3
This afternoon
I will test it
Thanks for your work! I will give you some feedback tonight.
#6
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
20 Minutos (v0.8 ct)
A few small changes are necessary for the recipe to work properly in testing mode with the ebook-convert command. Nothing has changed in the "real" code; only some non-ASCII characters have been removed.
SOURCE CODE Code:
__license__ = 'GPL v3'
__author__ = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'
'''
www.20minutos.es
'''
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1294946868(BasicNewsRecipe):
    title = u'20 Minutos'
    publisher = u'Grupo 20 Minutos'
    __author__ = 'Luis Hernandez'
    description = 'Periodico gratuito independiente'
    cover_url = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'
    oldest_article = 5  # days
    max_articles_per_feed = 100
    remove_javascript = True
    no_stylesheets = True
    use_embedded_content = False
    encoding = 'ISO-8859-1'
    language = 'es'
    timefmt = '[%a, %d %b, %Y]'

    # Keep only the article and comic containers
    keep_only_tags = [
        dict(name='div', attrs={'id':['content','vinetas',]})
        ,dict(name='div', attrs={'class':['boxed','description','lead','article-content','cuerpo estirar']})
        ,dict(name='span', attrs={'class':['photo-bar']})
        ,dict(name='ul', attrs={'class':['article-author']})
    ]
    remove_tags_before = dict(name='ul' , attrs={'class':['servicios-sub']})
    remove_tags_after = dict(name='div' , attrs={'class':['related-news','col']})

    # Drop navigation, sharing widgets, galleries and comic-page extras
    remove_tags = [
        dict(name='ol', attrs={'class':['navigation',]})
        ,dict(name='span', attrs={'class':['action']})
        ,dict(name='div', attrs={'class':['twitter comments-list hidden','related-news','col','photo-gallery','calendario','article-comment','postto estirar','otras_vinetas estirar','kment','user-actions']})
        ,dict(name='div', attrs={'id':['twitter-destacados','eco-tabs','inner','vineta_calendario','vinetistas clearfix','otras_vinetas estirar','MIN1','main','SUP1','INT']})
        ,dict(name='ul', attrs={'class':['article-user-actions','stripped-list']})
        ,dict(name='ul', attrs={'id':['site-links']})
        ,dict(name='li', attrs={'class':['puntuacion','enviar','compartir']})
    ]

    feeds = [
        (u'Portada' , u'http://www.20minutos.es/rss/')
        ,(u'Nacional' , u'http://www.20minutos.es/rss/nacional/')
        ,(u'Internacional' , u'http://www.20minutos.es/rss/internacional/')
        ,(u'Economia' , u'http://www.20minutos.es/rss/economia/')
        ,(u'Deportes' , u'http://www.20minutos.es/rss/deportes/')
        ,(u'Tecnologia' , u'http://www.20minutos.es/rss/tecnologia/')
        ,(u'Gente - TV' , u'http://www.20minutos.es/rss/gente-television/')
        ,(u'Motor' , u'http://www.20minutos.es/rss/motor/')
        ,(u'Salud' , u'http://www.20minutos.es/rss/belleza-y-salud/')
        ,(u'Viajes' , u'http://www.20minutos.es/rss/viajes/')
        ,(u'Vivienda' , u'http://www.20minutos.es/rss/vivienda/')
        ,(u'Empleo' , u'http://www.20minutos.es/rss/empleo/')
        ,(u'Cine' , u'http://www.20minutos.es/rss/cine/')
        ,(u'Musica' , u'http://www.20minutos.es/rss/musica/')
        ,(u'Vinetas' , u'http://www.20minutos.es/rss/vinetas/')
        ,(u'Comunidad20' , u'http://www.20minutos.es/rss/zona20/')
    ]
#7
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
Help me to improve this recipe
Hi all
I have a problem with this recipe. The original web page looks like this: [image]
And my recipe renders it like this: [image]
All the articles have the same problem. I have located the code responsible for this disaster in the HTML source of the page:
Quote:
I tried the preprocess_regexps option, but I don't know the syntax and I get the same error over and over again:
Can anybody help me? Thanks.
#8
creator of calibre
Posts: 45,609
Karma: 28549044
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Add
import re near the top of your recipe
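The import matters because preprocess_regexps is a list of two-element tuples, a compiled regular expression and a callable that receives the match object and returns the replacement string, so the recipe itself has to call re.compile. Below is a minimal sketch of what such a rule does, run outside calibre. The sample HTML and the apply_regexps helper are assumptions invented for this illustration, and the loop only mimics the substitution step.

```python
import re

# One rule: remove opening <a> tags that point at the estaticos host
preprocess_regexps = [
    (re.compile(r'<a href="http://estaticos.*?[0-9]px;" target="_blank">', re.DOTALL),
     lambda match: ''),
]

def apply_regexps(html, regexps):
    """Hypothetical helper: run every (pattern, callable) rule over the HTML."""
    for pattern, replacement in regexps:
        html = pattern.sub(replacement, html)
    return html

# Made-up sample markup, not copied from the real site
sample = ('<a href="http://estaticos.20minutos.es/img/foto.jpg" '
          'style="width: 300px;" target="_blank"><img src="foto.jpg"/></a>'
          '<p>Texto del articulo</p>')

cleaned = apply_regexps(sample, preprocess_regexps)
print(cleaned)  # <img src="foto.jpg"/></a><p>Texto del articulo</p>
```

Without the import, the name re is undefined when the recipe is loaded, which would explain an error that repeats no matter how the pattern is written.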
#9
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
Nice! Fantastic! It works! A new version of 20 Minutos is coming now, stay tuned...
#10
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
20 Minutos (v0.85)
... and here it is:
CHANGELOG
- Changed oldest_article from 5 to 2; the ebook is now around 3 MB
- Added CSS styling; it looks better now
- Adjusted the code to remove some unwanted content
- Other minor changes
NOTES
This is my first time using the re module. The comics are unchanged this time; maybe in the future (a few more concepts and I may be able to fix them). Except for the comics, the recipe looks fantastic now.
SOURCE CODE Code:
__license__ = 'GPL v3'
__author__ = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'
__version__ = 'v0.85'
__date__ = '31 January 2011'
'''
www.20minutos.es
'''
import re
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1294946868(BasicNewsRecipe):
    title = u'20 Minutos'
    publisher = u'Grupo 20 Minutos'
    __author__ = 'Luis Hernandez'
    description = 'Free spanish newspaper'
    cover_url = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'
    oldest_article = 2  # days
    max_articles_per_feed = 100
    remove_javascript = True
    no_stylesheets = True
    use_embedded_content = False
    encoding = 'ISO-8859-1'
    language = 'es_ES'
    timefmt = '[%a, %d %b, %Y]'
    remove_empty_feeds = True

    # Keep only the article and comic containers
    keep_only_tags = [
        dict(name='div', attrs={'id':['content','vinetas',]})
        ,dict(name='div', attrs={'class':['boxed','description','lead','article-content','cuerpo estirar']})
        ,dict(name='span', attrs={'class':['photo-bar']})
        ,dict(name='ul', attrs={'class':['article-author']})
    ]
    remove_tags_before = dict(name='ul' , attrs={'class':['servicios-sub']})
    remove_tags_after = dict(name='div' , attrs={'class':['related-news','col']})

    # Drop navigation, sharing widgets, galleries and comic-page extras
    remove_tags = [
        dict(name='ol', attrs={'class':['navigation',]})
        ,dict(name='span', attrs={'class':['action']})
        ,dict(name='div', attrs={'class':['twitter comments-list hidden','related-news','col','photo-gallery','photo-gallery side-art-block','calendario','article-comment','postto estirar','otras_vinetas estirar','kment','user-actions']})
        ,dict(name='div', attrs={'id':['twitter-destacados','eco-tabs','inner','vineta_calendario','vinetistas clearfix','otras_vinetas estirar','MIN1','main','SUP1','INT']})
        ,dict(name='ul', attrs={'class':['article-user-actions','stripped-list']})
        ,dict(name='ul', attrs={'id':['site-links']})
        ,dict(name='li', attrs={'class':['puntuacion','enviar','compartir']})
    ]

    extra_css = """
    p{text-align: justify; font-size: 100%}
    body{ text-align: left; font-size:100% }
    h3{font-family: sans-serif; font-size:150%; font-weight:bold; text-align: justify; }
    """

    # Strip the opening <a> tags that link to the estaticos host.
    # The digit class is written [0-9]; a class matches a single character.
    preprocess_regexps = [(re.compile(r'<a href="http://estaticos.*?[0-9]px;" target="_blank">', re.DOTALL), lambda m: '')]

    feeds = [
        (u'Portada' , u'http://www.20minutos.es/rss/')
        ,(u'Nacional' , u'http://www.20minutos.es/rss/nacional/')
        ,(u'Internacional' , u'http://www.20minutos.es/rss/internacional/')
        ,(u'Economia' , u'http://www.20minutos.es/rss/economia/')
        ,(u'Deportes' , u'http://www.20minutos.es/rss/deportes/')
        ,(u'Tecnologia' , u'http://www.20minutos.es/rss/tecnologia/')
        ,(u'Gente - TV' , u'http://www.20minutos.es/rss/gente-television/')
        ,(u'Motor' , u'http://www.20minutos.es/rss/motor/')
        ,(u'Salud' , u'http://www.20minutos.es/rss/belleza-y-salud/')
        ,(u'Viajes' , u'http://www.20minutos.es/rss/viajes/')
        ,(u'Vivienda' , u'http://www.20minutos.es/rss/vivienda/')
        ,(u'Empleo' , u'http://www.20minutos.es/rss/empleo/')
        ,(u'Cine' , u'http://www.20minutos.es/rss/cine/')
        ,(u'Musica' , u'http://www.20minutos.es/rss/musica/')
        ,(u'Vinetas' , u'http://www.20minutos.es/rss/vinetas/')
        ,(u'Comunidad20' , u'http://www.20minutos.es/rss/zona20/')
    ]
If the language is a problem, just PM me.
Last edited by tolyluis; 01-31-2011 at 08:34 PM.
#11
Connoisseur
Posts: 76
Karma: 12
Join Date: Nov 2010
Device: Android, PB Pro 602
Quote:
Try omitting text-align: justify, or change it to text-align: left, in extra_css. IMHO this looks much better on mobile reading devices.
The cover page (the logo of the periodical) does not look good. Try to find something different. Maybe you can find the title page of the daily edition.
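As a sketch of the first suggestion, this is roughly what a left-aligned stylesheet could look like. The selectors and sizes are assumptions chosen for the illustration, so adapt them to whatever the recipe already sets in extra_css.

```python
# Hypothetical left-aligned variant of an extra_css value for a recipe.
# Only the text-align values are the point here; the rest is illustrative.
extra_css = """
p{text-align: left; font-size: 100%}
body{text-align: left; font-size: 100%}
h3{font-family: sans-serif; font-size: 150%; font-weight: bold; text-align: left;}
"""

print(extra_css)
```

Leaving text-align out altogether is the other option mentioned above; in that case the reading device's own default alignment applies.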
#12
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
Quote:
First of all, sorry, but I don't like text-align: left; I prefer justified text. IMHO it looks better on my Kindle. If you want text-align: left, just customize it! I will look into the second suggestion in the future (along with the comics). Thanks for your feedback!
#13
Connoisseur
Posts: 76
Karma: 12
Join Date: Nov 2010
Device: Android, PB Pro 602
#14
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
Quote:
Just change the title (e.g. set it to "mi 20 minutos"), press Add/update recipe and voilà! You have a new personalized recipe that is not affected by updates.
Last edited by tolyluis; 02-01-2011 at 07:14 PM.
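A minimal sketch of such a personalized copy follows. The class name Mi20Minutos, the shortened feed list and the try/except stand-in for BasicNewsRecipe are assumptions for illustration only; a real copy would keep the remaining attributes of the recipe.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    class BasicNewsRecipe(object):
        """Stand-in base class so this sketch also runs outside calibre."""

class Mi20Minutos(BasicNewsRecipe):
    # The changed title is what sets this copy apart, as described in the post
    title = u'Mi 20 Minutos'
    oldest_article = 2  # days
    language = 'es'
    # Shortened on purpose; copy the remaining feeds and tag rules as needed
    feeds = [
        (u'Portada', u'http://www.20minutos.es/rss/'),
        (u'Deportes', u'http://www.20minutos.es/rss/deportes/'),
    ]

print(Mi20Minutos.title)
```

Personal preferences, such as a different text alignment, can then live in this copy without touching the original recipe.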