01-24-2011, 11:06 PM | #1 |
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
|
20 Minutos - First steps
Hi again:
I have made a first attempt at turning the 20 Minutos website into a readable ebook. The result is very good, almost perfect. Building it takes a while (about 5 minutes) and produces a HUGE file* (about 4 MB) of this well-known Spanish online newspaper. I think it widens the choice of Spanish newspapers available for calibre. It has some LIMITATIONS, though:
a) It doesn't fetch the comics (viñetas); I don't know how to yet.
b) It doesn't include the local news feeds; the file is HUGE enough as it is.
This is my recipe: 20minutos.es - One of the most visited Spanish web newspapers
Code:
__license__ = 'GPL v3'
__author__ = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'
description = 'Periódico gratuito en español - v0.5 - 25 Jan 2011'

'''
www.20minutos.es
'''

class AdvancedUserRecipe1294946868(BasicNewsRecipe):

    title = u'20 Minutos'
    publisher = u'Grupo 20 Minutos'
    __author__ = 'Luis Hernández'
    description = 'Periódico gratuito en español'
    cover_url = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'

    oldest_article = 5
    max_articles_per_feed = 100
    remove_javascript = True
    no_stylesheets = True
    use_embedded_content = False
    encoding = 'ISO-8859-1'
    language = 'es'
    timefmt = '[%a, %d %b, %Y]'

    # Parts of the page that make up the article
    keep_only_tags = [
        dict(name='div', attrs={'id': ['content']}),
        dict(name='div', attrs={'class': ['boxed', 'description', 'lead', 'article-content']}),
        dict(name='span', attrs={'class': ['photo-bar']}),
        dict(name='ul', attrs={'class': ['article-author']}),
    ]

    remove_tags_before = dict(name='ul', attrs={'class': ['servicios-sub']})
    remove_tags_after = dict(name='div', attrs={'class': ['related-news', 'col']})

    # Navigation, social widgets and related-news boxes
    remove_tags = [
        dict(name='ol', attrs={'class': ['navigation']}),
        dict(name='span', attrs={'class': ['action']}),
        dict(name='div', attrs={'class': ['twitter comments-list hidden', 'related-news', 'col']}),
        dict(name='div', attrs={'id': ['twitter-destacados']}),
        dict(name='ul', attrs={'class': ['article-user-actions', 'stripped-list']}),
    ]

    feeds = [
        (u'Portada', u'http://www.20minutos.es/rss/'),
        (u'Nacional', u'http://www.20minutos.es/rss/nacional/'),
        (u'Internacional', u'http://www.20minutos.es/rss/internacional/'),
        (u'Economia', u'http://www.20minutos.es/rss/economia/'),
        (u'Deportes', u'http://www.20minutos.es/rss/deportes/'),
        (u'Tecnologia', u'http://www.20minutos.es/rss/tecnologia/'),
        (u'Gente - TV', u'http://www.20minutos.es/rss/gente-television/'),
        (u'Motor', u'http://www.20minutos.es/rss/motor/'),
        (u'Salud', u'http://www.20minutos.es/rss/belleza-y-salud/'),
        (u'Viajes', u'http://www.20minutos.es/rss/viajes/'),
        (u'Vivienda', u'http://www.20minutos.es/rss/vivienda/'),
        (u'Empleo', u'http://www.20minutos.es/rss/empleo/'),
        (u'Cine', u'http://www.20minutos.es/rss/cine/'),
        (u'Musica', u'http://www.20minutos.es/rss/musica/'),
        (u'Comunidad20', u'http://www.20minutos.es/rss/zona20/'),
    ]

*With oldest_article = 5; you can change the number of days to suit your needs. |
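The size of the resulting file depends largely on how many feeds are fetched and on oldest_article. A minimal sketch, in plain Python, of trimming the feeds list to a chosen set of sections before pasting it back into the recipe. The select_feeds helper and the chosen section names are hypothetical, for illustration only; they are not part of calibre.

```python
# Feed list copied (abridged) from the recipe above.
feeds = [
    (u'Portada', u'http://www.20minutos.es/rss/'),
    (u'Nacional', u'http://www.20minutos.es/rss/nacional/'),
    (u'Internacional', u'http://www.20minutos.es/rss/internacional/'),
    (u'Deportes', u'http://www.20minutos.es/rss/deportes/'),
    (u'Tecnologia', u'http://www.20minutos.es/rss/tecnologia/'),
]


def select_feeds(all_feeds, wanted):
    """Keep only the (title, url) pairs whose title is in `wanted`, preserving order."""
    return [(title, url) for title, url in all_feeds if title in wanted]


# Hypothetical choice of sections for a smaller ebook.
small_feeds = select_feeds(feeds, {u'Portada', u'Tecnologia'})
for title, url in small_feeds:
    print(title, url)
```

The filtered list keeps the original order, so the sections appear in the ebook in the same sequence as before.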
01-25-2011, 11:38 AM | #2 |
creator of calibre
Posts: 43,866
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Will be in next release
|
01-26-2011, 10:20 AM | #3 |
Junior Member
Posts: 8
Karma: 10
Join Date: Jan 2011
Device: Kindle 3
|
Tolyluis,
Please, could you do the same for the viñetas/comics of 20 Minutos? Maybe as a separate recipe? Thank you. |
01-27-2011, 11:01 AM | #4 |
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
|
20 minutos (v1.2)
Hi again.
I worked on this recipe last night and I have a new version WITH comics.
CHANGELOG v0.8
- Adjusted the code to remove some unwanted content
- Added comics (viñetas), with bugs (which may be fixed later)
Source Code: Code:
__license__ = 'GPL v3'
__author__ = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'
description = 'Periódico gratuito en español - v0.8 - 27 Jan 2011'

'''
www.20minutos.es
'''

class AdvancedUserRecipe1294946868(BasicNewsRecipe):

    title = u'20 Minutos'
    publisher = u'Grupo 20 Minutos'
    __author__ = 'Luis Hernández'
    description = 'Periódico gratuito en español'
    cover_url = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'

    oldest_article = 5
    max_articles_per_feed = 100
    remove_javascript = True
    no_stylesheets = True
    use_embedded_content = False
    encoding = 'ISO-8859-1'
    language = 'es'
    timefmt = '[%a, %d %b, %Y]'

    # Parts of the page that make up the article (now including the comics)
    keep_only_tags = [
        dict(name='div', attrs={'id': ['content', 'vinetas']}),
        dict(name='div', attrs={'class': ['boxed', 'description', 'lead', 'article-content', 'cuerpo estirar']}),
        dict(name='span', attrs={'class': ['photo-bar']}),
        dict(name='ul', attrs={'class': ['article-author']}),
    ]

    remove_tags_before = dict(name='ul', attrs={'class': ['servicios-sub']})
    remove_tags_after = dict(name='div', attrs={'class': ['related-news', 'col']})

    # Navigation, social widgets, galleries and comic-page extras
    remove_tags = [
        dict(name='ol', attrs={'class': ['navigation']}),
        dict(name='span', attrs={'class': ['action']}),
        dict(name='div', attrs={'class': [
            'twitter comments-list hidden', 'related-news', 'col', 'photo-gallery',
            'calendario', 'article-comment', 'postto estirar', 'otras_vinetas estirar',
            'kment', 'user-actions']}),
        dict(name='div', attrs={'id': [
            'twitter-destacados', 'eco-tabs', 'inner', 'vineta_calendario',
            'vinetistas clearfix', 'otras_vinetas estirar', 'MIN1', 'main', 'SUP1', 'INT']}),
        dict(name='ul', attrs={'class': ['article-user-actions', 'stripped-list']}),
        dict(name='ul', attrs={'id': ['site-links']}),
        dict(name='li', attrs={'class': ['puntuacion', 'enviar', 'compartir']}),
    ]

    feeds = [
        (u'Portada', u'http://www.20minutos.es/rss/'),
        (u'Nacional', u'http://www.20minutos.es/rss/nacional/'),
        (u'Internacional', u'http://www.20minutos.es/rss/internacional/'),
        (u'Economia', u'http://www.20minutos.es/rss/economia/'),
        (u'Deportes', u'http://www.20minutos.es/rss/deportes/'),
        (u'Tecnologia', u'http://www.20minutos.es/rss/tecnologia/'),
        (u'Gente - TV', u'http://www.20minutos.es/rss/gente-television/'),
        (u'Motor', u'http://www.20minutos.es/rss/motor/'),
        (u'Salud', u'http://www.20minutos.es/rss/belleza-y-salud/'),
        (u'Viajes', u'http://www.20minutos.es/rss/viajes/'),
        (u'Vivienda', u'http://www.20minutos.es/rss/vivienda/'),
        (u'Empleo', u'http://www.20minutos.es/rss/empleo/'),
        (u'Cine', u'http://www.20minutos.es/rss/cine/'),
        (u'Musica', u'http://www.20minutos.es/rss/musica/'),
        (u'Vinetas', u'http://www.20minutos.es/rss/vinetas/'),
        (u'Comunidad20', u'http://www.20minutos.es/rss/zona20/'),
    ]

Hope you enjoy this version. I would like some feedback. |
01-28-2011, 03:17 AM | #5 |
Junior Member
Posts: 8
Karma: 10
Join Date: Jan 2011
Device: Kindle 3
|
This afternoon
I will test it. Thanks for your work!
I will give you some feedback tonight. |
01-28-2011, 12:34 PM | #6 |
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
|
20 Minutos (v0.8 ct)
A few small changes were necessary in the code for it to run properly in test mode with the ebook-convert command. Nothing was changed in the "real" code; only some non-ASCII characters were removed.
SOURCE CODE Code:
__license__ = 'GPL v3'
__author__ = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'

'''
www.20minutos.es
'''

class AdvancedUserRecipe1294946868(BasicNewsRecipe):

    title = u'20 Minutos'
    publisher = u'Grupo 20 Minutos'
    __author__ = 'Luis Hernandez'
    description = 'Periodico gratuito independiente'
    cover_url = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'

    oldest_article = 5
    max_articles_per_feed = 100
    remove_javascript = True
    no_stylesheets = True
    use_embedded_content = False
    encoding = 'ISO-8859-1'
    language = 'es'
    timefmt = '[%a, %d %b, %Y]'

    keep_only_tags = [
        dict(name='div', attrs={'id': ['content', 'vinetas']}),
        dict(name='div', attrs={'class': ['boxed', 'description', 'lead', 'article-content', 'cuerpo estirar']}),
        dict(name='span', attrs={'class': ['photo-bar']}),
        dict(name='ul', attrs={'class': ['article-author']}),
    ]

    remove_tags_before = dict(name='ul', attrs={'class': ['servicios-sub']})
    remove_tags_after = dict(name='div', attrs={'class': ['related-news', 'col']})

    remove_tags = [
        dict(name='ol', attrs={'class': ['navigation']}),
        dict(name='span', attrs={'class': ['action']}),
        dict(name='div', attrs={'class': [
            'twitter comments-list hidden', 'related-news', 'col', 'photo-gallery',
            'calendario', 'article-comment', 'postto estirar', 'otras_vinetas estirar',
            'kment', 'user-actions']}),
        dict(name='div', attrs={'id': [
            'twitter-destacados', 'eco-tabs', 'inner', 'vineta_calendario',
            'vinetistas clearfix', 'otras_vinetas estirar', 'MIN1', 'main', 'SUP1', 'INT']}),
        dict(name='ul', attrs={'class': ['article-user-actions', 'stripped-list']}),
        dict(name='ul', attrs={'id': ['site-links']}),
        dict(name='li', attrs={'class': ['puntuacion', 'enviar', 'compartir']}),
    ]

    feeds = [
        (u'Portada', u'http://www.20minutos.es/rss/'),
        (u'Nacional', u'http://www.20minutos.es/rss/nacional/'),
        (u'Internacional', u'http://www.20minutos.es/rss/internacional/'),
        (u'Economia', u'http://www.20minutos.es/rss/economia/'),
        (u'Deportes', u'http://www.20minutos.es/rss/deportes/'),
        (u'Tecnologia', u'http://www.20minutos.es/rss/tecnologia/'),
        (u'Gente - TV', u'http://www.20minutos.es/rss/gente-television/'),
        (u'Motor', u'http://www.20minutos.es/rss/motor/'),
        (u'Salud', u'http://www.20minutos.es/rss/belleza-y-salud/'),
        (u'Viajes', u'http://www.20minutos.es/rss/viajes/'),
        (u'Vivienda', u'http://www.20minutos.es/rss/vivienda/'),
        (u'Empleo', u'http://www.20minutos.es/rss/empleo/'),
        (u'Cine', u'http://www.20minutos.es/rss/cine/'),
        (u'Musica', u'http://www.20minutos.es/rss/musica/'),
        (u'Vinetas', u'http://www.20minutos.es/rss/vinetas/'),
        (u'Comunidad20', u'http://www.20minutos.es/rss/zona20/'),
    ]
|
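Python 2 reads a source file as ASCII unless it carries an encoding declaration, which is the likely reason the accented characters had to go. A minimal sketch for locating such characters in a recipe's source before testing it. The find_non_ascii helper and the two-line sample are hypothetical; this is not a calibre function.

```python
def find_non_ascii(source):
    """Return (line_number, column, character) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


# Hypothetical two-line excerpt standing in for the full recipe source.
sample = u"__author__ = 'Luis Hern\u00e1ndez'\ndescription = 'Periodico gratuito'\n"
problems = find_non_ascii(sample)
for line_no, col, ch in problems:
    print("line %d, column %d: %r" % (line_no, col, ch))
```

The other usual route is to keep the accents and put an encoding declaration such as `# -*- coding: utf-8 -*-` on the first line of the file; whether that suits a given test setup is an assumption to check.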
01-31-2011, 02:32 PM | #7 | |
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
|
Help me to improve this recipe
Hi all
I have a problem with this recipe. The original web page looks like this: [image] And my recipe renders it like this: [image] All the articles have the same problem. I have located the code responsible for this disaster in the page's HTML source: Quote:
I tried the preprocess_regexps option, but I don't know the syntax and I get the same error over and over again: [image] Can anybody help me? Thanks |
|
01-31-2011, 03:10 PM | #8 |
creator of calibre
Posts: 43,866
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Add
import re near the top of your recipe |
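For context: preprocess_regexps is a list of pairs, each a compiled pattern plus a function that receives the match and returns the replacement text, so the recipe has to import re before the class body can call re.compile. A small stand-alone sketch that applies such a list to a string. The apply_regexps helper, the comment-stripping rule and the sample HTML are made up for illustration and only imitate what calibre does with the list.

```python
import re

# Same shape as a recipe's preprocess_regexps: (compiled pattern, callable) pairs.
# This particular rule (strip HTML comments) is only an example.
preprocess_regexps = [
    (re.compile(r'<!--.*?-->', re.DOTALL), lambda match: ''),
]


def apply_regexps(html, rules):
    """Run each (pattern, replacement function) pair over the HTML in order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


sample_html = '<p>Hola<!-- ad\nslot --> mundo</p>'
cleaned = apply_regexps(sample_html, preprocess_regexps)
print(cleaned)
```

Without the import, the name re is undefined when the class body is evaluated, which would produce the same error on every run.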
01-31-2011, 04:01 PM | #9 |
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
|
Nice! Fantastic! It works! A new version of 20 Minutos is coming now, stay tuned...
|
01-31-2011, 07:10 PM | #10 |
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
|
20 Minutos (v0.85)
... and here it is:
CHANGELOG
- Changed oldest_article from 5 to 2; the ebook is now around 3 MB
- Added CSS styling; it looks better now
- Adjusted the code to remove some unwanted content
- Other minor changes
NOTES
This is my first time using the re module. The comics are unchanged this time; maybe in a future version (with a little more knowledge I may be able to fix them). Apart from the comics, the recipe looks fantastic now.
SOURCE CODE Code:
__license__ = 'GPL v3'
__author__ = 'Luis Hernandez'
__copyright__ = 'Luis Hernandez<tolyluis@gmail.com>'
__version__ = 'v0.85'
__date__ = '31 January 2011'

'''
www.20minutos.es
'''

import re

class AdvancedUserRecipe1294946868(BasicNewsRecipe):

    title = u'20 Minutos'
    publisher = u'Grupo 20 Minutos'
    __author__ = 'Luis Hernandez'
    description = 'Free spanish newspaper'
    cover_url = 'http://estaticos.20minutos.es/mmedia/especiales/corporativo/css/img/logotipos_grupo20minutos.gif'

    oldest_article = 2
    max_articles_per_feed = 100
    remove_javascript = True
    no_stylesheets = True
    use_embedded_content = False
    encoding = 'ISO-8859-1'
    language = 'es_ES'
    timefmt = '[%a, %d %b, %Y]'
    remove_empty_feeds = True

    keep_only_tags = [
        dict(name='div', attrs={'id': ['content', 'vinetas']}),
        dict(name='div', attrs={'class': ['boxed', 'description', 'lead', 'article-content', 'cuerpo estirar']}),
        dict(name='span', attrs={'class': ['photo-bar']}),
        dict(name='ul', attrs={'class': ['article-author']}),
    ]

    remove_tags_before = dict(name='ul', attrs={'class': ['servicios-sub']})
    remove_tags_after = dict(name='div', attrs={'class': ['related-news', 'col']})

    remove_tags = [
        dict(name='ol', attrs={'class': ['navigation']}),
        dict(name='span', attrs={'class': ['action']}),
        dict(name='div', attrs={'class': [
            'twitter comments-list hidden', 'related-news', 'col', 'photo-gallery',
            'photo-gallery side-art-block', 'calendario', 'article-comment',
            'postto estirar', 'otras_vinetas estirar', 'kment', 'user-actions']}),
        dict(name='div', attrs={'id': [
            'twitter-destacados', 'eco-tabs', 'inner', 'vineta_calendario',
            'vinetistas clearfix', 'otras_vinetas estirar', 'MIN1', 'main', 'SUP1', 'INT']}),
        dict(name='ul', attrs={'class': ['article-user-actions', 'stripped-list']}),
        dict(name='ul', attrs={'id': ['site-links']}),
        dict(name='li', attrs={'class': ['puntuacion', 'enviar', 'compartir']}),
    ]

    extra_css = """
        p{text-align: justify; font-size: 100%}
        body{ text-align: left; font-size:100% }
        h3{font-family: sans-serif; font-size:150%; font-weight:bold; text-align: justify; }
    """

    # Strip the opening link tag that breaks the article layout
    preprocess_regexps = [
        (re.compile(r'<a href="http://estaticos.*?[0-999]px;" target="_blank">', re.DOTALL), lambda m: ''),
    ]

    feeds = [
        (u'Portada', u'http://www.20minutos.es/rss/'),
        (u'Nacional', u'http://www.20minutos.es/rss/nacional/'),
        (u'Internacional', u'http://www.20minutos.es/rss/internacional/'),
        (u'Economia', u'http://www.20minutos.es/rss/economia/'),
        (u'Deportes', u'http://www.20minutos.es/rss/deportes/'),
        (u'Tecnologia', u'http://www.20minutos.es/rss/tecnologia/'),
        (u'Gente - TV', u'http://www.20minutos.es/rss/gente-television/'),
        (u'Motor', u'http://www.20minutos.es/rss/motor/'),
        (u'Salud', u'http://www.20minutos.es/rss/belleza-y-salud/'),
        (u'Viajes', u'http://www.20minutos.es/rss/viajes/'),
        (u'Vivienda', u'http://www.20minutos.es/rss/vivienda/'),
        (u'Empleo', u'http://www.20minutos.es/rss/empleo/'),
        (u'Cine', u'http://www.20minutos.es/rss/cine/'),
        (u'Musica', u'http://www.20minutos.es/rss/musica/'),
        (u'Vinetas', u'http://www.20minutos.es/rss/vinetas/'),
        (u'Comunidad20', u'http://www.20minutos.es/rss/zona20/'),
    ]

Last edited by tolyluis; 01-31-2011 at 07:34 PM. |
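A note on the pattern in preprocess_regexps: inside square brackets, [0-999] is a character class equal to [0-9], so it matches a single digit and not a number up to 999. The rule still works because the lazy .*? before it absorbs any earlier digits. The sketch below checks this on a made-up anchor tag; the href and style values are assumptions, since the original markup was not preserved in the thread.

```python
import re

# The pattern as posted, and the same pattern with the class written explicitly.
original = re.compile(r'<a href="http://estaticos.*?[0-999]px;" target="_blank">', re.DOTALL)
explicit = re.compile(r'<a href="http://estaticos.*?[0-9]px;" target="_blank">', re.DOTALL)

# Hypothetical markup; the real tag from the site is not shown in the thread.
sample = ('<a href="http://estaticos.example/foto.jpg" '
          'style="width:450px;" target="_blank"><img src="foto.jpg"/></a>')

stripped_original = original.sub('', sample)
stripped_explicit = explicit.sub('', sample)
print(stripped_original)
```

Both patterns remove the same text. Only the opening tag is stripped in this sketch; the closing </a> stays in the HTML.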
02-01-2011, 03:14 AM | #11 | |
Connoisseur
Posts: 76
Karma: 12
Join Date: Nov 2010
Device: Android, PB Pro 602
|
Quote:
Try omitting text-align: justify, or change it to text-align: left, in extra_css. IMHO this looks much better on mobile reading devices. The cover page (the logo of the periodical) does not look good. Try to find something different; maybe you can find the front page of the daily edition. |
|
02-01-2011, 09:38 AM | #12 | |
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
|
Quote:
First, sorry, but I don't like text-align: left; I prefer justified text. IMHO it looks better on my Kindle. If you want text-align: left, just customize it! The second suggestion will be looked at in a future version (along with the comics). Thanks for your feedback! |
|
02-01-2011, 10:02 AM | #13 |
Connoisseur
Posts: 76
Karma: 12
Join Date: Nov 2010
Device: Android, PB Pro 602
|
|
02-01-2011, 11:20 AM | #14 | |
Enthusiast
Posts: 49
Karma: 196
Join Date: Jan 2011
Device: Kindle 3
|
Quote:
Just change the title (set it to "mi 20 minutos", for example), press Add/update recipe and voilà! You have a new personalized recipe that updates will not overwrite. Last edited by tolyluis; 02-01-2011 at 06:14 PM. |
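The same idea can be sketched in plain Python: take the recipe source as text, give it a new title and, if wanted, a different text alignment. The personalize helper and the two-line fragment are hypothetical stand-ins for the real recipe source; in practice the edit is simply made by hand in calibre.

```python
import re


def personalize(source, new_title, align=None):
    """Return a copy of the recipe source with a new title and optional alignment."""
    # Replace the value of the first title = u'...' assignment.
    source = re.sub(
        r"(title\s*=\s*u?)'[^']*'",
        lambda m: "%s'%s'" % (m.group(1), new_title),
        source,
        count=1,
    )
    # Optionally swap justified text for another alignment in extra_css.
    if align is not None:
        source = source.replace('text-align: justify', 'text-align: ' + align)
    return source


# Hypothetical fragment standing in for the full recipe.
fragment = (
    "title = u'20 Minutos'\n"
    "extra_css = 'p{text-align: justify; font-size: 100%}'\n"
)
custom = personalize(fragment, 'mi 20 minutos', align='left')
print(custom)
```

Because the copy carries a different title, it lives alongside the built-in recipe as a separate entry.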
|