07-08-2016, 07:42 AM | #1 |
Enthusiast
Posts: 43
Karma: 10
Join Date: Oct 2012
Location: Belgium
Device: Promedia e-reader (Onyx C67ML) - Aldi2016 / Former: Sony PRS-T1&T2
|
Belgian-Dutch recipes Broken (for some time)
None of the built-in Belgian (Dutch-language) recipes work any more.
As I am a newbie when it comes to recipes, reporting that they do not work is my only contribution. Sorry. What I get is a menu and a section menu, but no articles. It happens with all Belgian news sources. I haven't checked the Dutch sources yet. For example:
| Volgende | Paragraafmenu | Hoofdmenu | This article was downloaded by calibre from http://www.gva.be/cnt/dmf20160708_02...haven-zaventem | Paragraafmenu | Hoofdmenu |
07-08-2016, 08:27 AM | #2 |
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I don't maintain recipes for languages I cannot read, as that makes it much harder to understand the website being scraped. So you will have to hope that someone who both reads the language and knows how to code is willing to help.
|
07-08-2016, 09:03 AM | #3 | |
Enthusiast
Posts: 43
Karma: 10
Join Date: Oct 2012
Location: Belgium
Device: Promedia e-reader (Onyx C67ML) - Aldi2016 / Former: Sony PRS-T1&T2
|
Thanks for the quick reply
Quote:
If nobody comes forward soon, I would suggest deleting the recipes. If someone does come forward, I can help, as the Belgian newspaper market has changed considerably. Keep up the good work.
|
07-08-2016, 01:45 PM | #4 |
Bookish
Posts: 907
Karma: 1803094
Join Date: Jun 2011
Device: PC, t1, t2, t3, aura 2 v1, clara HD, Libra 2, Nxtpaper 11
|
It would help to mention which ones did not work for you ...
I just tried some Belgian and Dutch recipes and they all work. Yes, they are slow loading, so be patient! |
07-08-2016, 04:17 PM | #5 |
Member
Posts: 16
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
|
Well, the downloaded files do not contain any articles, so I would indeed say they are broken. I just had a quick look at the GVA recipe mentioned in the first post. It was easy to fix, but I'll leave the other recipes to someone else.
Update for gva_be.recipe: Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

__license__ = 'GPL v3'
__copyright__ = '2009, Darko Miletic <darko.miletic at gmail.com>'
'''
www.gva.be
'''

import re
from calibre.web.feeds.news import BasicNewsRecipe


class GazetvanAntwerpen(BasicNewsRecipe):
    title = 'Gazet van Antwerpen'
    __author__ = 'Darko Miletic'
    description = 'News from Belgium in Dutch'
    publisher = 'Gazet van Antwerpen'
    category = 'news, politics, Belgium'
    language = 'nl_BE'

    oldest_article = 2
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True

    masthead_url = 'http://2.gvacdn.be/extra/assets/img/gazet-van-antwerpen-red.svg'

    feeds = [
        ('Stad & Regio', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/stadenregio'),
        ('Economie', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/economie'),
        ('Binnenland', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/binnenland'),
        ('Buitenland', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/buitenland'),
        ('Media & Cultuur', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/mediaencultuur'),
        ('Sport', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/sport')
    ]

    keep_only_tags = [
        dict(name='header', attrs={'class':'article__header'}),
        dict(name='footer', attrs={'class':'article__meta'}),
        dict(name='div', attrs={'class':['article', 'article__body', 'slideshow__intro']}),
        dict(name='figure', attrs={'class':'article__image'})
    ]

    remove_tags = [
        dict(name=['embed', 'object']),
        dict(name='div', attrs={'class':['note NotePortrait', 'note']}),
        dict(name='ul', attrs={'class':re.compile('article__share')}),
        dict(name='div', attrs={'class':'slideshow__controls'}),
        dict(name='a', attrs={'role':'button'}),
        dict(name='figure', attrs={'class':re.compile('video')})
    ]

    remove_attributes = ['width', 'height']

    def preprocess_html(self, soup):
        del soup.body['onload']
        for item in soup.findAll(style=True):
            del item['style']
        return soup
07-09-2016, 06:37 PM | #6 |
Enthusiast
Posts: 43
Karma: 10
Join Date: Oct 2012
Location: Belgium
Device: Promedia e-reader (Onyx C67ML) - Aldi2016 / Former: Sony PRS-T1&T2
|
|
07-10-2016, 04:35 AM | #7 |
Member
Posts: 16
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
|
Hi Kunvp,
looking at my changes to the gva_be.recipe will probably not help you very much to understand how to work on other recipes. I removed some obsolete code which makes the change look bigger than it actually was.
As far as I can see, all of the Belgian Dutch news sources have a valid table of contents. This means the feed addresses are still correct, but there's something wrong with the extraction of the content. Modifying the keep_only_tags and remove_tags sections should be sufficient in this case. For example, if you look at the demorgen_be.recipe you will find the line:
Code:
keep_only_tags = [dict(name='div', attrs={'class':'art_box2'})]
Code:
keep_only_tags = [dict(name='div', attrs={'class':'article__wrapper'})]
For an in-depth explanation of recipe programming just have a look at the Calibre documentation: https://manual.calibre-ebook.com/news.html
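To see why a stale selector produces a table of contents but no article text, it can help to count how many elements a keep_only_tags entry would match on a page. The following is only a rough sketch in standalone Python 3, not recipe code: the sample page is made up, count_matches is a hypothetical helper, and the matching rule (tag name equal, class present among the element's classes) is a simplification of what calibre does through BeautifulSoup. It also assumes that the second demorgen_be.recipe line above is meant to replace the first.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record (tag name, set of classes) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # An attribute without a value is reported as None, hence the "or ''".
        classes = set((dict(attrs).get('class') or '').split())
        self.seen.append((tag, classes))


def count_matches(html, spec):
    """Count elements matching one simplified keep_only_tags entry.

    Hypothetical helper: only handles a single class given as a string.
    """
    collector = TagCollector()
    collector.feed(html)
    wanted = spec['attrs']['class']
    return sum(1 for tag, classes in collector.seen
               if tag == spec['name'] and wanted in classes)


# Made-up page using the newer class name from the post above.
SAMPLE_PAGE = (
    '<html><body>'
    '<div class="article__wrapper"><h1>Titel</h1><p>Tekst</p></div>'
    '</body></html>'
)

old_spec = dict(name='div', attrs={'class': 'art_box2'})
new_spec = dict(name='div', attrs={'class': 'article__wrapper'})

old_hits = count_matches(SAMPLE_PAGE, old_spec)
new_hits = count_matches(SAMPLE_PAGE, new_spec)
print(old_hits, new_hits)
```

On the sample page the old selector matches nothing, so nothing would be kept, while the new one matches the article container. Running the same count against the saved HTML of a real article page is one way to find the class name to put into keep_only_tags.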
07-11-2016, 05:28 AM | #8 |
Enthusiast
Posts: 43
Karma: 10
Join Date: Oct 2012
Location: Belgium
Device: Promedia e-reader (Onyx C67ML) - Aldi2016 / Former: Sony PRS-T1&T2
|
I'll give it a try, but don't expect it by tomorrow.
Thank you Aimylios, your explanation is very useful for starting to understand the issue.
Back at school, decades ago actually, I had to write scripts to convert layout code from PC to high-end systems. This looks kind of similar. I'll give it a try, but don't expect it by tomorrow. :-)
07-28-2016, 05:11 AM | #9 | |
Member
Posts: 11
Karma: 10
Join Date: Feb 2010
Location: Belgium
Device: Foxit eSlick > Amazon Kindle PW3
|
Quote:
Code:
publisher = 'Gazet van Antwerpen'
Code:
publisher = 'Mediahuis'
Using the example above I've come up with a recipe for another newspaper from the same publisher, Het Nieuwsblad:
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

import re  # needed for re.compile() in remove_tags below
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1467571059(BasicNewsRecipe):
    title = 'Het Nieuwsblad'
    __author__ = 'Darko Miletic, Aimylios, oCkz7bJ_'
    description = 'Het Nieuwsblad is goed voor u.'
    publisher = 'Mediahuis'
    category = 'news, politics, Belgium'
    language = 'nl_BE'

    oldest_article = 2
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True

    cover_url = 'http://www.lottocyclingcup.be/lc15/dendermedia/images/details/foto/partners_36/nieuwsblad_1_20160210_1242138359.jpg'
    masthead_url = 'http://www.mediahuisconnect.be/uploads/media/5576fa0b83c38/nieuwsblad.svg'

    # Source: http://www.nieuwsblad.be/rss
    feeds = [
        # Nieuws
        ('Snelnieuws', 'http://feeds.nieuwsblad.be/nieuws/snelnieuws'),
        ('Binnenland', 'http://feeds.nieuwsblad.be/nieuws/binnenland'),
        ('Buitenland', 'http://feeds.nieuwsblad.be/nieuwsblad/buitenland'),
        # Economie
        ('Economie', 'http://feeds.nieuwsblad.be/economie/home'),
        ('Consument', 'http://feeds.nieuwsblad.be/economie/algemeen'),
        ('Bedrijven', 'http://feeds.nieuwsblad.be/economie/bedrijven'),
        ('Werk', 'http://feeds.nieuwsblad.be/economie/Werk'),
        ('Beurs', 'http://feeds.nieuwsblad.be/economie/beurs'),
        # Regio
        #('0123 Region1', 'http://www.nieuwsblad.be/rss.aspx?intro=1&section=postcode&postcode=0123'),
        #('3456 Region2', 'http://www.nieuwsblad.be/rss.aspx?intro=1&section=postcode&postcode=3456'),
        #('6789 Region3', 'http://www.nieuwsblad.be/rss.aspx?intro=1&section=postcode&postcode=6789'),
        # Sport
        ('Voetbal', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/voetbal'),
        ('Wielrennen', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/wielrennen'),
        ('Tennis', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/tennis'),
        ('Autosport', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/autosport'),
        ('Basketbal', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/basketbal'),
        ('Volleybal', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/volleybal'),
        ('Atletiek', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/atletiek'),
        # Extra
        ('Film', 'http://feeds.nieuwsblad.be/life/film'),
        ('Boek', 'http://feeds.nieuwsblad.be/life/boeken'),
        ('Muziek', 'http://feeds.nieuwsblad.be/life/muziek'),
        ('Podium', 'http://feeds.nieuwsblad.be/life/podium'),
        ('TV & Radio', 'http://feeds.nieuwsblad.be/life/tv'),
        # She
        ('BV & Co', 'http://feeds.nieuwsblad.be/life/bv'),
        ('Mode & Design', 'http://feeds.nieuwsblad.be/life/mode'),
        ('Culinair', 'http://feeds.nieuwsblad.be/life/culinair'),
        ('Gezondheid', 'http://feeds.nieuwsblad.be/life/gezondheid'),
        ('Reizen', 'http://feeds.nieuwsblad.be/life/reizen'),
        ('Dieren', 'http://feeds.nieuwsblad.be/life/dieren'),
        # Weblog
        ('Surfplank', 'http://nieuwsblad.typepad.com/surfplank/atom.xml'),
        ('Boeken', 'http://nieuwsblad.typepad.com/boeken/atom.xml'),
        ('Strips', 'http://nieuwsblad.typepad.com/strips/atom.xml'),
        ('DVD', 'http://nieuwsblad.typepad.com/dvd/atom.xml'),
        ('Dierendoktor', 'http://nieuwsblad.typepad.com/dierendokter/atom.xml'),
        ('Zapdog', 'http://nieuwsblad.typepad.com/zapdog/atom.xml'),
    ]

    keep_only_tags = [
        dict(name='header', attrs={'class':'article__header'}),
        dict(name='footer', attrs={'class':'article__meta'}),
        dict(name='div', attrs={'class':['article', 'article__body', 'slideshow__intro']}),
        dict(name='figure', attrs={'class':'article__image'})
    ]

    remove_tags = [
        dict(name=['embed', 'object']),
        dict(name='div', attrs={'class':['note NotePortrait', 'note']}),
        dict(name='ul', attrs={'class':re.compile('article__share')}),
        dict(name='div', attrs={'class':'slideshow__controls'}),
        dict(name='a', attrs={'role':'button'}),
        dict(name='figure', attrs={'class':re.compile('video')})
    ]

    remove_attributes = ['width', 'height']

    def preprocess_html(self, soup):
        del soup.body['onload']
        for item in soup.findAll(style=True):
            del item['style']
        return soup
Last edited by oCkz7bJ_; 07-31-2016 at 08:06 AM. Reason: Alignment
|
07-28-2016, 07:36 AM | #10 |
Member
Posts: 11
Karma: 10
Join Date: Feb 2010
Location: Belgium
Device: Foxit eSlick > Amazon Kindle PW3
|
Here's a fairly simple one for DataNews:
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1468055030(BasicNewsRecipe):
    title = 'DataNews'
    __author__ = 'oCkz7bJ_'
    description = 'Technology / Best Practice / Business'
    publisher = 'Roularta Media Group'
    category = 'news, information technology, Belgium'
    language = 'nl_BE'

    oldest_article = 2
    max_articles_per_feed = 100
    auto_cleanup = True
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True

    cover_url = 'http://datablend.be/wp-content/uploads/2014/01/Data_News_logo-short.jpg'
    masthead_url = 'http://datanews.knack.be/images/svg/logos/logo_Site-DataNews-NL.svg'

    # Source: http://datanews.knack.be/rss/
    feeds = [
        ('Technology', 'http://datanews.knack.be/ict/feed.rss'),
        ('Opinie', 'http://datanews.knack.be/ict/opinie/feed.rss'),
        ('Gadgets', 'http://datanews.knack.be/ict/gadgets/feed.rss'),
        ('Foto', 'http://datanews.knack.be/ict/foto/feed.rss'),
        ('Nieuws', 'http://datanews.knack.be/ict/nieuws/feed.rss'),
        ('Reviews', 'http://datanews.knack.be/ict/reviews/feed.rss'),
        ('Startups', 'http://datanews.knack.be/ict/start-ups/feed.rss'),
    ]
Last edited by oCkz7bJ_; 07-28-2016 at 07:37 AM. Reason: added link to datanews
07-29-2016, 04:37 AM | #11 | |
Enthusiast
Posts: 43
Karma: 10
Join Date: Oct 2012
Location: Belgium
Device: Promedia e-reader (Onyx C67ML) - Aldi2016 / Former: Sony PRS-T1&T2
|
Mediahuis - some words on
Quote:
Dear oCkz7bJ_,
Thank you so much for having a look at this. I have too little knowledge to write clean scripts.
To answer your question directly: I wouldn't use 'Mediahuis' as the title, as users in Belgium don't recognise it as a news source. That is, unless you don't intend to change the title of the source, because it is correct that the publisher is Mediahuis.
Mediahuis (founded 2013) is a joint venture of two publishers (newspapers and online news sites), according to Wikipedia. They run the following sites:
1. Het Nieuwsblad ("De Gentenaar")
2. Gazet van Antwerpen
3. Het Belang van Limburg
All these titles have a strong focus on regional content in the paper edition, and each has different regional content. This is reflected on their websites.
De Standaard is the so-called quality newspaper of the group. It uses a lot of content from Het Nieuwsblad, but has more editorials, opinion pieces, etc.
All Belgian news sources have decreased the length and number of FREE articles.
To summarise:
- Mediahuis will not be recognised by users. That is not a problem as long as it does not influence the title visible to users.
- It can be expected that the scripts for the different titles will be fairly identical.
- The regional content should be different.
Does all this answer your question? I have the feeling I made it all too complex. :-p
Many greetings,
Koen
Last edited by Kunvp; 07-29-2016 at 04:40 AM. Reason: I didn't understand the question. :-p
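On the point that regional content should differ per reader: the Het Nieuwsblad recipe earlier in the thread contains commented-out "Regio" feeds keyed by postcode. A minimal sketch of how such feed entries could be generated is given below. It is standalone Python 3, not recipe code. The URL pattern is taken from those commented-out entries and has not been tested against the live site. The regio_feed helper and the labels are hypothetical. 9000 and 2000 are the postcodes of Gent and Antwerpen.

```python
from urllib.parse import urlencode

# URL pattern assumed from the commented-out Regio entries in the
# Het Nieuwsblad recipe; whether the site still serves it is untested.
BASE_URL = 'http://www.nieuwsblad.be/rss.aspx'


def regio_feed(label, postcode):
    """Build one (title, url) feed tuple for a Belgian postcode (hypothetical helper)."""
    query = urlencode([
        ('intro', 1),
        ('section', 'postcode'),
        ('postcode', postcode),
    ])
    return (label, BASE_URL + '?' + query)


# Illustrative labels; pick the postcodes you actually want to follow.
regio_feeds = [
    regio_feed('Gent', '9000'),
    regio_feed('Antwerpen', '2000'),
]

for title, url in regio_feeds:
    print(title, url)
```

The resulting tuples have the same (title, url) shape as the other entries in a recipe's feeds list, so they could be pasted in under the "# Regio" comment.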
|
07-29-2016, 05:51 AM | #12 | |||
Member
Posts: 11
Karma: 10
Join Date: Feb 2010
Location: Belgium
Device: Foxit eSlick > Amazon Kindle PW3
|
Quote:
Code:
title = 'Gazet van Antwerpen'
publisher = 'Gazet van Antwerpen'
Code:
title = 'Gazet van Antwerpen'
publisher = 'Mediahuis'
Quote:
Quote:
I suspect even a digital subscription will not provide a full content RSS feed, it's either app (iOS & Android only) or website. I'll try asking around "via via". I'd consider a subscription if they would publish their newspaper in proper ebook format. (None of the Belgian publishers do AFAIK). |
|||
08-04-2016, 02:44 PM | #13 |
Member
Posts: 11
Karma: 10
Join Date: Feb 2010
Location: Belgium
Device: Foxit eSlick > Amazon Kindle PW3
|
Here's a recipe that seems to work for De Standaard:
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

import re  # needed for re.compile() in remove_tags below
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1467571059(BasicNewsRecipe):
    title = 'De Standaard'
    __author__ = 'Darko Miletic, Aimylios, oCkz7bJ_'
    description = 'De Standaard'
    publisher = 'Mediahuis'
    category = 'news, politics, Belgium'
    language = 'nl_BE'

    oldest_article = 2
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True

    cover_url = 'http://www.standaard.be/extra/assets/extra/dslive/headers/ds-black.svg'
    masthead_url = 'http://tonysweb.be/m/img/tijdschriften/de_standaard.svg'

    # Source: http://www.standaard.be/rssfeeds
    feeds = [
        # Nieuws
        ('Binnenland', 'http://www.standaard.be/rss/section/1f2838d4-99ea-49f0-9102-138784c7ea7c'),
        ('Buitenland', 'http://www.standaard.be/rss/section/e70ccf13-a2f0-42b0-8bd3-e32d424a0aa0'),
        ('Cultuur', 'http://www.standaard.be/rss/section/ab8d3fd8-bf2f-487a-818b-9ea546e9a859'),
        ('Media', 'http://www.standaard.be/rss/section/eb1a6433-ca3f-4a3b-ab48-a81a5fb8f6e2'),
        ('Economie', 'http://www.standaard.be/rss/section/451c8e1e-f9e4-450e-aa1f-341eab6742cc'),
        ('Sport', 'http://www.standaard.be/rss/section/8f693cea-dba8-46e4-8575-807d1dc2bcb7'),
        ('Beroemd en Bizar', 'http://www.standaard.be/rss/section/113a9a78-f65a-47a8-bd1c-b24483321d0f'),
        # Standaard.biz
        ('Overzicht', 'http://www.standaard.be/rss/section/a30afc42-3737-4301-8f8a-5b6833855457'),
        ('Economie', 'http://www.standaard.be/rss/section/212b8b54-bd91-4c8b-942c-8029e8797d36'),
        ('Bedrijven', 'http://www.standaard.be/rss/section/6aa8d4fa-4b9a-40d5-aa8f-87ac72472f27'),
        ('Consument', 'http://www.standaard.be/rss/section/46025691-2ec4-4a06-b6d7-9773686a24a7'),
        ('Beurs', 'http://www.standaard.be/rss/section/74cef9d1-3b28-4b90-943a-ce685bf6ed6e'),
        ('Marketing & Media', 'http://www.standaard.be/rss/section/9bdf4a14-f8bf-4439-aaf1-344181f73e73'),
        ('Mobilia', 'http://www.standaard.be/rss/section/270b7f8f-dd73-44cb-b622-9f7200a439a7'),
        # Lifestyle
        ('Mode', 'http://www.standaard.be/rss/section/3a4b39a1-e58f-42e4-8ae9-a0f90f97f27f'),
        ('Beauty', 'http://www.standaard.be/rss/section/51dd6a40-e297-409c-af25-9f0301159a1c'),
        ('Culinair', 'http://www.standaard.be/rss/section/ec1dbffa-a00b-48e6-96f0-00d215f90744'),
        ('Reizen', 'http://www.standaard.be/rss/section/eed96e23-ed90-4818-83ab-adabf8caf0f4'),
        ('Design & Wonen', 'http://www.standaard.be/rss/section/f4dd4e8d-6cb1-4eef-abc2-06b0e3d72de4'),
        ('Gezondheid & Psycho', 'http://www.standaard.be/rss/section/a166bb48-b6b4-4c1a-beb3-9f0301160b75'),
        ('Glamour', 'http://www.standaard.be/rss/section/06b5429e-beb1-4e76-909c-9f0301162a9c'),
        ('Lifestyleblog', 'http://www.standaard.be/rss/section/246d27cb-ce7b-4245-bad4-a09f0119b450'),
        # Weblogs
        ('Autoblog', 'http://www.standaard.be/rss/tag/autoblog'),
        ('Beursexperts', 'http://www.standaard.be/rss/tag/beursexperts'),
        ('En nu even elders', 'http://www.standaard.be/rss/tag/blog-en-nu-even-elders'),
        ('Marketingblog', 'http://www.standaard.be/rss/tag/marketingblog'),
        ('TV-blog', 'http://www.standaard.be/rss/tag/tv-blog'),
        # Interactie
        ('Opinies', 'http://feeds.feedburner.com/dso-meningen-opinie')
    ]

    keep_only_tags = [
        dict(name='header', attrs={'class':'article__header'}),
        dict(name='footer', attrs={'class':'article__meta'}),
        #dict(name='div', attrs={'class':['article', 'article__body', 'slideshow__intro']}),
        dict(name='article', attrs={'class':'article-full'}),
        dict(name='figure', attrs={'class':'article__image'})
    ]

    remove_tags = [
        dict(name=['embed', 'object']),
        dict(name='div', attrs={'class':['note NotePortrait', 'note']}),
        dict(name='ul', attrs={'class':re.compile('article__share')}),
        dict(name='div', attrs={'class':'slideshow__controls'}),
        dict(name='a', attrs={'role':'button'}),
        dict(name='figure', attrs={'class':re.compile('video')})
    ]

    remove_attributes = ['width', 'height']

    def preprocess_html(self, soup):
        del soup.body['onload']
        for item in soup.findAll(style=True):
            del item['style']
        return soup
Code:
#dict(name='div', attrs={'class':['article', 'article__body', 'slideshow__intro']}),
dict(name='article', attrs={'class':'article-full'}),
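The three Mediahuis recipes in this thread (Gazet van Antwerpen, Het Nieuwsblad, De Standaard) share every keep_only_tags entry except the one for the article body. The sketch below only illustrates that difference in standalone Python 3. keep_only_tags_for and BODY_SELECTOR are hypothetical names, not part of calibre, and each recipe file would still need to carry its own final list.

```python
# Selectors common to the three Mediahuis recipes posted in this thread.
HEADER = dict(name='header', attrs={'class': 'article__header'})
FOOTER = dict(name='footer', attrs={'class': 'article__meta'})
FIGURE = dict(name='figure', attrs={'class': 'article__image'})

# The container holding the article body is the only per-title difference.
DIV_BODY = dict(name='div',
                attrs={'class': ['article', 'article__body', 'slideshow__intro']})
BODY_SELECTOR = {
    'Gazet van Antwerpen': DIV_BODY,
    'Het Nieuwsblad': DIV_BODY,
    'De Standaard': dict(name='article', attrs={'class': 'article-full'}),
}


def keep_only_tags_for(title):
    """Return the keep_only_tags list for one title (hypothetical helper).

    The body selector stays in third position, as in the posted recipes.
    """
    return [HEADER, FOOTER, BODY_SELECTOR[title], FIGURE]


for title in BODY_SELECTOR:
    print(title, keep_only_tags_for(title)[2])
```

Laid out this way, adapting the recipe to another title from the same publisher comes down to finding the one container element that wraps the article body on that site.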
08-05-2016, 08:47 AM | #14 | |
Enthusiast
Posts: 43
Karma: 10
Join Date: Oct 2012
Location: Belgium
Device: Promedia e-reader (Onyx C67ML) - Aldi2016 / Former: Sony PRS-T1&T2
|
Quote:
Koen |
|
07-21-2017, 01:31 PM | #15 |
Junior Member
Posts: 1
Karma: 10
Join Date: Jul 2017
Device: Kobo H2O
|
I just bumped into this thread through a Google search. Thanks so much for writing this code and making e-readers a tad more worthwhile. I have to say I haven't tried it yet, but I have full digital access to De Standaard and was wondering if there's any way I can download that day's newspaper to read it on my e-reader? Probably not without a serious overhaul?
|
|