|
|
#1 |
|
Enthusiast
![]() Posts: 43
Karma: 10
Join Date: Oct 2012
Location: Belgium
Device: Promedia e-reader (Onyx C67ML) - Aldi2016 / Former: Sony PRS-T1&T2
|
Belgian-Dutch recipes Broken (for some time)
None of the built-in Belgian (Dutch-language) recipes work any more.
As I am a newbie when it comes to recipes, reporting that they do not work is my only contribution. Sorry. What I get is a menu and a section menu, but no articles. It happens with all Belgian news sources. I haven't checked the Dutch sources yet. The output looks like this (the labels are Dutch for Next, Section menu and Main menu): | Volgende | Paragraafmenu | Hoofdmenu | This article was downloaded by calibre from http://www.gva.be/cnt/dmf20160708_02...haven-zaventem | Paragraafmenu | Hoofdmenu | |
|
|
|
|
|
#2 |
|
creator of calibre
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 45,617
Karma: 28549044
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I don't maintain recipes for languages I cannot read, as that makes it much harder to understand the website being scraped. So you will have to hope that someone who both reads the language and knows how to code is willing to help.
|
|
|
|
|
|
|
|
#3 | |
|
Enthusiast
![]() Posts: 43
Karma: 10
Join Date: Oct 2012
Location: Belgium
Device: Promedia e-reader (Onyx C67ML) - Aldi2016 / Former: Sony PRS-T1&T2
|
Thanks for the quick reply.
If nobody comes forward in the coming days, I would suggest deleting the recipes. If someone does come forward, I can help, as the Belgian newspaper market has changed considerably. Keep up the good work. |
|
|
|
|
|
|
#4 |
|
Bookish
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Posts: 1,049
Karma: 2006208
Join Date: Jun 2011
Device: PC, t1, t2, t3, Clara BW, Clara HD, Libra 2, Libra Color, Nxtpaper 11
|
It would help to mention which ones did not work for you ...
I just tried some Belgian and Dutch recipes and they all work. Yes, they are slow loading, so be patient! |
|
|
|
|
|
#5 |
|
Member
![]() Posts: 17
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
|
Well, the downloaded files do not contain any articles, so I would indeed say they are broken. I just had a quick look at the GVA recipe mentioned in the first post. It was easy to fix, but I'll leave the other recipes to someone else.
Update for gva_be.recipe: Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

__license__ = 'GPL v3'
__copyright__ = '2009, Darko Miletic <darko.miletic at gmail.com>'
'''
www.gva.be
'''

import re

from calibre.web.feeds.news import BasicNewsRecipe


class GazetvanAntwerpen(BasicNewsRecipe):
    title = 'Gazet van Antwerpen'
    __author__ = 'Darko Miletic'
    description = 'News from Belgium in Dutch'
    publisher = 'Gazet van Antwerpen'
    category = 'news, politics, Belgium'
    language = 'nl_BE'
    oldest_article = 2
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True
    masthead_url = 'http://2.gvacdn.be/extra/assets/img/gazet-van-antwerpen-red.svg'

    feeds = [
        ('Stad & Regio', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/stadenregio'),
        ('Economie', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/economie'),
        ('Binnenland', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/binnenland'),
        ('Buitenland', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/buitenland'),
        ('Media & Cultuur', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/mediaencultuur'),
        ('Sport', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/sport')
    ]

    # Only these elements of the page are kept as article content.
    keep_only_tags = [
        dict(name='header', attrs={'class': 'article__header'}),
        dict(name='footer', attrs={'class': 'article__meta'}),
        dict(name='div', attrs={'class': ['article', 'article__body', 'slideshow__intro']}),
        dict(name='figure', attrs={'class': 'article__image'})
    ]

    # These elements are dropped from whatever was kept above.
    remove_tags = [
        dict(name=['embed', 'object']),
        dict(name='div', attrs={'class': ['note NotePortrait', 'note']}),
        dict(name='ul', attrs={'class': re.compile('article__share')}),
        dict(name='div', attrs={'class': 'slideshow__controls'}),
        dict(name='a', attrs={'role': 'button'}),
        dict(name='figure', attrs={'class': re.compile('video')})
    ]

    remove_attributes = ['width', 'height']

    def preprocess_html(self, soup):
        # Strip the onload handler and all inline styles.
        del soup.body['onload']
        for item in soup.findAll(style=True):
            del item['style']
        return soup
|
|
|
|
|
|
|
|
#6 |
|
Enthusiast
![]() Posts: 43
Karma: 10
Join Date: Oct 2012
Location: Belgium
Device: Promedia e-reader (Onyx C67ML) - Aldi2016 / Former: Sony PRS-T1&T2
|
|
|
|
|
|
|
#7 |
|
Member
![]() Posts: 17
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
|
Hi Kunvp,
looking at my changes to gva_be.recipe will probably not help you much in understanding how to work on other recipes. I removed some obsolete code, which makes the change look bigger than it actually was.
As far as I can see, all of the Belgian Dutch news sources have a valid table of contents. This means the feed addresses are still correct, but something is wrong with the extraction of the content. Modifying the keep_only_tags and remove_tags sections should be sufficient in this case.
For example, if you look at demorgen_be.recipe you will find the line:
Code:
keep_only_tags = [dict(name='div', attrs={'class':'art_box2'})]
Replacing it with the following line should be sufficient:
Code:
keep_only_tags = [dict(name='div', attrs={'class':'article__wrapper'})]
For an in-depth explanation of recipe programming, have a look at the calibre documentation: https://manual.calibre-ebook.com/news.html |
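To illustrate why a stale class name leaves the downloaded articles empty, here is a minimal standalone sketch. It is not calibre's own implementation: `ClassCollector`, `selector_matches` and `SAMPLE_HTML` are hypothetical names made up for this example, the HTML is an invented stand-in for a downloaded page, and only plain-string tag names and classes are handled (no regular expressions). It runs with plain Python 3, without calibre.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every (tag name, class name) pair found in a page."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        # A class attribute may hold several space-separated class names.
        classes = dict(attrs).get('class') or ''
        for cls in classes.split():
            self.seen.add((tag, cls))


def selector_matches(selector, seen):
    """Return True if a keep_only_tags-style dict matches a collected tag.

    Simplified on purpose: only string names and classes are supported.
    """
    names = selector['name']
    if isinstance(names, str):
        names = [names]
    classes = selector['attrs']['class']
    if isinstance(classes, str):
        classes = [classes]
    return any((n, c) in seen for n in names for c in classes)


# Invented page snippet; a real check would use the saved article HTML.
SAMPLE_HTML = '''
<html><body>
  <div class="nav">menu</div>
  <div class="article__wrapper"><p>Article text</p></div>
</body></html>
'''

collector = ClassCollector()
collector.feed(SAMPLE_HTML)

old_rule = dict(name='div', attrs={'class': 'art_box2'})
new_rule = dict(name='div', attrs={'class': 'article__wrapper'})

print('old rule matches:', selector_matches(old_rule, collector.seen))
print('new rule matches:', selector_matches(new_rule, collector.seen))
```

When the old rule matches nothing, there is no article body left to keep, which would explain a download with menus but no text. Printing `sorted(collector.seen)` for a saved article page is a quick way to spot candidate class names for keep_only_tags.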
|
|
|
|
|
#8 |
|
Enthusiast
![]() Posts: 43
Karma: 10
Join Date: Oct 2012
Location: Belgium
Device: Promedia e-reader (Onyx C67ML) - Aldi2016 / Former: Sony PRS-T1&T2
|
Thank you Aimylios, your explanation is very useful for starting to understand the issue.
Back at school, decades ago actually, I had to write scripts to convert layout code from PC to high-end systems. This looks rather similar. I'll give it a try, but don't expect it by tomorrow. :-) |
|
|
|
|
|
#9 | |
|
Member
![]() Posts: 11
Karma: 10
Join Date: Feb 2010
Location: Belgium
Device: Foxit eSlick > Amazon Kindle PW3
|
Quote:
Code:
publisher = 'Gazet van Antwerpen'
Code:
publisher = 'Mediahuis'
Using the example above, I've come up with a recipe for another newspaper from the same publisher, Het Nieuwsblad:
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

import re  # needed for the re.compile() calls in remove_tags

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1467571059(BasicNewsRecipe):
    title = 'Het Nieuwsblad'
    __author__ = 'Darko Miletic, Aimylios, oCkz7bJ_'
    description = 'Het Nieuwsblad is goed voor u.'
    publisher = 'Mediahuis'
    category = 'news, politics, Belgium'
    language = 'nl_BE'
    oldest_article = 2
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True
    cover_url = 'http://www.lottocyclingcup.be/lc15/dendermedia/images/details/foto/partners_36/nieuwsblad_1_20160210_1242138359.jpg'
    masthead_url = 'http://www.mediahuisconnect.be/uploads/media/5576fa0b83c38/nieuwsblad.svg'

    # Source: http://www.nieuwsblad.be/rss
    feeds = [
        # Nieuws
        ('Snelnieuws', 'http://feeds.nieuwsblad.be/nieuws/snelnieuws'),
        ('Binnenland', 'http://feeds.nieuwsblad.be/nieuws/binnenland'),
        ('Buitenland', 'http://feeds.nieuwsblad.be/nieuwsblad/buitenland'),
        # Economie
        ('Economie', 'http://feeds.nieuwsblad.be/economie/home'),
        ('Consument', 'http://feeds.nieuwsblad.be/economie/algemeen'),
        ('Bedrijven', 'http://feeds.nieuwsblad.be/economie/bedrijven'),
        ('Werk', 'http://feeds.nieuwsblad.be/economie/Werk'),
        ('Beurs', 'http://feeds.nieuwsblad.be/economie/beurs'),
        # Regio (replace the placeholder postcodes with your own)
        #('0123 Region1', 'http://www.nieuwsblad.be/rss.aspx?intro=1&section=postcode&postcode=0123'),
        #('3456 Region2', 'http://www.nieuwsblad.be/rss.aspx?intro=1&section=postcode&postcode=3456'),
        #('6789 Region3', 'http://www.nieuwsblad.be/rss.aspx?intro=1&section=postcode&postcode=6789'),
        # Sport
        ('Voetbal', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/voetbal'),
        ('Wielrennen', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/wielrennen'),
        ('Tennis', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/tennis'),
        ('Autosport', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/autosport'),
        ('Basketbal', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/basketbal'),
        ('Volleybal', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/volleybal'),
        ('Atletiek', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/atletiek'),
        # Extra
        ('Film', 'http://feeds.nieuwsblad.be/life/film'),
        ('Boek', 'http://feeds.nieuwsblad.be/life/boeken'),
        ('Muziek', 'http://feeds.nieuwsblad.be/life/muziek'),
        ('Podium', 'http://feeds.nieuwsblad.be/life/podium'),
        ('TV & Radio', 'http://feeds.nieuwsblad.be/life/tv'),
        # She
        ('BV & Co', 'http://feeds.nieuwsblad.be/life/bv'),
        ('Mode & Design', 'http://feeds.nieuwsblad.be/life/mode'),
        ('Culinair', 'http://feeds.nieuwsblad.be/life/culinair'),
        ('Gezondheid', 'http://feeds.nieuwsblad.be/life/gezondheid'),
        ('Reizen', 'http://feeds.nieuwsblad.be/life/reizen'),
        ('Dieren', 'http://feeds.nieuwsblad.be/life/dieren'),
        # Weblog
        ('Surfplank', 'http://nieuwsblad.typepad.com/surfplank/atom.xml'),
        ('Boeken', 'http://nieuwsblad.typepad.com/boeken/atom.xml'),
        ('Strips', 'http://nieuwsblad.typepad.com/strips/atom.xml'),
        ('DVD', 'http://nieuwsblad.typepad.com/dvd/atom.xml'),
        ('Dierendokter', 'http://nieuwsblad.typepad.com/dierendokter/atom.xml'),
        ('Zapdog', 'http://nieuwsblad.typepad.com/zapdog/atom.xml'),
    ]

    keep_only_tags = [
        dict(name='header', attrs={'class': 'article__header'}),
        dict(name='footer', attrs={'class': 'article__meta'}),
        dict(name='div', attrs={'class': ['article', 'article__body', 'slideshow__intro']}),
        dict(name='figure', attrs={'class': 'article__image'})
    ]

    remove_tags = [
        dict(name=['embed', 'object']),
        dict(name='div', attrs={'class': ['note NotePortrait', 'note']}),
        dict(name='ul', attrs={'class': re.compile('article__share')}),
        dict(name='div', attrs={'class': 'slideshow__controls'}),
        dict(name='a', attrs={'role': 'button'}),
        dict(name='figure', attrs={'class': re.compile('video')})
    ]

    remove_attributes = ['width', 'height']

    def preprocess_html(self, soup):
        # Strip the onload handler and all inline styles.
        del soup.body['onload']
        for item in soup.findAll(style=True):
            del item['style']
        return soup
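The commented-out regional feeds in the recipe all follow one URL pattern, so the entries could be generated from a list of postcodes instead of being typed by hand. A minimal sketch, assuming the `rss.aspx?intro=1&section=postcode&postcode=...` pattern from those lines is still served by the site (not verified here); `regional_feed`, `BASE_URL`, `regions` and the postcodes are placeholders made up for illustration:

```python
from urllib.parse import urlencode

# Endpoint taken from the commented-out regional feeds; assumed, not verified.
BASE_URL = 'http://www.nieuwsblad.be/rss.aspx'


def regional_feed(label, postcode):
    """Build one (title, url) feed tuple for a single postcode."""
    query = urlencode([
        ('intro', 1),
        ('section', 'postcode'),
        ('postcode', postcode),
    ])
    return ('{} {}'.format(postcode, label), '{}?{}'.format(BASE_URL, query))


# Placeholder regions; replace them with real postcodes and names.
regions = [('Region1', '0123'), ('Region2', '3456')]
regional_feeds = [regional_feed(label, code) for label, code in regions]

for title, url in regional_feeds:
    print(title, url)
```

The resulting tuples have the same shape as the other entries, so they can be added to the `feeds` list of the recipe.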
Last edited by oCkz7bJ_; 07-31-2016 at 09:06 AM. Reason: Alignment |
|
|
|
|
|
|
#10 |
|
Member
![]() Posts: 11
Karma: 10
Join Date: Feb 2010
Location: Belgium
Device: Foxit eSlick > Amazon Kindle PW3
|
Here's a fairly simple one for DataNews:
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1468055030(BasicNewsRecipe):
    title = 'DataNews'
    __author__ = 'oCkz7bJ_'
    description = 'Technology / Best Practice / Business'
    publisher = 'Roularta Media Group'
    category = 'news, information technology, Belgium'
    language = 'nl_BE'
    oldest_article = 2
    max_articles_per_feed = 100
    # Let calibre extract the article text automatically.
    auto_cleanup = True
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True
    cover_url = 'http://datablend.be/wp-content/uploads/2014/01/Data_News_logo-short.jpg'
    masthead_url = 'http://datanews.knack.be/images/svg/logos/logo_Site-DataNews-NL.svg'

    # Source: http://datanews.knack.be/rss/
    feeds = [
        ('Technology', 'http://datanews.knack.be/ict/feed.rss'),
        ('Opinie', 'http://datanews.knack.be/ict/opinie/feed.rss'),
        ('Gadgets', 'http://datanews.knack.be/ict/gadgets/feed.rss'),
        ('Foto', 'http://datanews.knack.be/ict/foto/feed.rss'),
        ('Nieuws', 'http://datanews.knack.be/ict/nieuws/feed.rss'),
        ('Reviews', 'http://datanews.knack.be/ict/reviews/feed.rss'),
        ('Startups', 'http://datanews.knack.be/ict/start-ups/feed.rss'),
    ]
Last edited by oCkz7bJ_; 07-28-2016 at 08:37 AM. Reason: added link to datanews |
|
|
|
|
|
#11 | |
|
Enthusiast
![]() Posts: 43
Karma: 10
Join Date: Oct 2012
Location: Belgium
Device: Promedia e-reader (Onyx C67ML) - Aldi2016 / Former: Sony PRS-T1&T2
|
Dear oCkz7bJ_,
Thank you so much for having a look at this. I have too little knowledge to write clean scripts.
To answer your question directly: I wouldn't use 'Mediahuis' as the title, as users in Belgium don't recognise it as a news source. If you don't intend to change the title of the source, there is no problem, because it is correct that the publisher is Mediahuis.
Mediahuis (founded in 2013) is a joint venture of two publishers of newspapers and online news sites (according to Wikipedia). They run the following titles:
1. Het Nieuwsblad / De Gentenaar
2. Gazet van Antwerpen
3. Het Belang van Limburg
These titles have different regional content. All of them have a strong focus on regional content in the paper edition, and this is reflected on their websites.
De Standaard is the so-called quality newspaper of the group. It uses a lot of content from Het Nieuwsblad, but has more editorials, opinion pieces, etc.
All Belgian news sources have decreased the length and number of FREE articles.
To summarise:
- Mediahuis will not be recognised by users. This is not a problem as long as it does not influence the title visible to users.
- It can be expected that the scripts for the different titles will be fairly identical.
- The regional content should be different.
Does all this answer your question? I have the feeling I made it all too complex. :-p
Many greetings,
Koen
Last edited by Kunvp; 07-29-2016 at 05:40 AM. Reason: I didn't understand the question. :-p |
|
|
|
|
|
|
#12 | |||
|
Member
![]() Posts: 11
Karma: 10
Join Date: Feb 2010
Location: Belgium
Device: Foxit eSlick > Amazon Kindle PW3
|
Quote:
Code:
title = 'Gazet van Antwerpen'
publisher = 'Gazet van Antwerpen'
Code:
title = 'Gazet van Antwerpen'
publisher = 'Mediahuis'
I suspect even a digital subscription will not provide a full content RSS feed, it's either app (iOS & Android only) or website. I'll try asking around "via via". I'd consider a subscription if they would publish their newspaper in proper ebook format. (None of the Belgian publishers do AFAIK). |
|||
|
|
|
|
|
#13 |
|
Member
![]() Posts: 11
Karma: 10
Join Date: Feb 2010
Location: Belgium
Device: Foxit eSlick > Amazon Kindle PW3
|
Here's a recipe that seems to work for De Standaard:
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

import re  # needed for the re.compile() calls in remove_tags

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1467571059(BasicNewsRecipe):
    title = 'De Standaard'
    __author__ = 'Darko Miletic, Aimylios, oCkz7bJ_'
    description = 'De Standaard'
    publisher = 'Mediahuis'
    category = 'news, politics, Belgium'
    language = 'nl_BE'
    oldest_article = 2
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True
    cover_url = 'http://www.standaard.be/extra/assets/extra/dslive/headers/ds-black.svg'
    masthead_url = 'http://tonysweb.be/m/img/tijdschriften/de_standaard.svg'

    # Source: http://www.standaard.be/rssfeeds
    feeds = [
        # Nieuws
        ('Binnenland', 'http://www.standaard.be/rss/section/1f2838d4-99ea-49f0-9102-138784c7ea7c'),
        ('Buitenland', 'http://www.standaard.be/rss/section/e70ccf13-a2f0-42b0-8bd3-e32d424a0aa0'),
        ('Cultuur', 'http://www.standaard.be/rss/section/ab8d3fd8-bf2f-487a-818b-9ea546e9a859'),
        ('Media', 'http://www.standaard.be/rss/section/eb1a6433-ca3f-4a3b-ab48-a81a5fb8f6e2'),
        ('Economie', 'http://www.standaard.be/rss/section/451c8e1e-f9e4-450e-aa1f-341eab6742cc'),
        ('Sport', 'http://www.standaard.be/rss/section/8f693cea-dba8-46e4-8575-807d1dc2bcb7'),
        ('Beroemd en Bizar', 'http://www.standaard.be/rss/section/113a9a78-f65a-47a8-bd1c-b24483321d0f'),
        # Standaard.biz
        ('Overzicht', 'http://www.standaard.be/rss/section/a30afc42-3737-4301-8f8a-5b6833855457'),
        ('Economie', 'http://www.standaard.be/rss/section/212b8b54-bd91-4c8b-942c-8029e8797d36'),
        ('Bedrijven', 'http://www.standaard.be/rss/section/6aa8d4fa-4b9a-40d5-aa8f-87ac72472f27'),
        ('Consument', 'http://www.standaard.be/rss/section/46025691-2ec4-4a06-b6d7-9773686a24a7'),
        ('Beurs', 'http://www.standaard.be/rss/section/74cef9d1-3b28-4b90-943a-ce685bf6ed6e'),
        ('Marketing & Media', 'http://www.standaard.be/rss/section/9bdf4a14-f8bf-4439-aaf1-344181f73e73'),
        ('Mobilia', 'http://www.standaard.be/rss/section/270b7f8f-dd73-44cb-b622-9f7200a439a7'),
        # Lifestyle
        ('Mode', 'http://www.standaard.be/rss/section/3a4b39a1-e58f-42e4-8ae9-a0f90f97f27f'),
        ('Beauty', 'http://www.standaard.be/rss/section/51dd6a40-e297-409c-af25-9f0301159a1c'),
        ('Culinair', 'http://www.standaard.be/rss/section/ec1dbffa-a00b-48e6-96f0-00d215f90744'),
        ('Reizen', 'http://www.standaard.be/rss/section/eed96e23-ed90-4818-83ab-adabf8caf0f4'),
        ('Design & Wonen', 'http://www.standaard.be/rss/section/f4dd4e8d-6cb1-4eef-abc2-06b0e3d72de4'),
        ('Gezondheid & Psycho', 'http://www.standaard.be/rss/section/a166bb48-b6b4-4c1a-beb3-9f0301160b75'),
        ('Glamour', 'http://www.standaard.be/rss/section/06b5429e-beb1-4e76-909c-9f0301162a9c'),
        ('Lifestyleblog', 'http://www.standaard.be/rss/section/246d27cb-ce7b-4245-bad4-a09f0119b450'),
        # Weblogs
        ('Autoblog', 'http://www.standaard.be/rss/tag/autoblog'),
        ('Beursexperts', 'http://www.standaard.be/rss/tag/beursexperts'),
        ('En nu even elders', 'http://www.standaard.be/rss/tag/blog-en-nu-even-elders'),
        ('Marketingblog', 'http://www.standaard.be/rss/tag/marketingblog'),
        ('TV-blog', 'http://www.standaard.be/rss/tag/tv-blog'),
        # Interactie
        ('Opinies', 'http://feeds.feedburner.com/dso-meningen-opinie')
    ]

    keep_only_tags = [
        dict(name='header', attrs={'class': 'article__header'}),
        dict(name='footer', attrs={'class': 'article__meta'}),
        #dict(name='div', attrs={'class': ['article', 'article__body', 'slideshow__intro']}),
        dict(name='article', attrs={'class': 'article-full'}),
        dict(name='figure', attrs={'class': 'article__image'})
    ]

    remove_tags = [
        dict(name=['embed', 'object']),
        dict(name='div', attrs={'class': ['note NotePortrait', 'note']}),
        dict(name='ul', attrs={'class': re.compile('article__share')}),
        dict(name='div', attrs={'class': 'slideshow__controls'}),
        dict(name='a', attrs={'role': 'button'}),
        dict(name='figure', attrs={'class': re.compile('video')})
    ]

    remove_attributes = ['width', 'height']

    def preprocess_html(self, soup):
        # Strip the onload handler and all inline styles.
        del soup.body['onload']
        for item in soup.findAll(style=True):
            del item['style']
        return soup
Code:
#dict(name='div', attrs={'class':['article', 'article__body', 'slideshow__intro']}),
dict(name='article', attrs={'class':'article-full'}),
|
|
|
|
|
|
#14 | |
|
Enthusiast
![]() Posts: 43
Karma: 10
Join Date: Oct 2012
Location: Belgium
Device: Promedia e-reader (Onyx C67ML) - Aldi2016 / Former: Sony PRS-T1&T2
|
Quote:
Koen |
|
|
|
|
|
|
#15 |
|
Junior Member
![]() Posts: 1
Karma: 10
Join Date: Jul 2017
Device: Kobo H2O
|
I just bumped into this thread through a Google search. Thanks so much for writing this code and making e-readers a tad more worthwhile. I have to say I haven't tried it yet, but I have full digital access to De Standaard and was wondering if there's any way I can download that day's newspaper to read it on my e-reader? Probably not without a serious overhaul?
|
|
|
|