Old 07-08-2016, 07:42 AM   #1
Kunvp
Enthusiast
Posts: 43
Karma: 10
Join Date: Oct 2012
Location: Belgium
Device: Promedia e-reader (Onyx C67ML) - Aldi2016 / Former: Sony PRS-T1&T2
Belgian-Dutch recipes Broken (for some time)

None of the built-in Belgian (Dutch) recipes work any more.

As I am a newbie when it comes to recipes,
reporting that they do not work is my only contribution. Sorry.

What I get is a menu and a paragraph menu, but no articles.
It happens with all Belgian news sources. I haven't checked the Dutch sources yet.


| Volgende | Paragraafmenu | Hoofdmenu |
This article was downloaded by calibre from
http://www.gva.be/cnt/dmf20160708_02...haven-zaventem
| Paragraafmenu | Hoofdmenu |
Old 07-08-2016, 08:27 AM   #2
kovidgoyal
creator of calibre
Posts: 43,842
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I don't maintain recipes for languages I cannot read, as that makes it much harder to understand the website being scraped. So you will have to hope that someone who both reads the language and knows how to code is willing to help.
Old 07-08-2016, 09:03 AM   #3
Kunvp
Thanks for the quick reply

Quote:
Originally Posted by kovidgoyal
I don't maintain recipes for languages I cannot read, as that makes it much harder to understand the website being scraped. So you will have to hope that someone who both reads the language and knows how to code is willing to help.
Thank you for the quick reply.
If nobody comes forward soon, I would suggest deleting the recipes.
If someone does come forward, I can help, as the Belgian newspaper market has changed considerably.

Keep up the good work.
Old 07-08-2016, 01:45 PM   #4
DrChiper
Bookish
Posts: 907
Karma: 1803094
Join Date: Jun 2011
Device: PC, t1, t2, t3, aura 2 v1, clara HD, Libra 2, Nxtpaper 11
It would help to mention which ones did not work for you ...
I just tried some Belgian and Dutch recipes and they all work.
Yes, they are slow loading, so be patient!
[Attached image: calibre.jpg]
Old 07-08-2016, 04:17 PM   #5
Aimylios
Member
Posts: 16
Karma: 10
Join Date: Apr 2016
Device: Tolino Vision 3HD
Well, the downloaded files do not contain any articles, so I would indeed say they are broken. I just had a quick look at the GVA recipe mentioned in the first post. It was easy to fix, but I'll leave the other recipes to someone else.

Update for gva_be.recipe:
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

__license__   = 'GPL v3'
__copyright__ = '2009, Darko Miletic <darko.miletic at gmail.com>'

'''
www.gva.be
'''

import re
from calibre.web.feeds.news import BasicNewsRecipe

class GazetvanAntwerpen(BasicNewsRecipe):
    title                 = 'Gazet van Antwerpen'
    __author__            = 'Darko Miletic'
    description           = 'News from Belgium in Dutch'
    publisher             = 'Gazet van Antwerpen'
    category              = 'news, politics, Belgium'
    language              = 'nl_BE'

    oldest_article        = 2
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    remove_javascript     = True

    masthead_url = 'http://2.gvacdn.be/extra/assets/img/gazet-van-antwerpen-red.svg'

    feeds = [
        ('Stad & Regio', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/stadenregio'),
        ('Economie', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/economie'),
        ('Binnenland', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/binnenland'),
        ('Buitenland', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/buitenland'),
        ('Media & Cultuur', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/mediaencultuur'),
        ('Sport', 'http://www.gva.be/syndicationservices/artfeedservice.svc/rss/mostrecent/sport')
    ]

    keep_only_tags = [
        dict(name='header', attrs={'class':'article__header'}),
        dict(name='footer', attrs={'class':'article__meta'}),
        dict(name='div', attrs={'class':['article', 'article__body', 'slideshow__intro']}),
        dict(name='figure', attrs={'class':'article__image'})
    ]

    remove_tags = [
        dict(name=['embed', 'object']),
        dict(name='div', attrs={'class':['note NotePortrait', 'note']}),
        dict(name='ul', attrs={'class':re.compile('article__share')}),
        dict(name='div', attrs={'class':'slideshow__controls'}),
        dict(name='a', attrs={'role':'button'}),
        dict(name='figure', attrs={'class':re.compile('video')})
    ]

    remove_attributes = ['width', 'height']

    def preprocess_html(self, soup):
        del soup.body['onload']
        for item in soup.findAll(style=True):
            del item['style']
        return soup
Old 07-09-2016, 06:37 PM   #6
Kunvp
Quote:
Originally Posted by Aimylios
It was easy to fix,
Thank you Aimylios.

I'll try to understand what you have done, but a hint would be welcome too.
If I manage to understand it, I'll have a look at the others.
Old 07-10-2016, 04:35 AM   #7
Aimylios
Hi Kunvp,

Looking at my changes to the gva_be.recipe will probably not help you very much in understanding how to work on other recipes. I removed some obsolete code, which makes the change look bigger than it actually was.

As far as I can see, all of the Belgian Dutch news sources have a valid table of contents. This means the feed addresses are still correct, but there's something wrong with the extraction of the content. Modifying the keep_only_tags and remove_tags sections should be sufficient in this case.
For example, if you look at the demorgen_be.recipe you will find the line:
Code:
    keep_only_tags = [dict(name='div' , attrs={'class':'art_box2'})]
which means that calibre expects the content to be wrapped in an HTML tag like <div class="art_box2">...</div>. But if you look at the source code of an arbitrary article (picture attached), you will see that the relevant tag is <div class="article__wrapper">...</div>. By changing the line above to:
Code:
    keep_only_tags = [dict(name='div' , attrs={'class':'article__wrapper'})]
you should get a working recipe (didn't try it myself).
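
To find out which class a site currently uses, one could list the tag and class pairs that occur in a saved copy of an article page. The following is only a rough sketch using the Python 3 standard library, run outside calibre. The names ClassInventory and candidate_wrappers and the sample HTML (including the article__intro class) are made up for illustration; none of this is part of calibre.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassInventory(HTMLParser):
    """Count (tag, class) pairs seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # A class attribute may hold several space-separated names.
        classes = dict(attrs).get('class') or ''
        for cls in classes.split():
            self.counts[(tag, cls)] += 1


def candidate_wrappers(html, needle='article'):
    """Return sorted (tag, class) pairs whose class name contains `needle`."""
    parser = ClassInventory()
    parser.feed(html)
    return sorted(pair for pair in parser.counts if needle in pair[1])


# Hypothetical, heavily simplified article page.
sample = '''
<html><body>
  <div class="header nav">menu</div>
  <div class="article__wrapper"><p class="article__intro">Text</p></div>
</body></html>
'''

print(candidate_wrappers(sample))
# prints [('div', 'article__wrapper'), ('p', 'article__intro')]
```

The pairs it prints are only candidates for keep_only_tags; you still have to check in the page source which one actually wraps the article text.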

For an in-depth explanation of recipe programming just have a look at the Calibre documentation:
https://manual.calibre-ebook.com/news.html
[Attached image: example_demorgen.jpg]
Old 07-11-2016, 05:28 AM   #8
Kunvp
I'll give it a try, but don't expect it by tomorrow.

Thank you Aimylios, your explanation is very useful for starting to understand the issue.
Back at school, years (decades, actually) ago, I had to write scripts to convert layout code from PC to high-end systems. This actually looks kind of similar.

I'll give it a try, but don't expect it by tomorrow.
:-)
Old 07-28-2016, 05:11 AM   #9
oCkz7bJ_
Member
Posts: 11
Karma: 10
Join Date: Feb 2010
Location: Belgium
Device: Foxit eSlick > Amazon Kindle PW3
Quote:
Originally Posted by Aimylios
Update for gva_be.recipe
May I suggest changing
Code:
    publisher             = 'Gazet van Antwerpen'
to
Code:
    publisher             = 'Mediahuis'
?

Using the example above, I've come up with a recipe for another newspaper from the same publisher, Het Nieuwsblad:
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
import re  # needed for the re.compile() calls in remove_tags
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1467571059(BasicNewsRecipe):
    title                 = 'Het Nieuwsblad'
    __author__            = 'Darko Miletic, Aimylios, oCkz7bJ_'
    description           = 'Het Nieuwsblad is goed voor u.'
    publisher             = 'Mediahuis'
    category              = 'news, politics, Belgium'
    language              = 'nl_BE'

    oldest_article        = 2
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    remove_javascript     = True
    
    cover_url    = 'http://www.lottocyclingcup.be/lc15/dendermedia/images/details/foto/partners_36/nieuwsblad_1_20160210_1242138359.jpg'
    masthead_url = 'http://www.mediahuisconnect.be/uploads/media/5576fa0b83c38/nieuwsblad.svg'

    #Source: http://www.nieuwsblad.be/rss
    feeds          = [
        # Nieuws
        ('Snelnieuws', 'http://feeds.nieuwsblad.be/nieuws/snelnieuws'),
        ('Binnenland', 'http://feeds.nieuwsblad.be/nieuws/binnenland'),
        ('Buitenland', 'http://feeds.nieuwsblad.be/nieuwsblad/buitenland'),
        # Economie
        ('Economie', 'http://feeds.nieuwsblad.be/economie/home'),
        ('Consument', 'http://feeds.nieuwsblad.be/economie/algemeen'),
        ('Bedrijven', 'http://feeds.nieuwsblad.be/economie/bedrijven'),
        ('Werk', 'http://feeds.nieuwsblad.be/economie/Werk'),
        ('Beurs', 'http://feeds.nieuwsblad.be/economie/beurs'),
        # Regio
        #('0123 Region1', 'http://www.nieuwsblad.be/rss.aspx?intro=1&section=postcode&postcode=0123'),
        #('3456 Region2', 'http://www.nieuwsblad.be/rss.aspx?intro=1&section=postcode&postcode=3456'),
        #('6789 Region3', 'http://www.nieuwsblad.be/rss.aspx?intro=1&section=postcode&postcode=6789'),
        # Sport
        ('Voetbal', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/voetbal'),
        ('Wielrennen', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/wielrennen'),
        ('Tennis', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/tennis'),
        ('Autosport', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/autosport'),
        ('Basketbal', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/basketbal'),
        ('Volleybal', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/volleybal'),
        ('Atletiek', 'http://feeds.nieuwsblad.be/nieuwsblad/sport/atletiek'),
        # Extra
        ('Film', 'http://feeds.nieuwsblad.be/life/film'),
        ('Boek', 'http://feeds.nieuwsblad.be/life/boeken'),
        ('Muziek', 'http://feeds.nieuwsblad.be/life/muziek'),
        ('Podium', 'http://feeds.nieuwsblad.be/life/podium'),
        ('TV & Radio', 'http://feeds.nieuwsblad.be/life/tv'),
        # She
        ('BV & Co', 'http://feeds.nieuwsblad.be/life/bv'),
        ('Mode & Design', 'http://feeds.nieuwsblad.be/life/mode'),
        ('Culinair', 'http://feeds.nieuwsblad.be/life/culinair'),
        ('Gezondheid', 'http://feeds.nieuwsblad.be/life/gezondheid'),
        ('Reizen', 'http://feeds.nieuwsblad.be/life/reizen'),
        ('Dieren', 'http://feeds.nieuwsblad.be/life/dieren'),
        # Weblog
        ('Surfplank', 'http://nieuwsblad.typepad.com/surfplank/atom.xml'),
        ('Boeken', 'http://nieuwsblad.typepad.com/boeken/atom.xml'),
        ('Strips', 'http://nieuwsblad.typepad.com/strips/atom.xml'),
        ('DVD', 'http://nieuwsblad.typepad.com/dvd/atom.xml'),
        ('Dierendoktor', 'http://nieuwsblad.typepad.com/dierendokter/atom.xml'),
        ('Zapdog', 'http://nieuwsblad.typepad.com/zapdog/atom.xml'),
    ]
    
    keep_only_tags = [
        dict(name='header', attrs={'class':'article__header'}),
        dict(name='footer', attrs={'class':'article__meta'}),
        dict(name='div', attrs={'class':['article', 'article__body', 'slideshow__intro']}),
        dict(name='figure', attrs={'class':'article__image'})
    ]

    remove_tags = [
        dict(name=['embed', 'object']),
        dict(name='div', attrs={'class':['note NotePortrait', 'note']}),
        dict(name='ul', attrs={'class':re.compile('article__share')}),
        dict(name='div', attrs={'class':'slideshow__controls'}),
        dict(name='a', attrs={'role':'button'}),
        dict(name='figure', attrs={'class':re.compile('video')})
    ]

    remove_attributes = ['width', 'height']

    def preprocess_html(self, soup):
        del soup.body['onload']
        for item in soup.findAll(style=True):
            del item['style']
        return soup
Note: under "# Regio", only one of the three postal codes I'm interested in seems to generate any content, even though all three RSS feeds exist. I still need to figure out a solution for that. (The postal codes in the recipe above are fake placeholders.)
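
One thing that might be worth checking (an untested guess, not a confirmed cause) is whether the items in the quiet regional feeds are simply older than the oldest_article = 2 days set in the recipe. Below is a small standalone Python 3 sketch that counts how many items of an RSS document fall inside such a window. The function name recent_items and the sample feed are made up for illustration, and this is not how calibre itself filters articles.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET


def recent_items(rss_text, oldest_article_days, now=None):
    """Return titles of RSS items whose pubDate is within the given number of days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=oldest_article_days)
    kept = []
    for item in ET.fromstring(rss_text).iter('item'):
        pub = item.findtext('pubDate')
        if pub is None:
            continue  # this sketch simply skips undated items
        when = parsedate_to_datetime(pub)
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)  # assume UTC if no offset given
        if when >= cutoff:
            kept.append(item.findtext('title'))
    return kept


# Hypothetical feed content with one fresh and one stale item.
sample_rss = '''<rss><channel>
<item><title>new</title><pubDate>Thu, 28 Jul 2016 08:00:00 +0200</pubDate></item>
<item><title>old</title><pubDate>Mon, 04 Jul 2016 08:00:00 +0200</pubDate></item>
</channel></rss>'''

fixed_now = datetime(2016, 7, 28, 12, 0, tzinfo=timezone.utc)
print(recent_items(sample_rss, 2, now=fixed_now))
# prints ['new']
```

If a regional feed only returns an empty list here, raising oldest_article for a test run would be one way to see whether age is the issue.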

Last edited by oCkz7bJ_; 07-31-2016 at 08:06 AM. Reason: Alignment
Old 07-28-2016, 07:36 AM   #10
oCkz7bJ_
Here's a fairly simple one for DataNews:
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1468055030(BasicNewsRecipe):
    title                 = 'DataNews'
    __author__            = 'oCkz7bJ_'
    description           = 'Technology / Best Practice / Business'
    publisher             = 'Roularta Media Group'   
    category              = 'news, information technology, Belgium'
    language              = 'nl_BE'

    oldest_article        = 2
    max_articles_per_feed = 100
    auto_cleanup          = True
    no_stylesheets        = True
    use_embedded_content  = False
    remove_javascript     = True
    
    cover_url    = 'http://datablend.be/wp-content/uploads/2014/01/Data_News_logo-short.jpg'
    masthead_url = 'http://datanews.knack.be/images/svg/logos/logo_Site-DataNews-NL.svg'

    # Source: http://datanews.knack.be/rss/
    feeds          = [
        ('Technology', 'http://datanews.knack.be/ict/feed.rss'),
        ('Opinie', 'http://datanews.knack.be/ict/opinie/feed.rss'),
        ('Gadgets', 'http://datanews.knack.be/ict/gadgets/feed.rss'),
        ('Foto', 'http://datanews.knack.be/ict/foto/feed.rss'),
        ('Nieuws', 'http://datanews.knack.be/ict/nieuws/feed.rss'),
        ('Reviews', 'http://datanews.knack.be/ict/reviews/feed.rss'),
        ('Startups', 'http://datanews.knack.be/ict/start-ups/feed.rss'),
    ]

Last edited by oCkz7bJ_; 07-28-2016 at 07:37 AM. Reason: added link to datanews
Old 07-29-2016, 04:37 AM   #11
Kunvp
Mediahuis - some words on

Quote:
Originally Posted by oCkz7bJ_
May I suggest changing
Code:
    publisher             = 'Gazet van Antwerpen'
to
Code:
    publisher             = 'Mediahuis'
?

Dear oCkz7bJ_,

Thank you so much for having a look at this. I know too little to write clean scripts.

To answer your question directly: I wouldn't use 'Mediahuis' as the title, as users in Belgium don't recognise it as a news source.
That is, unless you don't intend to change the title of the source, because it is correct that the publisher is Mediahuis.
Mediahuis (founded 2013) is a joint venture of two publishers (newspapers and online news sites). (Wikipedia)

They run the following sites:
1. Het Nieuwsblad / De Gentenaar
2. Gazet van Antwerpen
3. Het Belang van Limburg

The different titles have different regional content. All these titles have a strong focus on regional content (in the paper edition).
This is reflected on their websites.

De Standaard is the so-called quality newspaper of the group.
It uses a lot of content from the above-mentioned Het Nieuwsblad, but it has more editorials, opinion pieces, etc.

All Belgian news sources have decreased the length and number of FREE articles.

To summarise:
- Mediahuis will not be recognised by users. That is not a problem as long as it does not influence the title visible to users.
- It can be expected that the scripts for the different titles will be fairly identical.
- The regional content should be different.

Does all this answer your question?
I have the feeling I made it all too complex. :-p

Many greetings,
Koen

Last edited by Kunvp; 07-29-2016 at 04:40 AM. Reason: I did'nt understand the question. :-p
Old 07-29-2016, 05:51 AM   #12
oCkz7bJ_
Quote:
Originally Posted by Kunvp
Unless you don't intend to change the title of the source, because it is correct that the publisher is Mediahuis
That is exactly what I propose:
Code:
    title                 = 'Gazet van Antwerpen'
    publisher             = 'Gazet van Antwerpen'
should be
Code:
    title                 = 'Gazet van Antwerpen'
    publisher             = 'Mediahuis'
Quote:
Originally Posted by Kunvp
Mediahuis (founded 2013) is a joint venture of two publishers (newspapers and online news sites). (Wikipedia)
I'm from the same country as you are ;-)

Quote:
Originally Posted by Kunvp
De Standaard is the so-called quality newspaper of the group.
It uses a lot of content from the above-mentioned Het Nieuwsblad, but it has more editorials, opinion pieces, etc.

All Belgian news sources have decreased the length and number of FREE articles.
I'll work on a recipe for "De Standaard". The backend for all of these publications is the same, so it's fairly easy. Give me a couple of days; I prefer to test it myself first.

I suspect even a digital subscription will not provide a full-content RSS feed; it's either the app (iOS and Android only) or the website. I'll try asking around "via via". I'd consider a subscription if they published their newspaper in a proper ebook format. (None of the Belgian publishers do, AFAIK.)
Old 08-04-2016, 02:44 PM   #13
oCkz7bJ_
Here's a recipe that seems to work for De Standaard:
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
import re  # needed for the re.compile() calls in remove_tags
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1467571059(BasicNewsRecipe):
    title                 = 'De Standaard'
    __author__            = 'Darko Miletic, Aimylios, oCkz7bJ_'
    description           = 'De Standaard'
    publisher             = 'Mediahuis'
    category              = 'news, politics, Belgium'
    language              = 'nl_BE'

    oldest_article        = 2
    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    remove_javascript     = True
    
    cover_url    = 'http://www.standaard.be/extra/assets/extra/dslive/headers/ds-black.svg'
    masthead_url = 'http://tonysweb.be/m/img/tijdschriften/de_standaard.svg'

    #Source: http://www.standaard.be/rssfeeds
    feeds          = [
        # Nieuws
        ('Binnenland', 'http://www.standaard.be/rss/section/1f2838d4-99ea-49f0-9102-138784c7ea7c'),
        ('Buitenland', 'http://www.standaard.be/rss/section/e70ccf13-a2f0-42b0-8bd3-e32d424a0aa0'),
        ('Cultuur', 'http://www.standaard.be/rss/section/ab8d3fd8-bf2f-487a-818b-9ea546e9a859'),
        ('Media', 'http://www.standaard.be/rss/section/eb1a6433-ca3f-4a3b-ab48-a81a5fb8f6e2'),
        ('Economie', 'http://www.standaard.be/rss/section/451c8e1e-f9e4-450e-aa1f-341eab6742cc'),
        ('Sport', 'http://www.standaard.be/rss/section/8f693cea-dba8-46e4-8575-807d1dc2bcb7'),
        ('Beroemd en Bizar', 'http://www.standaard.be/rss/section/113a9a78-f65a-47a8-bd1c-b24483321d0f'),
        # Standaard.biz
        ('Overzicht', 'http://www.standaard.be/rss/section/a30afc42-3737-4301-8f8a-5b6833855457'),
        ('Economie', 'http://www.standaard.be/rss/section/212b8b54-bd91-4c8b-942c-8029e8797d36'),
        ('Bedrijven', 'http://www.standaard.be/rss/section/6aa8d4fa-4b9a-40d5-aa8f-87ac72472f27'),
        ('Consument', 'http://www.standaard.be/rss/section/46025691-2ec4-4a06-b6d7-9773686a24a7'),
        ('Beurs', 'http://www.standaard.be/rss/section/74cef9d1-3b28-4b90-943a-ce685bf6ed6e'),
        ('Marketing & Media', 'http://www.standaard.be/rss/section/9bdf4a14-f8bf-4439-aaf1-344181f73e73'),
        ('Mobilia', 'http://www.standaard.be/rss/section/270b7f8f-dd73-44cb-b622-9f7200a439a7'),
        # Lifestyle
        ('Mode', 'http://www.standaard.be/rss/section/3a4b39a1-e58f-42e4-8ae9-a0f90f97f27f'),
        ('Beauty', 'http://www.standaard.be/rss/section/51dd6a40-e297-409c-af25-9f0301159a1c'),
        ('Culinair', 'http://www.standaard.be/rss/section/ec1dbffa-a00b-48e6-96f0-00d215f90744'),
        ('Reizen', 'http://www.standaard.be/rss/section/eed96e23-ed90-4818-83ab-adabf8caf0f4'),
        ('Design & Wonen', 'http://www.standaard.be/rss/section/f4dd4e8d-6cb1-4eef-abc2-06b0e3d72de4'),
        ('Gezondheid & Psycho', 'http://www.standaard.be/rss/section/a166bb48-b6b4-4c1a-beb3-9f0301160b75'),
        ('Glamour', 'http://www.standaard.be/rss/section/06b5429e-beb1-4e76-909c-9f0301162a9c'),
        ('Lifestyleblog', 'http://www.standaard.be/rss/section/246d27cb-ce7b-4245-bad4-a09f0119b450'),
        # Weblogs
        ('Autoblog', 'http://www.standaard.be/rss/tag/autoblog'),
        ('Beursexperts', 'http://www.standaard.be/rss/tag/beursexperts'),
        ('En nu even elders', 'http://www.standaard.be/rss/tag/blog-en-nu-even-elders'),
        ('Marketingblog', 'http://www.standaard.be/rss/tag/marketingblog'),
        ('TV-blog', 'http://www.standaard.be/rss/tag/tv-blog'),
        # Interactie
        ('Opinies', 'http://feeds.feedburner.com/dso-meningen-opinie')
    ]
    
    keep_only_tags = [
        dict(name='header', attrs={'class':'article__header'}),
        dict(name='footer', attrs={'class':'article__meta'}),
        #dict(name='div', attrs={'class':['article', 'article__body', 'slideshow__intro']}),
        dict(name='article', attrs={'class':'article-full'}),
        dict(name='figure', attrs={'class':'article__image'})
    ]

    remove_tags = [
        dict(name=['embed', 'object']),
        dict(name='div', attrs={'class':['note NotePortrait', 'note']}),
        dict(name='ul', attrs={'class':re.compile('article__share')}),
        dict(name='div', attrs={'class':'slideshow__controls'}),
        dict(name='a', attrs={'role':'button'}),
        dict(name='figure', attrs={'class':re.compile('video')})
    ]

    remove_attributes = ['width', 'height']

    def preprocess_html(self, soup):
        del soup.body['onload']
        for item in soup.findAll(style=True):
            del item['style']
        return soup
De Standaard seems to have a slightly different structure for its webpages. I had to make a small modification to keep_only_tags:
Code:
        #dict(name='div', attrs={'class':['article', 'article__body', 'slideshow__intro']}),
        dict(name='article', attrs={'class':'article-full'}),
Old 08-05-2016, 08:47 AM   #14
Kunvp
Quote:
Originally Posted by oCkz7bJ_
Here's a recipe that seems to work for De Standaard:
Thank you. I'll give it a try.
Koen
Old 07-21-2017, 01:31 PM   #15
dldrmsmn
Junior Member
Posts: 1
Karma: 10
Join Date: Jul 2017
Device: Kobo H2O
I just bumped into this thread through a Google search. Thanks so much for writing this code and making e-readers a tad more worthwhile. I have to say I haven't tried it yet, but I have full digital access to De Standaard and was wondering if there's any way I can download that day's newspaper to read it on my e-reader? Probably not without a serious overhaul?