#1 |
Junior Member
Posts: 3
Karma: 10
Join Date: Dec 2024
Device: elipsa
|
recipe request for demorgen.be
I'm rather new to Calibre and e-readers. I am trying to download the news from https://www.demorgen.be with Calibre, but I only get the titles and not the actual content. I looked at the source code of the built-in recipe, which I believe is this link.
Most of the articles are behind a paywall, but it's a "soft" paywall: if you disable JavaScript or use a browser's reader mode, you can read all articles. Can anyone help with a new recipe? I am a bit technical, but programming/Python is unfortunately not one of my strengths.
#2 |
Junior Member
Posts: 3
Karma: 10
Join Date: Dec 2024
Device: elipsa
|
OK, so I figured out how to do it. It's not a perfect recipe, but at least it shows all the content again. I'd like to have it refined though.
Code:

    #!/usr/bin/env python2

    __license__ = 'GPL v3'
    __copyright__ = '2008, Darko Miletic <darko.miletic at gmail.com>'
    '''
    demorgen.be
    '''

    from calibre.web.feeds.news import BasicNewsRecipe


    class DeMorganBe(BasicNewsRecipe):
        title = u'De Morgen'
        __author__ = u'Darko Miletic'
        description = u'News from Belgium in Dutch'
        oldest_article = 3
        language = 'nl_BE'
        max_articles_per_feed = 100
        no_stylesheets = True
        use_embedded_content = False

        keep_only_tags = [
            dict(name='div', attrs={'class': 'reader-title'}),
            dict(name='h1'),
            dict(name='div', attrs={'class': 'credits'}),
            dict(name='div', attrs={'class': 'meta-data'}),
            # dict(name='div', attrs={'class': 'moz-reader-block-img'}),
            dict(name='img'),
            dict(name='div', attrs={'class': 'header-intro'}),
            dict(name='p'),
        ]

        feeds = [
            (u'Nieuws', u'http://www.demorgen.be/nieuws/rss.xml'),
            (u'In het nieuws', u'https://www.demorgen.be/in-het-nieuws/rss.xml'),
            (u'Niet te missen', u'https://www.demorgen.be/niet-te-missen/rss.xml'),
            (u'Beter leven', u'http://www.demorgen.be/beter-leven/rss.xml'),
            (u'Crisis Midden-Oosten', u'http://www.demorgen.be/aanval-op-israel/rss.xml'),
            (u'Koken met de Morgen', u'http://www.demorgen.be/koken-met-de-morgen/rss.xml'),
            (u'Meningen', u'http://www.demorgen.be/meningen/rss.xml'),
            (u'Politiek', u'http://www.demorgen.be/politiek/rss.xml'),
            (u'TV & Cultuur', u'http://www.demorgen.be/tv-cultuur/rss.xml'),
            (u'Oorlog in Oekraine', u'http://www.demorgen.be/oorlog-in-oekraine/rss.xml'),
            (u'Tech & Wetenschap', u'http://www.demorgen.be/tech-wetenschap/rss.xml'),
            (u'Sport', u'http://www.demorgen.be/sport/rss.xml'),
            (u'Podcasts', u'http://www.demorgen.be/podcasts/rss.xml'),
            (u'Puzzels', u'http://www.demorgen.be/puzzels/rss.xml'),
            (u'Cartoons', u'http://www.demorgen.be/puzzels-cartoons/rss.xml'),
            (u'Achter de schermen', u'http://www.demorgen.be/achter-de-schermen/rss.xml'),
            (u'Best gelezen', u'http://www.demorgen.be/popular/rss.xml')
        ]
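
One possible refinement, offered as a sketch rather than a tested recipe: the feed URLs in the recipe above all follow the same pattern, so the list can be generated from section slugs. `SECTIONS`, `BASE` and `build_feeds` are hypothetical names, only a few sections are shown, and using https for every section is an assumption (the list above mixes http and https).

```python
# Sketch: build the feeds list from (title, slug) pairs instead of
# repeating the full URL on every line.

# Assumption: every section is also served over https.
BASE = 'https://www.demorgen.be/{slug}/rss.xml'

# Titles and slugs taken from the feeds list above (shortened here).
SECTIONS = [
    ('Nieuws', 'nieuws'),
    ('In het nieuws', 'in-het-nieuws'),
    ('Meningen', 'meningen'),
    ('Politiek', 'politiek'),
]


def build_feeds(sections, base=BASE):
    """Return a list of (title, url) tuples in the shape calibre expects for `feeds`."""
    return [(title, base.format(slug=slug)) for title, slug in sections]


feeds = build_feeds(SECTIONS)
```

If the helper is defined at module level, `feeds = build_feeds(SECTIONS)` can be placed inside the recipe class in place of the hand-written list. Dropping a section then means deleting one short line from `SECTIONS`.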
#3 |
Junior Member
Posts: 3
Karma: 10
Join Date: Dec 2024
Device: elipsa
|
Unless I'm mistaken, I can't edit my post above anymore. I improved the recipe yesterday evening. The most annoying remaining problem is the empty pages between articles. Pictures are included now, but some unwanted ones remain, and I don't know how to exclude them without excluding all pictures. The cover picture also isn't ideal, and the titles of the "chapters" don't match the actual content. Anyway, here's the new code, which still needs work. At least the content is there again.
Code:

    #!/usr/bin/env python2

    __license__ = 'GPL v3'
    __copyright__ = '2008, Darko Miletic <darko.miletic at gmail.com>'
    '''
    demorgen.be
    '''

    from calibre.web.feeds.news import BasicNewsRecipe


    class DeMorganBe(BasicNewsRecipe):
        title = u'De Morgen'
        __author__ = u'Darko Miletic'
        description = u'News from Belgium in Dutch'
        oldest_article = 1
        language = 'nl_BE'
        max_articles_per_feed = 100
        no_stylesheets = False
        use_embedded_content = False

        def get_cover_url(self):
            cover_url = "https://usercontent.one/wp/www.insidejazz.be/wp-content/uploads/2018/11/pic0143.png"
            return cover_url

        keep_only_tags = [
            dict(name='div', attrs={'class': 'reader-title'}),
            dict(name='h1'),
            dict(name='div', attrs={'class': 'credits'}),
            dict(name='div', attrs={'class': 'meta-data'}),
            dict(name='div', attrs={'class': 'moz-reader-block-img'}),
            dict(name='img'),
            dict(name='div', attrs={'class': 'header-intro'}),
            dict(name='p'),
        ]

        remove_tags = [
            # dict(name='script'),
            dict(name='p', attrs={'class': 'rtlowr1'}),
            dict(name='p', attrs={'class': 'qmn3qt1'}),
            dict(name='img', attrs={'class': '_1ubw0re1 _3ej1u36'}),
            dict(name='img', attrs={'class': '_15tatjw0'}),
            # dict(name='ul', attrs={'class': 'bulletSeparatedList'}),
            # dict(name='a', attrs={'class': 'shareImage'}),
            dict(name='h2'),
        ]

        feeds = [
            (u'Nieuws', u'http://www.demorgen.be/nieuws/rss.xml'),
            (u'In het nieuws', u'https://www.demorgen.be/in-het-nieuws/rss.xml'),
            (u'Niet te missen', u'https://www.demorgen.be/niet-te-missen/rss.xml'),
            (u'Beter leven', u'http://www.demorgen.be/beter-leven/rss.xml'),
            (u'Crisis Midden-Oosten', u'http://www.demorgen.be/aanval-op-israel/rss.xml'),
            # (u'Koken met de Morgen', u'http://www.demorgen.be/koken-met-de-morgen/rss.xml'),
            (u'Meningen', u'http://www.demorgen.be/meningen/rss.xml'),
            (u'Politiek', u'http://www.demorgen.be/politiek/rss.xml'),
            (u'TV & Cultuur', u'http://www.demorgen.be/tv-cultuur/rss.xml'),
            (u'Oorlog in Oekraine', u'http://www.demorgen.be/oorlog-in-oekraine/rss.xml'),
            (u'Tech & Wetenschap', u'http://www.demorgen.be/tech-wetenschap/rss.xml'),
            # (u'Sport', u'http://www.demorgen.be/sport/rss.xml'),
            # (u'Podcasts', u'http://www.demorgen.be/podcasts/rss.xml'),
            # (u'Puzzels', u'http://www.demorgen.be/puzzels/rss.xml'),
            # (u'Cartoons', u'http://www.demorgen.be/puzzels-cartoons/rss.xml'),
            # (u'Achter de schermen', u'http://www.demorgen.be/achter-de-schermen/rss.xml'),
            # (u'Best gelezen', u'http://www.demorgen.be/popular/rss.xml')
        ]
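
The empty pages and unwanted pictures mentioned above might be tackled in `preprocess_html`, which calibre documents as being called after the `keep_only_tags` / `remove_tags` cleanup. The following is a minimal sketch, not a tested fix. The helper names and the `DeMorgenCleanup` class are hypothetical, the class names are copied from the `remove_tags` list above and may change whenever the site is rebuilt, and it is only a guess that whitespace-only paragraphs are what produce the blank pages.

```python
# Sketch only: the helpers below are hypothetical, not part of calibre.
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback so the helpers can be tried outside calibre.
    BasicNewsRecipe = object

# Each set is one combination of classes that marks an unwanted image.
# Values copied from the remove_tags list above (assumption: still current).
UNWANTED_IMG_CLASS_GROUPS = [
    {'_1ubw0re1', '_3ej1u36'},
    {'_15tatjw0'},
]


def class_tokens(value):
    """Return the class attribute as a set, whether it arrives as str or list."""
    if not value:
        return set()
    if isinstance(value, str):
        return set(value.split())
    return set(value)


def is_unwanted_image(class_value):
    """True when the image carries every class of at least one unwanted group."""
    tokens = class_tokens(class_value)
    return any(group <= tokens for group in UNWANTED_IMG_CLASS_GROUPS)


def is_blank(text):
    """True when the text is only whitespace, counting non-breaking spaces."""
    return not text.replace('\xa0', ' ').strip()


class DeMorgenCleanup(BasicNewsRecipe):
    title = 'De Morgen (cleanup sketch)'

    def preprocess_html(self, soup):
        # Drop only the images whose classes match an unwanted group.
        for img in soup.findAll('img'):
            if is_unwanted_image(img.get('class')):
                img.extract()
        # Drop paragraphs that contain neither text nor an image.
        for p in soup.findAll('p'):
            if p.find('img') is None and is_blank(p.get_text()):
                p.extract()
        return soup
```

In a real recipe the `preprocess_html` method would go into the existing recipe class instead of a separate one. Matching on class sets rather than one exact string keeps the filter working if the site adds an extra class to the same image.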
#4 |
Fanatic
Posts: 584
Karma: 82946
Join Date: May 2021
Device: kindle
|