12-24-2024, 03:00 AM   #1
bucovaina78
recipe request for demorgen.be

I'm fairly new to calibre and e-readers. I'm trying to download the news from https://www.demorgen.be with calibre, but I only get the article titles, not the actual content. I looked at the source code of the built-in recipe, which I believe is this link.

Most of the articles are behind a paywall, but it's a "soft" paywall: if you disable JavaScript or use a browser's reader mode, you can read all articles.
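
A quick way to confirm that the article text is already present in the static HTML (calibre's standard recipe fetcher does not run page JavaScript) is to extract the paragraph text from a saved copy of the page. The sketch below uses only the standard library. The `sample` markup is a made-up stand-in, not real demorgen.be HTML, and `paragraph_text` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements of an HTML document."""

    def __init__(self):
        super().__init__()
        self.inside_p = False   # True while the parser is inside a <p>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.inside_p = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.inside_p = False

    def handle_data(self, data):
        # Script contents and headings are skipped because they are not in a <p>.
        if self.inside_p and data.strip():
            self.chunks.append(data.strip())


def paragraph_text(html):
    """Return all paragraph text of `html`, joined with single spaces."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


# Made-up stand-in for an article page saved with JavaScript disabled.
sample = (
    '<html><body><h1>Titel</h1><script>paywall()</script>'
    '<p>Eerste alinea.</p><p>Tweede alinea.</p></body></html>'
)
text = paragraph_text(sample)
print(text)
```

If `paragraph_text` returns the article body for a page saved with JavaScript disabled, a `keep_only_tags` rule that matches those paragraphs should be able to pick it up.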

Can anyone help with a new recipe? I'm fairly technical, but programming/Python is unfortunately not one of my strengths.
01-03-2025, 02:32 PM   #2
bucovaina78
OK, so I figured out how to do it. It's not a perfect recipe, but at least it shows all the content again. I'd still like to refine it, though:
  • With images enabled, all of them are loaded. If you want images, uncomment the corresponding line, but it makes the EPUB a lot larger. Is there a way to exclude only some images? For example, the header image with the demorgen.be logo is on every page, and I'd like to exclude it.
  • I'd like to include the author as well, but I can't get that to work.
  • The estimated reading time is also in the article, but I don't know how to include it either.
  • I want to exclude the "loading social media posts" blocks.

Code:
#!/usr/bin/env python

__license__ = 'GPL v3'
__copyright__ = '2008, Darko Miletic <darko.miletic at gmail.com>'
'''
demorgen.be
'''

from calibre.web.feeds.news import BasicNewsRecipe


class DeMorganBe(BasicNewsRecipe):
    title = u'De Morgen'
    __author__ = u'Darko Miletic'
    description = u'News from Belgium in Dutch'
    oldest_article = 3
    language = 'nl_BE'

    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False

    keep_only_tags = [
        dict(name='div', attrs={'class': 'reader-title'}),
        dict(name='h1'),
        dict(name='div', attrs={'class': 'credits'}),
        dict(name='div', attrs={'class': 'meta-data'}),
        # Uncomment the next two lines to include images (makes the EPUB much larger):
        # dict(name='div', attrs={'class': 'moz-reader-block-img'}),
        # dict(name='img'),
        dict(name='div', attrs={'class': 'header-intro'}),
        dict(name='p'),
    ]

    feeds = [
        (u'Nieuws', u'http://www.demorgen.be/nieuws/rss.xml'),
        (u'In het nieuws', u'https://www.demorgen.be/in-het-nieuws/rss.xml'),
        (u'Niet te missen', u'https://www.demorgen.be/niet-te-missen/rss.xml'),
        (u'Beter leven', u'http://www.demorgen.be/beter-leven/rss.xml'),
        (u'Crisis Midden-Oosten', u'http://www.demorgen.be/aanval-op-israel/rss.xml'),
        (u'Koken met de Morgen', u'http://www.demorgen.be/koken-met-de-morgen/rss.xml'),
        (u'Meningen', u'http://www.demorgen.be/meningen/rss.xml'),
        (u'Politiek', u'http://www.demorgen.be/politiek/rss.xml'),
        (u'TV & Cultuur', u'http://www.demorgen.be/tv-cultuur/rss.xml'),
        (u'Oorlog in Oekraine', u'http://www.demorgen.be/oorlog-in-oekraine/rss.xml'),
        (u'Tech & Wetenschap', u'http://www.demorgen.be/tech-wetenschap/rss.xml'),
        (u'Sport', u'http://www.demorgen.be/sport/rss.xml'),
        (u'Podcasts', u'http://www.demorgen.be/podcasts/rss.xml'),
        (u'Puzzels', u'http://www.demorgen.be/puzzels/rss.xml'),
        (u'Cartoons', u'http://www.demorgen.be/puzzels-cartoons/rss.xml'),
        (u'Achter de schermen', u'http://www.demorgen.be/achter-de-schermen/rss.xml'),
        (u'Best gelezen', u'http://www.demorgen.be/popular/rss.xml')
    ]
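
One possible way to approach the image and social-media questions above is `remove_tags`, which drops matching elements from each downloaded page. The following is a minimal sketch, not a tested recipe for the live site. The class names `site-logo` and `social-embed` are hypothetical placeholders that have to be replaced with the ones found in the page source, `has_class` is a helper written for this sketch (not part of calibre), and the stub `BasicNewsRecipe` only exists so the snippet can be checked outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stub so this sketch can be syntax-checked outside calibre.
    class BasicNewsRecipe(object):
        pass


def has_class(*wanted):
    """Build an attrs matcher: true if the class attribute holds any wanted name."""
    wanted = frozenset(wanted)

    def matcher(value):
        # value is None for tags that have no class attribute
        return bool(value) and bool(wanted.intersection(value.split()))
    return matcher


class DeMorgenTrimmed(BasicNewsRecipe):
    title = u'De Morgen (trimmed)'
    language = 'nl_BE'
    oldest_article = 3
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False

    # Recompress downloaded images to keep the EPUB smaller.
    compress_news_images = True

    keep_only_tags = [
        dict(name='h1'),
        dict(name='div', attrs={'class': 'credits'}),
        dict(name='div', attrs={'class': 'meta-data'}),
        dict(name='div', attrs={'class': 'header-intro'}),
        dict(name='img'),
        dict(name='p'),
    ]

    remove_tags = [
        # Hypothetical class names: replace with the ones in the page source.
        dict(name='img', attrs={'class': has_class('site-logo')}),
        dict(name=['div', 'p'], attrs={'class': has_class('social-embed')}),
        # Embedded widgets often arrive as iframes or scripts.
        dict(name=['iframe', 'script']),
        # To drop every image instead, add: dict(name='img'),
    ]

    feeds = [
        (u'Nieuws', u'http://www.demorgen.be/nieuws/rss.xml'),
    ]
```

Matching on a single class name rather than the full attribute string keeps a rule working when an element carries several classes at once.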
01-04-2025, 03:36 AM   #3
bucovaina78
Unless I'm mistaken, I can't edit my post above anymore. I improved the recipe yesterday evening. The most annoying remaining problem is the empty pages between articles. I also have pictures in it now, but some unwanted ones remain, and I don't know how to exclude them without excluding all pictures. The cover picture isn't ideal either, and the titles of the "chapters" don't match the actual content. Anyway, here's the new code, which still needs work. At least the content is there again.

Code:
#!/usr/bin/env python

__license__ = 'GPL v3'
__copyright__ = '2008, Darko Miletic <darko.miletic at gmail.com>'
'''
demorgen.be
'''

from calibre.web.feeds.news import BasicNewsRecipe


class DeMorganBe(BasicNewsRecipe):
    title = u'De Morgen'
    __author__ = u'Darko Miletic'
    description = u'News from Belgium in Dutch'
    oldest_article = 1
    language = 'nl_BE'

    max_articles_per_feed = 100
    no_stylesheets = False
    use_embedded_content = False

    def get_cover_url(self):
        cover_url = "https://usercontent.one/wp/www.insidejazz.be/wp-content/uploads/2018/11/pic0143.png"
        return cover_url

    keep_only_tags = [
        dict(name='div', attrs={'class': 'reader-title'}),
        dict(name='h1'),
        dict(name='div', attrs={'class': 'credits'}),
        dict(name='div', attrs={'class': 'meta-data'}),
        dict(name='div', attrs={'class': 'moz-reader-block-img'}),
        dict(name='img'),
        dict(name='div', attrs={'class': 'header-intro'}),
        dict(name='p'),
    ]
    remove_tags = [
        # dict(name='script'),
        dict(name='p', attrs={'class': 'rtlowr1'}),
        dict(name='p', attrs={'class': 'qmn3qt1'}),
        dict(name='img', attrs={'class': '_1ubw0re1 _3ej1u36'}),
        dict(name='img', attrs={'class': '_15tatjw0'}),
        # dict(name='ul', attrs={'class': 'bulletSeparatedList'}),
        # dict(name='a', attrs={'class': 'shareImage'}),
        dict(name='h2'),
    ]

    feeds = [
        (u'Nieuws', u'http://www.demorgen.be/nieuws/rss.xml'),
        (u'In het nieuws', u'https://www.demorgen.be/in-het-nieuws/rss.xml'),
        (u'Niet te missen', u'https://www.demorgen.be/niet-te-missen/rss.xml'),
        (u'Beter leven', u'http://www.demorgen.be/beter-leven/rss.xml'),
        (u'Crisis Midden-Oosten', u'http://www.demorgen.be/aanval-op-israel/rss.xml'),
        # (u'Koken met de Morgen', u'http://www.demorgen.be/koken-met-de-morgen/rss.xml'),
        (u'Meningen', u'http://www.demorgen.be/meningen/rss.xml'),
        (u'Politiek', u'http://www.demorgen.be/politiek/rss.xml'),
        (u'TV & Cultuur', u'http://www.demorgen.be/tv-cultuur/rss.xml'),
        (u'Oorlog in Oekraine', u'http://www.demorgen.be/oorlog-in-oekraine/rss.xml'),
        (u'Tech & Wetenschap', u'http://www.demorgen.be/tech-wetenschap/rss.xml'),
        # (u'Sport', u'http://www.demorgen.be/sport/rss.xml'),
        # (u'Podcasts', u'http://www.demorgen.be/podcasts/rss.xml'),
        # (u'Puzzels', u'http://www.demorgen.be/puzzels/rss.xml'),
        # (u'Cartoons', u'http://www.demorgen.be/puzzels-cartoons/rss.xml'),
        # (u'Achter de schermen', u'http://www.demorgen.be/achter-de-schermen/rss.xml'),
        # (u'Best gelezen', u'http://www.demorgen.be/popular/rss.xml')
    ]
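
For the empty pages mentioned above, one thing worth trying is to drop paragraphs that contain neither text nor an image, and to skip empty feeds and articles repeated across sections. This is a sketch under the assumption that empty `<p>` elements and duplicated articles are the cause, which has not been verified against the site. `is_blank` is a helper written for this sketch, and the stub `BasicNewsRecipe` only exists so the snippet can be checked outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stub so this sketch can be syntax-checked outside calibre.
    class BasicNewsRecipe(object):
        pass


def is_blank(text):
    """True if text is None, empty, or only whitespace (incl. non-breaking spaces)."""
    return not (text or '').replace(u'\xa0', ' ').strip()


class DeMorgenTidy(BasicNewsRecipe):
    title = u'De Morgen (tidy)'
    language = 'nl_BE'
    oldest_article = 1
    max_articles_per_feed = 100
    use_embedded_content = False

    # Leave out sections that end up with no articles.
    remove_empty_feeds = True
    # Skip an article that already appeared in an earlier section.
    ignore_duplicate_articles = {'title', 'url'}

    keep_only_tags = [
        dict(name='h1'),
        dict(name='img'),
        dict(name='p'),
    ]

    feeds = [
        (u'Nieuws', u'http://www.demorgen.be/nieuws/rss.xml'),
        (u'Politiek', u'http://www.demorgen.be/politiek/rss.xml'),
    ]

    def preprocess_html(self, soup):
        # Remove paragraphs that hold neither text nor an image.
        for p in soup.findAll('p'):
            if p.find('img') is None and is_blank(self.tag_to_string(p)):
                p.extract()
        return soup
```

If blank pages remain after this, the cause is probably elsewhere, for example in page-break styling pulled in with `no_stylesheets = False`.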
01-11-2025, 02:37 AM   #4
unkn0wn
https://github.com/kovidgoyal/calibr...e0e5fdd76c6bf8
All times are GMT -4. The time now is 05:15 AM.