MobileRead Forums - View Single Post - (mostly) Russian and Ukrainian sources: state of built-in recipes, fixes, new recipes

bugmen00t · 07-21-2022, 05:08 PM

A couple new recipes. Not sure how to correctly specify the language for those recipes that are being downloaded in Russian but their originating source is outside Russia. To avoid fragmentation, it probably would be easier to change language from "ru_UK", "ru_DE", "ru_GB" back to just "ru".

UNIAN.net (Russian version): Ukrainian Independent News Agency of News, one of the most cited source of news from across Ukraine. Favicon

Spoiler:

Code:

#!/usr/bin/env python
# vim:fileencoding=utf-8

from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe, classes

class Unian(BasicNewsRecipe):
    title = u'\u0423\u041D\u0418\u0410\u041D '
    description = u'\u0423\u043A\u0440\u0430\u0438\u043D\u0441\u043A\u043E\u0435 \u041D\u0435\u0437\u0430\u0432\u0438\u0441\u0438\u043C\u043E\u0435 \u0418\u043D\u0444\u043E\u0440\u043C\u0430\u0446\u0438\u043E\u043D\u043D\u043E\u0435 \u0410\u0433\u0435\u043D\u0442\u0441\u0442\u0432\u043E \u041D\u043E\u0432\u043E\u0441\u0442\u0435\u0439 \u2013 \u043F\u0435\u0440\u0432\u043E\u0435 \u0432 \u0423\u043A\u0440\u0430\u0438\u043D\u0435 \u0438 \u0441\u0430\u043C\u043E\u0435 \u0431\u043E\u043B\u044C\u0448\u043E\u0435 \u043D\u0435\u0437\u0430\u0432\u0438\u0441\u0438\u043C\u043E\u0435 \u0438\u043D\u0444\u043E\u0440\u043C\u0430\u0446\u0438\u043E\u043D\u043D\u043E\u0435 \u0430\u0433\u0435\u043D\u0442\u0441\u0442\u0432\u043E, \u043E\u0441\u043D\u043E\u0432\u0430\u043D\u043D\u043E\u0435 \u0432 1993 \u0433\u043E\u0434\u0443, \u043B\u0438\u0434\u0435\u0440 \u0441\u0440\u0435\u0434\u0438 \u043D\u043E\u0432\u043E\u0441\u0442\u043D\u044B\u0445 \u043C\u0435\u0434\u0438\u0430 \u0441\u0442\u0440\u0430\u043D\u044B, \u0441\u0430\u043C\u044B\u0439 \u0446\u0438\u0442\u0438\u0440\u0443\u0435\u043C\u044B\u0439 \u0438\u0441\u0442\u043E\u0447\u043D\u0438\u043A \u043D\u043E\u0432\u043E\u0441\u0442\u0435\u0439 \u043E \u0441\u043E\u0431\u044B\u0442\u0438\u044F\u0445 \u0432 \u0441\u0442\u0440\u0430\u043D\u0435.'
    __author__ = 'bugmen00t'
    publication_type = 'newspaper'
    oldest_article = 7
    max_articles_per_feed = 100
    language = 'ru_UK'
    cover_url = 'https://www.unian.net/images/unian-512x512.png'
    auto_cleanup = False
    no_stylesheets = True
    
    remove_tags_before = dict(name='h1')
    remove_tags_after = dict(name='div', attrs={'class': 'article-text'})
    remove_tags = [
        dict(name='span', attrs={'class': 'article__info-item comments'}),
        dict(name='span', attrs={'class': 'article__info-item views'}),
        dict(name='div', attrs={'class': 'read-also-slider'})
    ]

    feeds = [
    (u'\u0423\u041D\u0418\u0410\u041D', u'https://rss.unian.net/site/news_rus.rss')
    ]

Old-Games.ru: community project devoted to preservation of old computer games. Favicon

Fixes needed:

Replace list elements with <div> tags
Find less brutal way of removing attribute style='display:none ' from all <div> tags rather than just nuking all styles

Spoiler:

Code:

#!/usr/bin/env python
# vim:fileencoding=utf-8

from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe, classes


class OGRU(BasicNewsRecipe):
    title = u'Old-Games.RU'
    __author__ = 'bugmen00t'
    description = u'Old-Games.RU \u2014 \u043A\u0440\u0443\u043F\u043D\u0435\u0439\u0448\u0438\u0439 \u0440\u043E\u0441\u0441\u0438\u0439\u0441\u043A\u0438\u0439 \u0430\u0440\u0445\u0438\u0432 \u0441\u0442\u0430\u0440\u044B\u0445 \u043A\u043E\u043C\u043F\u044C\u044E\u0442\u0435\u0440\u043D\u044B\u0445 \u0438\u0433\u0440. \u041C\u044B \u043D\u0435 \u0441\u0442\u0430\u0432\u0438\u043C \u043F\u0435\u0440\u0435\u0434 \u0441\u043E\u0431\u043E\u0439 \u0446\u0435\u043B\u0438 \u0441\u043E\u0431\u0440\u0430\u0442\u044C \u0432\u0441\u0435 \u0438\u0433\u0440\u044B, \u0447\u0442\u043E \u0435\u0441\u0442\u044C \u0432 \u043C\u0438\u0440\u0435, \u043D\u043E \u043C\u044B \u0441\u0442\u0430\u0440\u0430\u0435\u043C\u0441\u044F, \u0447\u0442\u043E\u0431\u044B \u043D\u0430 \u0441\u0430\u0439\u0442\u0435 \u0431\u044B\u043B\u043E \u043F\u0440\u0435\u0434\u0441\u0442\u0430\u0432\u043B\u0435\u043D\u043E \u0431\u043E\u043B\u044C\u0448\u0438\u043D\u0441\u0442\u0432\u043E \u0448\u0435\u0434\u0435\u0432\u0440\u043E\u0432, \u0440\u0435\u0434\u043A\u043E\u0441\u0442\u0435\u0439 \u0438 \u043F\u0440\u043E\u0441\u0442\u043E \u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u044B\u0445 \u043F\u0440\u043E\u0435\u043A\u0442\u043E\u0432 \u043F\u0440\u043E\u0448\u043B\u044B\u0445 \u043B\u0435\u0442. \u0421 \u0442\u0435\u0447\u0435\u043D\u0438\u0435\u043C \u0432\u0440\u0435\u043C\u0435\u043D\u0438 \u0433\u0440\u0430\u0444\u0438\u0447\u0435\u0441\u043A\u043E\u0435 \u0438 \u0437\u0432\u0443\u043A\u043E\u0432\u043E\u0435 \u043E\u0444\u043E\u0440\u043C\u043B\u0435\u043D\u0438\u0435 \u0438\u0433\u0440 \u043D\u0430\u0448\u0435\u0433\u043E \u0430\u0440\u0445\u0438\u0432\u0430 \u0437\u0430\u043C\u0435\u0442\u043D\u043E \u0443\u0441\u0442\u0430\u0440\u0435\u043B\u043E, \u043D\u043E \u0438\u0433\u0440\u043E\u0432\u043E\u0439 \u043F\u0440\u043E\u0446\u0435\u0441\u0441 \u043E\u0441\u0442\u0430\u043B\u0441\u044F \u043F\u0440\u0435\u0436\u043D\u0438\u043C, \u0438 \u043F\u043E\u0440\u043E\u0439 \u043E\u043D \u0433\u043E\u0440\u0430\u0437\u0434\u043E \u0438\u043D\u0442\u0435\u0440\u0435\u0441\u043D\u0435\u0435, \u0447\u0435\u043C \u0432\u043E \u043C\u043D\u043E\u0433\u0438\u0445 \u0441\u043E\u0432\u0440\u0435\u043C\u0435\u043D\u043D\u044B\u0445 \u00AB\u0445\u0438\u0442\u0430\u0445\u00BB.'
    publisher = 'Old-Games.RU'
    publication_type = 'blog'
    category = 'news, games, retro'
    language = 'ru'
    cover_url = 'https://www.old-games.ru/forum/styles/default/old-games/logo.og.png'
    oldest_article = 50
    max_articles_per_feed = 50
    no_stylesheets = True
    auto_cleanup = False
    
    remove_tags_before = dict(name='article')

    remove_tags_after = dict(name='article')

    remove_attributes = ['style']

    remove_tags =   [
        dict(name='p', attrs={'id': 'pageDescription'}),
        dict(name='div', attrs={'class': 'pageNavLinkGroup'}),
        dict(name='div', attrs={'class': 'tagBlock TagContainer'}),
        dict(name='div', attrs={'class': 'NoAutoHeader PollContainer'}),
        dict(name='div', attrs={'class': 'likesSummary secondaryContent'}),
        dict(name='div', attrs={'class': 'editDate'}),
        dict(name='div', attrs={'class': 'attachedFiles'}),
        dict(name='div', attrs={'class': 'item muted postNumber hashPermalink OverlayTrigger'}),
        dict(name='div', attrs={'class': 'messageUserInfo'})
        ]

    feeds = [
        (u'\u041D\u043E\u0432\u043E\u0441\u0442\u0438', 'https://feeds.feedburner.com/Old-games-ru-news'),
        (u'\u0421\u0442\u0430\u0442\u044C\u0438', 'https://feeds.feedburner.com/Old-games-ru-articles')
        ]

Новая Газета. Европа (Russian version): European re-incarnation of Новая Газета newspaper. Favicon

Fixes needed:

No images in articles (webp format)

Spoiler:

Code:

#!/usr/bin/env python
# vim:fileencoding=utf-8

from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe, classes

class NovayaGazetaEurope(BasicNewsRecipe):
    title = u'\u041D\u043E\u0432\u0430\u044F \u0413\u0430\u0437\u0435\u0442\u0430. \u0415\u0432\u0440\u043E\u043F\u0430'
    __author__ = 'bugmen00t'
    description = u'\u0413\u043E\u0432\u043E\u0440\u0438\u043C \u043A\u0430\u043A \u0435\u0441\u0442\u044C. \u041F\u0438\u0448\u0435\u043C \u043E \u043F\u0440\u043E\u0438\u0441\u0445\u043E\u0434\u044F\u0449\u0435\u043C \u0432 \u0420\u043E\u0441\u0441\u0438\u0438, \u0423\u043A\u0440\u0430\u0438\u043D\u0435 \u0438 \u0415\u0432\u0440\u043E\u043F\u0435. \u041D\u043E\u0432\u043E\u0441\u0442\u0438, \u0430\u043D\u0430\u043B\u0438\u0442\u0438\u043A\u0430, \u043C\u043D\u0435\u043D\u0438\u044F \u044D\u043A\u0441\u043F\u0435\u0440\u0442\u043E\u0432, \u0441\u043F\u0435\u0446\u0438\u0430\u043B\u044C\u043D\u044B\u0435 \u0440\u0435\u043F\u043E\u0440\u0442\u0430\u0436\u0438 \u0438 \u0436\u0443\u0440\u043D\u0430\u043B\u0438\u0441\u0442\u0441\u043A\u0438\u0435 \u0440\u0430\u0441\u0441\u043B\u0435\u0434\u043E\u0432\u0430\u043D\u0438\u044F.'
    publisher = '\u041A\u0438\u0440\u0438\u043B\u043B \u041C\u0430\u0440\u0442\u044B\u043D\u043E\u0432'
    publication_type = 'newspaper'
    category = 'news'
    language = 'ru'
    cover_url = 'https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/5dc71e2d-9763-4f05-8f4e-92049fa32af7_513x513.png'
    oldest_article = 15
    max_articles_per_feed = 50
    auto_cleanup = False
    
    remove_tags_before = dict(name='h1')

    remove_tags_after = dict(name='div', attrs={'class': 'ArticleBlocks_wrapperNoAside__11_bu'})

    remove_tags =   [
        dict(name='div', attrs={'class': 'EmbedNative_root__2lgsH'})
        ]

    feeds = [
        (u'\u041D\u043E\u0432\u043E\u0441\u0442\u0438', 'https://novayagazeta.eu/feed/rss/ru')
        ]
        
    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
               tstr = alink.string
               alink.replaceWith(tstr)
        return soup

Вёрстка: socio-political online media researching Russian society. Favicon.

Spoiler:

Кедр: independent environmental media. Favicon.

Fixes needed:

No images in some articles (<figure> tag and/or webp format)

Spoiler:

Deutsche Welle на русском: Russian version of Deutsche Welle. Favicon.

Spoiler:

Code:

#!/usr/bin/env python
# vim:fileencoding=utf-8

from __future__ import unicode_literals, division, absolute_import, print_function
from calibre.web.feeds.news import BasicNewsRecipe, classes

class DeutscheWelle(BasicNewsRecipe):
    title = u'Deutsche Welle \u043D\u0430 \u0440\u0443\u0441\u0441\u043A\u043E\u043C'
    description = u'\u0420\u0443\u0441\u0441\u043A\u0430\u044F \u0440\u0435\u0434\u0430\u043A\u0446\u0438\u044F Deutsche Welle: \u043D\u043E\u0432\u043E\u0441\u0442\u0438, \u0430\u043D\u0430\u043B\u0438\u0442\u0438\u043A\u0430, \u043A\u043E\u043C\u043C\u0435\u043D\u0442\u0430\u0440\u0438\u0438 \u0438 \u0440\u0435\u043F\u043E\u0440\u0442\u0430\u0436\u0438 \u0438\u0437 \u0413\u0435\u0440\u043C\u0430\u043D\u0438\u0438 \u0438 \u0415\u0432\u0440\u043E\u043F\u044B, \u043D\u0435\u043C\u0435\u0446\u043A\u0438\u0439 \u0438 \u0435\u0432\u0440\u043E\u043F\u0435\u0439\u0441\u043A\u0438\u0439 \u0432\u0437\u0433\u043B\u044F\u0434 \u043D\u0430 \u0441\u043E\u0431\u044B\u0442\u0438\u044F \u0432 \u0420\u043E\u0441\u0441\u0438\u0438 \u0438 \u043C\u0438\u0440\u0435,  \u043F\u0440\u0430\u043A\u0442\u0438\u0447\u0435\u0441\u043A\u0438\u0435 \u0441\u043E\u0432\u0435\u0442\u044B \u0434\u043B\u044F \u0442\u0443\u0440\u0438\u0441\u0442\u043E\u0432 \u0438 \u0442\u0435\u0445, \u043A\u0442\u043E \u0436\u0435\u043B\u0430\u0435\u0442 \u0443\u0447\u0438\u0442\u044C\u0441\u044F \u0438\u043B\u0438 \u0440\u0430\u0431\u043E\u0442\u0430\u0442\u044C \u0432 \u0413\u0435\u0440\u043C\u0430\u043D\u0438\u0438 \u0438 \u0434\u0440\u0443\u0433\u0438\u0445 \u0441\u0442\u0440\u0430\u043D\u0430\u0445 \u0415\u0432\u0440\u043E\u0441\u043E\u044E\u0437\u0430.'
    __author__ = 'bugmen00t'
    publication_type = 'newspaper'
    oldest_article = 14
    max_articles_per_feed = 100
    language = 'ru_DE'
    cover_url = 'https://www.dw.com/cssi/dwlogo-print.gif'
    auto_cleanup = False
    no_stylesheets = False
    
    remove_tags_before = dict(name='h1')

    remove_tags_after = dict(name='div', attrs={'class': 'longText'})

    feeds = [
        (u'\u0412\u0435\u0441\u044C \u0441\u0430\u0439\u0442', 'https://rss.dw.com/xml/rss-ru-all'),
        (u'\u041D\u043E\u0432\u043E\u0441\u0442\u0438', 'http://rss.dw.de/xml/rss-ru-news'),
        (u'\u041F\u043E\u043B\u0438\u0442\u0438\u043A\u0430 \u0438 \u043E\u0431\u0449\u0435\u0441\u0442\u0432\u043E', 'http://rss.dw.de/xml/rss-ru-pol'),
        (u'\u042D\u043A\u043E\u043D\u043E\u043C\u0438\u043A\u0430', 'http://rss.dw.de/xml/rss-ru-eco'),
        (u'\u0410\u0432\u0442\u043E\u043C\u043E\u0431\u0438\u043B\u044C', 'http://rss.dw.de/xml/rss-ru-auto'),
        (u'\u041A\u0443\u043B\u044C\u0442\u0443\u0440\u0430 \u0438 \u0441\u0442\u0438\u043B\u044C \u0436\u0438\u0437\u043D\u0438', 'http://rss.dw.de/xml/rss-ru-cul'),
        (u'\u0420\u043E\u0441\u0441\u0438\u044F', 'http://rss.dw.de/xml/rss-ru-rus'),
        (u'\u0413\u0435\u0440\u043C\u0430\u043D\u0438\u044F', 'http://rss.dw.de/xml/rss-ru-ger'),
        (u'\u0415\u0432\u0440\u043E\u043F\u0430', 'http://rss.dw.de/xml/rss-ru-eu'),
        (u'\u0411\u0435\u043B\u0430\u0440\u0443\u0441\u044C', 'http://rss.dw.de/xml/rss-ru-bel'),
        (u'\u0423\u0447\u0435\u0431\u0430 \u0438 \u043A\u0430\u0440\u044C\u0435\u0440\u0430', 'http://rss.dw.de/xml/rss-ru-campus-karriere'),
        (u'\u0423\u0447\u0435\u0431\u0430 ', 'http://rss.dw.de/xml/rss-ru-campus'),
        (u'\u041A\u0430\u0440\u044C\u0435\u0440\u0430 ', 'http://rss.dw.de/xml/rss-ru-karriere'),
        (u'\u0422\u0443\u0440\u0438\u0441\u0442\u0443 \u043D\u0430 \u0437\u0430\u043C\u0435\u0442\u043A\u0443', 'http://rss.dw.de/xml/rss-ru-discover-ger')
    ]

Русская служба BBC: BBC News in Russian. Favicon.

Fixes needed:

No images in some articles (lazyload)
No images in some articles (webp format)

Spoiler: