12-29-2023, 05:42 AM   #6
Villard
Device: Koreader running on Kobo Libra 2
Hello,
I’m still looking for a way to get “pure” English pages.
My notes:
1/ There are different homepages depending on the region: https://theconversation.com/au/ , https://theconversation.com/fr/ , https://theconversation.com/uk/ .
But all the articles, whatever their language and region, are stored at the root of https://theconversation.com/
A banner at the top of the page shows the current edition and lets you choose another one. For me, the edition is set to French by default.
I’m using this article, https://theconversation.com/thinking...-fading-216078 , which was published only in the Australian edition. It is not visible when navigating within the French edition.

2/ So when I paste the above URL into my web browser, I land by default under the French edition banner. The page is displayed via the French edition, and some of its content, such as the Disclosure Statement section, is in French. When I save this page to my PC, the corresponding tag contents are in French.
I then switch the edition to Australia and open the same article by navigating within the Australian edition. This time all the content is in English, and if I save this page to my PC, all the tag contents are of course in English.

The two saved pages also differ in some tags, such as these:
<html lang="fr-FR" class="svg js">
[…]
<meta name="current-region" content="fr">
<meta http-equiv="Content-Language" content="fr-FR">


<html lang="en-GB" class="svg js">
[…]
<meta name="current-region" content="uk">
<meta http-equiv="Content-Language" content="en-GB">
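
To check quickly which edition a saved page (or a page extracted from the output EPUB) was served from, the tags above can be read back with a small script. This is only a sketch using the Python standard library; the names EditionSniffer and sniff_edition are made up for illustration, and it assumes the three markers shown above are the ones that matter.

```python
from html.parser import HTMLParser


class EditionSniffer(HTMLParser):
    """Collect the language/region markers found in a saved page."""

    def __init__(self):
        super().__init__()
        self.markers = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "html" and "lang" in attrs:
            self.markers["html_lang"] = attrs["lang"]
        elif tag == "meta":
            if attrs.get("name") == "current-region":
                self.markers["current_region"] = attrs.get("content")
            elif (attrs.get("http-equiv") or "").lower() == "content-language":
                self.markers["content_language"] = attrs.get("content")


def sniff_edition(html_text):
    """Return a dict with whichever of the three markers are present."""
    parser = EditionSniffer()
    parser.feed(html_text)
    return parser.markers
```

Calling sniff_edition() on the text of each saved file should show at a glance whether the page came from the French edition or an English one.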



3/ So I thought the recipe (see below) would be fine, because its feeds are based on the Australian edition, e.g. https://theconversation.com/au/health/articles.atom.
But this is not the case! The output EPUB contains some tag content in French. So I have to force BeautifulSoup to access the page as if it were navigating inside the Australian edition, rather than through the default edition linked to my French region.


How could I do that? I have not yet tried the suggestion given by unkn0wn.

Anyway, there may be some dynamic elements in the article URL that need to be taken care of before the page is downloaded.
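
One experiment that might narrow this down, outside calibre: fetch the article with an explicit Accept-Language header and look at the current-region meta tag in the result. This is a sketch only. It assumes, without any confirmation, that the site picks the edition from that header; it could just as well use IP geolocation or a cookie, in which case the header will change nothing. build_request and fetch are illustrative names, not calibre functions.

```python
import urllib.request


def build_request(url, accept_language="en-AU,en;q=0.9"):
    """Prepare a GET request that asks for English content.

    Assumption: the server honours Accept-Language when choosing the edition.
    """
    return urllib.request.Request(
        url,
        headers={
            "Accept-Language": accept_language,
            "User-Agent": "Mozilla/5.0",
        },
    )


def fetch(url, accept_language="en-AU,en;q=0.9"):
    """Download the page and return its HTML as text."""
    request = build_request(url, accept_language)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

If the page fetched this way still reports content="fr" in its current-region tag, the edition is probably chosen some other way and the header is not the lever to pull.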


The recipe:

from calibre.web.feeds.news import BasicNewsRecipe


class TheConversation(BasicNewsRecipe):
    title = u'The Conversation'
    language = 'en'
    __author__ = 'Krittika Goyal'
    oldest_article = 4  # days
    max_articles_per_feed = 20
    use_embedded_content = False
    no_stylesheets = True
    auto_cleanup = True
    # Feeds taken from the Australian edition
    feeds = [
        ('Arts + Culture', 'https://theconversation.com/au/arts/articles.atom'),
        ('Health + Medicine', 'https://theconversation.com/au/health/articles.atom'),
    ]
    calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'

Last edited by Villard; 12-29-2023 at 05:52 AM.