#1
Connoisseur
Posts: 64
Karma: 10
Join Date: May 2016
Device: Koreader running on Kobo Libra 2
The Conversation recipe: why are some parts of the article in a different language?
Hello
I'm using the builtin recipe for The Conversation. This recipe is based on the Australian edition of The Conversation: https://theconversation.com/au/articles.atom
I'm French, I'm using a French Windows PC, and my browser is in French. The Conversation also exists in French: https://theconversation.com/fr
The strange thing is that the articles in the downloaded edition of The Conversation contain some French parts. For example, "Read also" is replaced by "A lire aussi". The text of the article is of course in English, but some of the surrounding information is in French. The same happens in the page menus (which appear when I change the auto_cleanup value to False).
NB: In my browser (Chrome), the original article doesn't contain any French words.
It seems the recipe replaces the contents of some tags with a French translation. How can I get rid of that and have only English in all the articles? Of course I can remove some irrelevant tags, but I want, for example, to keep the Disclosure statement, which comes out in French and not in English as it should.
Original Disclosure statement:
Code:
<section class="content-disclosure-statement">
  <h3 class="border">Disclosure statement</h3>
  <p><span>Mark A. XXXX[ name hidden by me] does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.</span></p>
</section>
Disclosure statement in the output:
Code:
<div class="calibre10">
  <h3 class="border">Déclaration d’intérêts</h3>
  <p class="role"><span class="calibre13">Mark A. XXXX ne travaille pas, ne conseille pas, ne possède pas de parts, ne reçoit pas de fonds d'une organisation qui pourrait tirer profit de cet article, et n'a déclaré aucune autre affiliation que son organisme de recherche.</span></p>
</div>
Thanks
Villard
Last edited by Villard; 12-27-2023 at 01:30 PM.
#2
creator of calibre
Posts: 44,033
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The recipe does not do any translation. If that's what is present, it is because that's what is present in the downloaded HTML.
#3
Connoisseur
Posts: 64
Karma: 10
Join Date: May 2016
Device: Koreader running on Kobo Libra 2
Hello Kovid
Of course I know the recipe does not translate. But I got this strange result anyway. I'm not an expert in HTML; my guess is that there may be some scripts in the webpage that the recipe does not handle properly.
I tried again today with this article: https://theconversation.com/as-aussi...akeries-214378
In my Chrome browser, all the parts of the page are in English. The Disclosure statement section is indeed in English:
Code:
<section class="content-disclosure-statement">
  <h3 class="border">Disclosure statement</h3>
  <p><span>Garritt C Van Dyk does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.</span></p>
</section>
In my epub, it is in French:
Code:
<div class="calibre8">
  <h3 class="calibre9">Déclaration d’intérêts</h3>
  <p class="role"><span>Garritt C Van Dyk ne travaille pas, ne conseille pas, ne possède pas de parts, ne reçoit pas de fonds d'une organisation qui pourrait tirer profit de cet article, et n'a déclaré aucune autre affiliation que son organisme de recherche.</span></p>
</div>
I use this builtin recipe (I have just changed auto_cleanup = True to auto_cleanup = False, but the strange behaviour appears in both cases):
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class TheConversation(BasicNewsRecipe):
    title = u'The Conversation'
    language = 'en'
    __author__ = 'Krittika Goyal'
    oldest_article = 4  # days
    max_articles_per_feed = 20
    use_embedded_content = False
    no_stylesheets = True
    auto_cleanup = False  # changed from True; the French text appears either way
    feeds = [
        ('Arts + Culture', 'https://theconversation.com/au/arts/articles.atom'),
        ('Business + Economy', 'https://theconversation.com/au/business/articles.atom'),
    ]

calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
#4
Evangelist
Posts: 468
Karma: 82692
Join Date: May 2021
Device: kindle
Maybe the RSS feeds in this recipe also contain links to the French articles.
Or the English article page contains hidden tags with French text. You can add something like this in def print_version:
Code:
def print_version(self, url):
    if url.__contains__('theconversation.com/fr/'):
        return ''
    return url
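For readers who, like the original poster, have not seen url.__contains__ before: it is what Python's `in` operator calls under the hood. Below is a minimal standalone sketch of the same check, written as a plain function so it can be tried outside calibre. The function name skip_french_edition and the sample URLs are hypothetical. Whether calibre actually skips an article when print_version returns an empty string is an assumption here, not something confirmed in this thread.

```python
def skip_french_edition(url):
    """Return '' for URLs under the French edition, else the URL unchanged."""
    # 'x in url' is the idiomatic spelling of url.__contains__(x)
    if 'theconversation.com/fr/' in url:
        return ''
    return url


# Hypothetical sample URLs, for illustration only
french_url = 'https://theconversation.com/fr/example-article'
other_url = 'https://theconversation.com/example-article-123'

kept = [u for u in (french_url, other_url) if skip_french_edition(u)]
```

In a recipe, the body of this function would go inside print_version(self, url). Note that it only filters on the URL, so it cannot help if an English URL is served with French page furniture.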
#5
Connoisseur
Posts: 64
Karma: 10
Join Date: May 2016
Device: Koreader running on Kobo Libra 2
Hello unkn0wn
I'll try your suggestion with url.__contains__. I'm not familiar with that, so I'll need some time. The RSS feeds contain only English articles, and the only other link is the one pointing to the homepage for each region. Inside the original webpage, there is no tag with French text. It's very weird.
#6
Connoisseur
Posts: 64
Karma: 10
Join Date: May 2016
Device: Koreader running on Kobo Libra 2
Hello,
I’m still looking for a way to get “pure” English pages. My notes:
1/ There are different homepages according to the region: https://theconversation.com/au/ , https://theconversation.com/fr/ , https://theconversation.com/uk/. But all the articles, whatever their language and region, are stored at the root of https://theconversation.com/. A banner at the top of the webpage shows the current edition and lets you choose another one. By default, the edition is set to French. I’m using this article, https://theconversation.com/thinking...-fading-216078 , which has been published only in the Australian edition. It is not visible within the navigation of the French edition.
2/ So when I paste the above URL into my web browser, I’m by default under the French edition banner. The page is viewed via the French edition, and some contents of the page, such as the Disclosure statement section, are in French. When I save this page on my PC, the corresponding tag contents are in French. If I then change the edition to Australia and reach the same article by navigating within the Australian edition, all the contents are in English, and when I save this page on my PC, all the tag contents are of course in English. The two saved pages also differ in some tags, like these:
Code:
<html lang="fr-FR" class="svg js">
[…]
<meta name="current-region" content="fr">
<meta http-equiv="Content-Language" content="fr-FR">
Code:
<html lang="en-GB" class="svg js">
[…]
<meta name="current-region" content="uk">
<meta http-equiv="Content-Language" content="en-GB">
3/ So I thought the recipe (see below) would be fine, because the feeds are based on the Australian edition: https://theconversation.com/au/health/articles.atom. But this is not the case! The output epub contains some tag content in French. So I have to force BeautifulSoup to access the page as if it were navigating inside the Australian edition, and not through the default edition linked to my French region.
How could I do that? I’ve not yet tried the suggestion given by unkn0wn. Anyway, there may be some dynamic elements in the article URL that need to be taken care of before the page is downloaded.
The recipe:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class TheConversation(BasicNewsRecipe):
    title = u'The Conversation'
    language = 'en'
    __author__ = 'Krittika Goyal'
    oldest_article = 4  # days
    max_articles_per_feed = 20
    use_embedded_content = False
    no_stylesheets = True
    auto_cleanup = True
    feeds = [
        ('Arts + Culture', 'https://theconversation.com/au/arts/articles.atom'),
        ('Health + Medicine', 'https://theconversation.com/au/health/articles.atom'),
    ]

calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
Last edited by Villard; 12-29-2023 at 05:52 AM.
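The tag differences quoted in note 2/ give a cheap way to check which edition a downloaded page was served under. Here is a rough sketch using only Python's standard html.parser module; the class and function names (EditionSniffer, sniff_edition) are made up for illustration, and the two sample strings are trimmed from the tags quoted above. In a recipe, something like this could presumably be run on the raw HTML inside preprocess_raw_html to log the edition, but that wiring is not shown and is only an assumption.

```python
from html.parser import HTMLParser


class EditionSniffer(HTMLParser):
    """Record the <html lang=...> value and the current-region meta tag."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.region = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'html':
            self.lang = attrs.get('lang')
        elif tag == 'meta' and attrs.get('name') == 'current-region':
            self.region = attrs.get('content')


def sniff_edition(html):
    """Return (lang, region) as found in the page, or None where missing."""
    parser = EditionSniffer()
    parser.feed(html)
    return parser.lang, parser.region


# Samples trimmed from the tags quoted in the post above
french_page = ('<html lang="fr-FR" class="svg js"><head>'
               '<meta name="current-region" content="fr"></head></html>')
english_page = ('<html lang="en-GB" class="svg js"><head>'
                '<meta name="current-region" content="uk"></head></html>')

print(sniff_edition(french_page))
print(sniff_edition(english_page))
```

If the region printed for a downloaded article is fr, the server chose the French edition for that request, which matches what the saved pages show.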
#7
Evangelist
Posts: 468
Karma: 82692
Join Date: May 2021
Device: kindle
Code:
def get_browser(self, *a, **kw):
    br = BasicNewsRecipe.get_browser(self, *a, **kw)
    br.set_cookie('tc_region', 'au', '.theconversation.com')
    return br
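To see what such a cookie does at the HTTP level, here is a minimal sketch that uses only the Python standard library instead of calibre's browser object. It assumes, as the snippet above implies, that the site picks the edition from a cookie named tc_region. The helper name region_cookie and the article URL are hypothetical, and no request is actually sent.

```python
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request


def region_cookie(value='au', domain='.theconversation.com'):
    """Build a session cookie named tc_region for the given domain."""
    return Cookie(
        version=0, name='tc_region', value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


jar = CookieJar()
jar.set_cookie(region_cookie())

# Hypothetical article URL; the request is only prepared, never sent
req = Request('https://theconversation.com/some-article-123')
jar.add_cookie_header(req)
cookie_header = req.get_header('Cookie')

# A request to an unrelated host does not get the cookie
other = Request('https://example.com/')
jar.add_cookie_header(other)
other_header = other.get_header('Cookie')

print(cookie_header, other_header)
```

The point is that every request to the site then carries the region choice, so the server no longer has to guess the edition from where the request comes from.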
#8
Connoisseur
Posts: 64
Karma: 10
Join Date: May 2016
Device: Koreader running on Kobo Libra 2
Well done unkn0wn! It works!
You got the solution! Thank you very much! Now I'll take time to understand your piece of code... Thanks again!