Old 12-27-2023, 01:05 PM   #1
Villard
Connoisseur
Villard began at the beginning.
 
Posts: 64
Karma: 10
Join Date: May 2016
Device: Koreader running on Kobo Libra 2
The Conversation recipe: why are some parts of the article in a different language?

Hello,
I'm using the built-in recipe for The Conversation. This recipe is based on the Australian edition of The Conversation:
https://theconversation.com/au/articles.atom


I'm French, I'm using a French Windows PC, and my browser is set to French.

The Conversation also exists in French: https://theconversation.com/fr

The strange thing is that the articles in the downloaded edition of The Conversation contain some French parts. For example, "Read also" is replaced by "A lire aussi". The body text of the article is of course in English, but some of the surrounding information is in French.

It also happens in the page menus (which appear when I change the auto_cleanup value to False).

NB: In my browser (Chrome), the original article doesn't contain a single French word.

It seems the recipe replaces some tag contents with the French translation, because there is

How can I get rid of that and have only English in all the articles?
Of course I can remove some irrelevant tags, but I want, for example, to keep the Disclosure statement, which is given in French and not in English as it should be.


Original Disclosure statement
<section class="content-disclosure-statement">
<h3 class="border">Disclosure statement</h3>
<p><span>Mark A. XXXX
[ name hidden by me] does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.</span></p>
</section>


Disclosure statement in the output
<div class="calibre10">
<h3 class="border">Déclaration d’intérêts</h3>
<p class="role"><span class="calibre13">Mark A. XXXX ne travaille pas, ne conseille pas, ne possède pas de parts, ne reçoit pas de fonds d'une organisation qui pourrait tirer profit de cet article, et n'a déclaré aucune autre affiliation que son organisme de recherche.</span></p>
</div>

Thanks
Villard

Last edited by Villard; 12-27-2023 at 01:30 PM.
Old 12-27-2023, 09:26 PM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 44,033
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The recipe does not do any translation. If that's what's present, it will be because that's what's present in the downloaded HTML.
Old 12-28-2023, 01:51 AM   #3
Villard
Hello Kovid,
Of course I know the recipe does not translate.
But I still get this strange result. I'm not an expert in HTML; my guess is that there may be some scripts in the web page that the recipe does not handle properly.


I tried again today with this article:
https://theconversation.com/as-aussi...akeries-214378


In my Chrome browser, all parts of the page are in English. The Disclosure statement section is indeed in English:
<section class="content-disclosure-statement">
<h3 class="border">Disclosure statement</h3>
<p><span>Garritt C Van Dyk does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.</span></p>
</section>


In my EPUB, it is in French:
<div class="calibre8">
<h3 class="calibre9">Déclaration d’intérêts</h3>
<p class="role"><span>Garritt C Van Dyk ne travaille pas, ne conseille pas, ne possède pas de parts, ne reçoit pas de fonds d'une organisation qui pourrait tirer profit de cet article, et n'a déclaré aucune autre affiliation que son organisme de recherche.</span></p>
</div>





I use this built-in recipe (I have just changed auto_cleanup = True to auto_cleanup = False, but the strange phenomenon appears in both cases):
Code:
    from calibre.web.feeds.news import BasicNewsRecipe


    class TheConversation(BasicNewsRecipe):
        title = u'The Conversation'
        language = 'en'
        __author__ = 'Krittika Goyal'
        oldest_article = 4  # days
        max_articles_per_feed = 20
        use_embedded_content = False
        no_stylesheets = True
        auto_cleanup = False
        feeds = [
            ('Arts + Culture', 'https://theconversation.com/au/arts/articles.atom'),
            ('Business + Economy', 'https://theconversation.com/au/business/articles.atom'),
        ]
        calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
Old 12-28-2023, 03:08 AM   #4
unkn0wn
Evangelist
unkn0wn can do the Funky Gibbon.
 
Posts: 468
Karma: 82692
Join Date: May 2021
Device: kindle
Maybe the RSS feeds in this recipe also contain links to the French articles.
Or the English article page contains hidden tags with French text.

You can add something like if url.__contains__('theconversation.com/fr/'): return '' inside def print_version.
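
A minimal sketch of that suggestion, assuming print_version is called with each article URL and that French-edition pages can be recognised by the /fr/ path segment. The class name FrenchFilterSketch is hypothetical, the try/except around the import is only there so the snippet can be tried outside calibre, and whether an empty return value is enough for calibre to skip the article is not confirmed anywhere in this thread.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Assumption: fallback base class so the sketch loads outside calibre.
    BasicNewsRecipe = object


class FrenchFilterSketch(BasicNewsRecipe):
    # Hypothetical recipe name, used only for this illustration.
    title = 'The Conversation (French filter sketch)'

    def print_version(self, url):
        # "x in url" is the usual spelling of url.__contains__(x).
        if 'theconversation.com/fr/' in url:
            # Empty string for French-edition URLs, as suggested above.
            return ''
        # Every other URL is passed through unchanged.
        return url
```

Note that this only filters on the URL, so it would not help if French text is served under an English article URL.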
Old 12-28-2023, 07:45 AM   #5
Villard
Hello unkn0wn,
I'll try your suggestion with url.__contains__. I'm not familiar with it, so I'll need some time.

The RSS feeds contain only English articles, and the only other link is the one pointing to the different homepage for each region. Inside the original web page, there is no tag with French text.

It's very weird.
Old 12-29-2023, 05:42 AM   #6
Villard
Hello,
I’m still looking for a way to get “pure” English pages.
My notes:
1/ There are different homepages according to the region: https://theconversation.com/au/ , https://theconversation.com/fr/ , https://theconversation.com/uk/.
But all the articles, whatever their language and region, are stored at the root of https://theconversation.com/
There is a banner at the top of the web page that shows the current edition and lets you choose another one. By default, the edition is set to French.
I’m using this article https://theconversation.com/thinking...-fading-216078 , which has been published only in the Australian edition. This article cannot be reached by navigating within the French edition.

2/ So when I paste the above URL into my web browser, I’m by default under the French edition banner. The page is displayed via the French edition, and some of its contents, such as the Disclosure statement section, are in French. When I save this page on my PC, the corresponding tag contents are in French.
I then change the edition to Australia and go to the same article by navigating within the Australian edition. Then all the contents are in English, and if I save this page on my PC, all the tag contents are of course in English.

The two saved pages also differ in some tags, such as these:
<html lang="fr-FR" class="svg js">
[…]
<meta name="current-region" content="fr">
<meta http-equiv="Content-Language" content="fr-FR">


<html lang="en-GB" class="svg js">
[…]
<meta name="current-region" content="uk">
<meta http-equiv="Content-Language" content="en-GB">
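
To compare two saved copies quickly, something like the following could read those tags back out. This is only a sketch using Python's standard html.parser; the names EditionProbe and probe_edition are made up for illustration, and the sample string is trimmed down from the French copy quoted above.

```python
from html.parser import HTMLParser


class EditionProbe(HTMLParser):
    """Collect the <html lang> attribute and the current-region meta tag."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.region = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == 'html':
            self.lang = attributes.get('lang')
        elif tag == 'meta' and attributes.get('name') == 'current-region':
            self.region = attributes.get('content')


def probe_edition(html_text):
    """Return (lang, region) as found in the given HTML text."""
    probe = EditionProbe()
    probe.feed(html_text)
    return probe.lang, probe.region


# Sample trimmed from the saved French-edition page.
sample = '''<html lang="fr-FR" class="svg js">
<head>
<meta name="current-region" content="fr">
<meta http-equiv="Content-Language" content="fr-FR">
</head>
</html>'''

print(probe_edition(sample))
```

Running the same function on the HTML that the recipe actually downloads would show which edition the site served to calibre.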



3/ So I thought the recipe (see below) would be fine, because the feeds are based on the Australian edition: https://theconversation.com/au/health/articles.atom.
But this is not the case! The output EPUB contains some tag contents in French. So I have to force BeautifulSoup to access the page by somehow navigating inside the Australian edition, rather than through the default edition linked to my French region.


How could I do that? I have not yet tried the suggestion given by unkn0wn.

Anyway, there may be some dynamic elements in the article URL that need to be taken care of before the page is downloaded.


The recipe:

Code:
    from calibre.web.feeds.news import BasicNewsRecipe


    class TheConversation(BasicNewsRecipe):
        title = u'The Conversation'
        language = 'en'
        __author__ = 'Krittika Goyal'
        oldest_article = 4  # days
        max_articles_per_feed = 20
        use_embedded_content = False
        no_stylesheets = True
        auto_cleanup = True
        feeds = [
            ('Arts + Culture', 'https://theconversation.com/au/arts/articles.atom'),
            ('Health + Medicine', 'https://theconversation.com/au/health/articles.atom'),
        ]
        calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'

Last edited by Villard; 12-29-2023 at 05:52 AM.
Old 12-29-2023, 07:07 AM   #7
unkn0wn
Code:
    def get_browser(self, *a, **kw):
        br = BasicNewsRecipe.get_browser(self, *a, **kw)
        br.set_cookie('tc_region', 'au', '.theconversation.com')
        return br
Add this to the recipe and check.
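
For anyone unsure where the method goes, a sketch of the recipe from post #6 with the snippet added might look like this. It assumes get_browser belongs inside the recipe class at the same indentation as the other attributes; the class name TheConversationAU is made up, and the try/except around the import is only there so the file can be loaded outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Assumption: fallback so the sketch loads outside calibre.
    # get_browser() itself only works when run by calibre.
    BasicNewsRecipe = object


class TheConversationAU(BasicNewsRecipe):
    # Hypothetical class name; the attributes are those from post #6.
    title = u'The Conversation'
    language = 'en'
    __author__ = 'Krittika Goyal'
    oldest_article = 4  # days
    max_articles_per_feed = 20
    use_embedded_content = False
    no_stylesheets = True
    auto_cleanup = True
    feeds = [
        ('Arts + Culture', 'https://theconversation.com/au/arts/articles.atom'),
        ('Health + Medicine', 'https://theconversation.com/au/health/articles.atom'),
    ]
    calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'

    def get_browser(self, *a, **kw):
        # Start from the browser calibre would normally create.
        br = BasicNewsRecipe.get_browser(self, *a, **kw)
        # Preset the tc_region cookie to 'au' (presumably the cookie the
        # site uses to pick the edition) before any page is fetched.
        br.set_cookie('tc_region', 'au', '.theconversation.com')
        return br
```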
Old 12-29-2023, 08:05 AM   #8
Villard
Well done, unkn0wn! It works!
You found the solution!
Thank you very much!
Now I'll take the time to understand your piece of code...
Thanks again!
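
For readers who also want to understand the snippet from post #7: it seems to work by presetting a cookie named tc_region, so that every request the recipe makes tells the site which edition to serve. How calibre stores the cookie internally is not shown in this thread; the sketch below only illustrates the general mechanism with Python's standard http.cookiejar, and the helper names make_region_cookie and cookie_header_for are made up. No network access is needed.

```python
import http.cookiejar
import urllib.request


def make_region_cookie(region, domain='.theconversation.com'):
    """Build a session cookie named tc_region for the given domain."""
    return http.cookiejar.Cookie(
        version=0, name='tc_region', value=region,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def cookie_header_for(url, jar):
    """Return the Cookie header the jar would attach to a request for url."""
    request = urllib.request.Request(url)
    jar.add_cookie_header(request)  # builds the header, sends nothing
    return request.get_header('Cookie')


jar = http.cookiejar.CookieJar()
jar.set_cookie(make_region_cookie('au'))

# The cookie is attached to requests for the site's own domain...
print(cookie_header_for('https://theconversation.com/au/articles.atom', jar))
# ...and not to requests for an unrelated host.
print(cookie_header_for('https://example.com/', jar))
```

The first call prints tc_region=au and the second prints None, which is the behaviour the fix relies on: the region preference travels with each request to theconversation.com instead of being guessed by the site.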
All times are GMT -4.

