10-13-2023, 02:00 AM   #1
unkn0wn
Evangelist
HTML fetched by calibre is different from what I see in the browser (JS disabled)

The HTML I see in a browser with JavaScript disabled is different from what the site sends to calibre when the recipe fetches the same page.

How do I get the same HTML in calibre that I see in a JS-disabled browser?

How is their server able to detect calibre and send it completely different content?

This is the recipe:
Code:
from calibre.web.feeds.news import BasicNewsRecipe, classes

class newslaundry(BasicNewsRecipe):
    title = 'Newslaundry'
    __author__ = 'unkn0wn'
    description = (
        'Newslaundry is a reader-supported, independent news media company. In an industry driven by corporate'
        ' and government interests, we strongly believe in the need for an independent news model, and a free'
        ' and accountable press.'
    )
    language = 'en_IN'
    masthead_url = 'https://images.assettype.com/newslaundry/2020-01/d91cad07-9650-47e9-8bdc-9a6247354d95/Header_logo_NL__2_New.png'
    encoding = 'utf-8'
    no_stylesheets = True
    remove_javascript = True
    oldest_article = 7 # days
    resolve_internal_links = True

    ignore_duplicate_articles = {'url'}

    # keep_only_tags = [classes('headline subheadline authorWithTimeStamp story-card')]

    feeds = [
        ('Articles', 'https://www.newslaundry.com/stories.rss?time-period=last-7-days')
    ]

    # def preprocess_html(self, soup):
    #     if h1 := soup.find(**classes('headline')):
    #         h1.name = 'h1'
    #     if h3 := soup.find(**classes('subheadline')):
    #         h3.name = 'h3'
    #     return soup

    def print_version(self, url):
        # Skip Hindi-language articles; remove this check if you want them.
        if 'hindi.newslaundry' in url:
            self.abort_article('Skipping hindi article')
        return url


An example of the HTML that I get in calibre:
Code:
<html>
<head><title>Aaj Tak reporter paraglides in Haryana to demonstrate ‘Hamas attack’ in Israel</title></head>
<body>
<h2>Aaj Tak reporter paraglides in Haryana to demonstrate ‘Hamas attack’ in Israel</h2>
<div>
<p>Amid the escalated Israel-Palestine conflict, videos of “Hamas militants paragliding into Israel” have flooded social media. </p>
<p>While some media outlets have <ins><a href="https://www.usatoday.com/story/news/factcheck/2023/10/11/false-claim-video-shows-hamas-parachuters-invading-israel-fact-check/71144077007/">flagged it as false</a></ins>, <em>Aaj Tak </em>reporter Mausami Singh dived right into the adventure sport in Haryana’s Manesar to bring out an “exclusive report” on the purported attack.</p>
<p>How else can you explain what’s happening in a warzone without going paragliding yourself?</p>
<p>It was truly a race to see who could produce the most frivolous reportage yet.</p>
<p>As Singh took off in her paraglider, the text on <em>Aaj Tak</em> read: “<em>Khel ka ek upkaran kaise ban gaya atank ka hathiyar</em>”. (How did equipment for sport become a weapon for terrorism?)</p>
<p>Selfie stick in hand, Mausami delivered her piece to camera mid-air while a man behind her – hopefully a professional – steered the paraglider.</p>
<p>“<em>Yahan par aap samajhne ki koshish kijiye, jo pilot hote hain unhe bhi zada training ki zarurat nahi hain</em>,” she said. (Try to understand, the pilot here also doesn’t need a lot of training.)</p>
<p>She then attempted to explain how the paraglider worked: “<em>Ye iss hath se harness kar rahe hain aur humlog beech mein baith kar pendulum ki tarah</em>.” (He is holding the harness with his hand. And we are sitting in the centre like a pendulum.)</p>
<p>Sounds simple!</p>
<p>Singh’s audio soon worsened so <em>Aaj Tak</em> helpfully moved on to other videos of paragliding to show how it can be used to “easily cross over walls”. The segment also featured army officials who asserted that the “versatile equipment can take off within four to five minutes” and can pass over a boundary “at a distance of 1-2 km within 10-15 minutes”. One of these officers termed it a “poor man’s weapon” and a “terrorist’s weapon”.      </p>
<p><em>Aaj Tak</em> also interviewed a “paragliding enthusiast” who expressed her “shock” over her “joy ride” equipment being used for a terrorist attack. </p>
<p>“Unthinkable,” she said.  </p>
<p>Truly it is. </p>
<aside><a href="https://www.newslaundry.com/2023/10/11/israel-strikes-media-offices-at-least-6-palestinian-journalists-killed">Israel strikes media offices; at least 6 Palestinian journalists killed</a></aside><aside><a href="https://www.newslaundry.com/2023/10/11/israel-palestine-india-response-modi-netanyahu-ministry-of-external-affairs">Why India’s MEA still hasn’t issued an official statement on Israel-Palestine</a></aside>
</div>
</body>
</html>


With this output, I don't even need to set auto_cleanup = True.

I also tried the browser user_agent = 'common_words/based' and still get this simplified HTML.

How do I set up get_browser to look like Firefox?
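
A possible starting point for the get_browser question, as a minimal sketch only. It assumes the browser object returned by get_browser exposes a mechanize-style addheaders list of (name, value) tuples. The helper name apply_headers and the header values are hypothetical; the values should be replaced with whatever your own Firefox actually sends. Headers alone may not be enough if the site looks at something else.

```python
from types import SimpleNamespace

# Hypothetical example values; copy the real ones from your own Firefox
# session (developer tools, Network tab) instead of trusting these.
FIREFOX_LIKE_HEADERS = [
    ('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Language', 'en-US,en;q=0.5'),
]


def apply_headers(br, headers=FIREFOX_LIKE_HEADERS):
    """Replace same-named headers on a browser that has an addheaders list."""
    wanted = {name.lower() for name, _ in headers}
    # Keep the defaults we are not overriding, drop the ones we are.
    kept = [(n, v) for n, v in br.addheaders if n.lower() not in wanted]
    br.addheaders = kept + list(headers)
    return br


# In a recipe this would be called from a get_browser override, roughly:
#
#     def get_browser(self, *args, **kwargs):
#         br = BasicNewsRecipe.get_browser(self, *args, **kwargs)
#         return apply_headers(br)

# Stand-in object so the helper can be tried outside calibre.
fake_browser = SimpleNamespace(
    addheaders=[('User-agent', 'some-default/1.0'), ('Connection', 'keep-alive')]
)
apply_headers(fake_browser)
for name, value in fake_browser.addheaders:
    print(f'{name}: {value}')
```

The stand-in object is only there so the helper can be run without calibre; in a real recipe the browser comes from get_browser itself.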
10-13-2023, 06:38 AM   #2
kovidgoyal
creator of calibre
A site can use a lot of things to tell clients apart, from the SSL handshake algorithms to the HTTP request headers. You can visit the site in a browser with the developer tools open, see exactly which request headers are sent, and mimic those in the recipe. That might work.
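
To make the header comparison concrete, here is a small sketch that lists which request headers differ between two captures. The function name diff_headers and both sample dictionaries are made up for illustration; the real values would come from the browser's developer tools and from whatever the recipe actually sends. It only covers headers, so it says nothing about differences in the SSL handshake.

```python
def diff_headers(browser_headers, recipe_headers):
    """Return {name: (browser_value, recipe_value)} for headers that differ.

    Header names are compared case-insensitively; a missing header shows
    up as None on the side that lacks it.
    """
    b = {name.lower(): value for name, value in browser_headers.items()}
    r = {name.lower(): value for name, value in recipe_headers.items()}
    return {
        name: (b.get(name), r.get(name))
        for name in sorted(set(b) | set(r))
        if b.get(name) != r.get(name)
    }


# Made-up sample captures, not real traffic from the site in question.
from_browser = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
}
from_recipe = {
    'user-agent': 'example-recipe-client/1.0',
    'accept-encoding': 'gzip, deflate, br',
}

differences = diff_headers(from_browser, from_recipe)
for name, (in_browser, in_recipe) in differences.items():
    print(f'{name}: browser={in_browser!r} recipe={in_recipe!r}')
```

Any header listed in the result is a candidate to copy into the recipe, one at a time, until the fetched HTML matches what the browser shows.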