I have been using scmp.recipe to scrape the South China Morning Post for several months now, but some issues have recently started to appear. A brief summary follows:
Incomplete Content: The content of each article is not fully retrieved; some parts are missing. Upon checking the source feeds (e.g., https://www.scmp.com/rss/2/feed), it appears that, much like the situation with The Economist Espresso several months ago, the feeds themselves do not include the full content. I am not sure whether there is any other way to resolve this.
Invalid Content: The scraped content often contains irrelevant entries such as "Advertisement." I wonder if there is a way to filter such content during the scraping process.
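To make the filtering idea concrete, here is a minimal sketch of the kind of cleanup I have in mind. It is not taken from scmp.recipe; the function name strip_ad_labels and the pattern are my own illustration, and it assumes the "Advertisement" label arrives as a standalone <p>, <div>, or <span> element, which I have not verified against the actual page markup.

```python
import re

# Hypothetical pattern: a <p>, <div>, or <span> whose only text is
# "Advertisement" (case-insensitive, surrounding whitespace allowed).
# The real SCMP markup may differ; this is an assumption.
AD_LABEL = re.compile(
    r"<(p|div|span)\b[^>]*>\s*advertisement\s*</\1>",
    flags=re.IGNORECASE,
)


def strip_ad_labels(html: str) -> str:
    """Remove standalone 'Advertisement' elements from an HTML string."""
    return AD_LABEL.sub("", html)


sample = '<p>One</p><div class="ad">Advertisement</div><p>Two</p>'
cleaned = strip_ad_labels(sample)
print(cleaned)
```

Because the pattern only matches elements whose entire text is the label, a normal sentence that happens to contain the word "advertisement" is left alone. Nested markup inside the label element would not be matched by this sketch.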