Old 01-25-2016, 05:53 AM   #1
bubak
Connoisseur
Posts: 65
Karma: 10
Join Date: Dec 2010
Device: kindle voyage
Multiple Page Sites

IMHO the reusable code for loading multiple-page articles is wrong. It hooks into preprocess_html, which is applied "after the cleanup as specified by remove_tags etc.", so no cleanup is done on the appended follow-up pages. At least, that is what I experience on FAZ.NET. This site in particular offers an 'Article on one page' link, so that page could be loaded before cleanup instead of appending pages, but I'm not sure what the correct way would be. skip_ad_pages accepts a soup but returns HTML, so it cannot be used when the page is already fine. get_article_url would be the other option, but then the article might have to be loaded twice. Couldn't we have a function that takes and returns the same kind of object (soup or text) and is applied right after the article content is loaded?
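To make the 'Article on one page' idea concrete, here is a rough sketch of the part that finds such a link in the raw page source, using only the Python standard library. The helper name find_single_page_url, the default link text, and the example URLs are my own placeholders, not anything from calibre or from the actual FAZ.NET markup. The commented-out skip_ad_pages wiring at the end is an untested assumption about how it might be hooked into a recipe.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so line breaks in the markup do not matter.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def find_single_page_url(html, base_url, link_text="Article on one page"):
    """Return the absolute URL of the first link whose text contains
    link_text (case-insensitive), or None if there is no such link.

    link_text is a placeholder; the real site may label the link differently.
    """
    collector = _LinkCollector()
    collector.feed(html)
    wanted = link_text.lower()
    for href, text in collector.links:
        if wanted in text.lower():
            return urljoin(base_url, href)
    return None


# Hypothetical wiring inside a recipe (untested sketch, base URL is assumed):
#
#     def skip_ad_pages(self, soup):
#         url = find_single_page_url(str(soup), "http://www.faz.net/")
#         if url:
#             return self.index_to_soup(url, raw=True)  # HTML of the one-page version
#         return None  # page is already fine, keep it as loaded
```

Returning None for a page that has no such link would leave that page untouched, which may already cover the "page is ok" case, but I have not verified this against FAZ.NET.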