Quote:
Originally Posted by kovidgoyal
All pages are processed by postprocess_html and all pages have remove_tags applied to them.
I did some re-tests and am sorry to say I cannot confirm this.
For additional pages that I download in preprocess_html or postprocess_html with self.index_to_soup(url), remove_tags is not applied. (In my case, a certain div is not removed.)
If I log the soup given to preprocess_html, remove_tags has already been applied. (In my case, that certain div is already removed.)
If I download additional pages with self.index_to_soup(url) in preprocess_html and add them to the original first page with "insert", the combined page is then processed by postprocess_html, but remove_tags is not re-applied to it. (In my case, that certain div is not removed from the combined page.)
I'm not complaining here, this sounds more than logical.

I am just curious whether there is any way to re-process a page downloaded within preprocess_html the same way any other downloaded page is processed. I only have the remove_tags issue now and can certainly re-implement it in preprocess_html, but this doesn't sound like a smart way to do it.
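For reference, the manual re-implementation I have in mind would look roughly like the sketch below. It is only a guess at an approach, not how calibre does it internally: apply_remove_tags is a hypothetical helper name of mine, and I am assuming that the soup returned by self.index_to_soup(url) offers the BeautifulSoup-style findAll and extract methods and that each remove_tags entry is a keyword dict such as dict(name='div', attrs={'class': 'ad'}).

```python
def apply_remove_tags(soup, remove_tags):
    """Strip every tag matching one of the remove_tags specs from soup.

    Hypothetical helper, not part of calibre. Assumes soup is a
    BeautifulSoup-style object and remove_tags is a list of keyword
    dicts that can be passed straight to findAll.
    """
    for spec in remove_tags:
        # Copy the matches into a list first, then detach each one,
        # so the tree is not modified while it is being searched.
        for tag in list(soup.findAll(**spec)):
            tag.extract()
    return soup
```

In preprocess_html I would then call apply_remove_tags(extra_soup, self.remove_tags) on each soup fetched with self.index_to_soup(url) before inserting it into the first page.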
Cheers,
- aero