Some changes:
- Filtered out feeds with the title prefix 'Video:' - most only have one line of text
- Prevented duplicated content by setting recursions to zero and checking each URL against a list of feeds already processed
- Removed line breaks and empty paragraphs from the storyTop section as these cause unsightly white space (tries to sensibly replace line breaks between text with spaces)
- Tries to fetch extra images related to the content when labeled with ...Click here for graphic... (this may need improving if the pattern changes wildly) - see this page for an example
- Added some flags at the top to disable image fetching
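
For anyone wanting to reuse the 'Video:' filter and the duplicate check, here is a rough standalone sketch of the idea. It is not the code from the recipe itself: the function name filter_articles, the seen_urls set, and the dict layout with "title" and "url" keys are all made up for illustration.

```python
def filter_articles(articles, seen_urls):
    """Drop 'Video:' items and any URL that has already been processed.

    `articles` is assumed to be a list of dicts with "title" and "url" keys
    (a hypothetical layout); `seen_urls` is a set shared across all feeds.
    """
    kept = []
    for article in articles:
        title = article.get("title", "").strip()
        url = article.get("url", "")
        if title.startswith("Video:"):
            continue  # video items usually carry only one line of text
        if not url or url in seen_urls:
            continue  # no link, or already taken from an earlier feed
        seen_urls.add(url)
        kept.append(article)
    return kept


# Example: the same story appears in two feeds, plus one video item.
seen = set()
news = [
    {"title": "Video: Storm hits coast", "url": "http://example.com/v1"},
    {"title": "Storm hits coast", "url": "http://example.com/a1"},
]
sport = [
    {"title": "Storm hits coast", "url": "http://example.com/a1"},
    {"title": "Cup final report", "url": "http://example.com/a2"},
]
news_kept = filter_articles(news, seen)
sport_kept = filter_articles(sport, seen)
```

A set is used here instead of a list only because membership checks are cheaper; a list would behave the same way.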
I was thinking about removing the advertorial articles (see here) but could not see a clean way of doing this. As far as I am aware, they are only identifiable by the text 'Advertorial Feature ' in <div class=" ... strapLine">, so I was thinking of returning None in preprocess_soup if the text was found (this causes an AttributeError exception to be raised). Can anyone think of a nicer solution?
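
To make the detection part concrete, here is a minimal sketch of the check described above, written against the standard library only so it can be tried outside the recipe. The names StrapLineFinder and is_advertorial are hypothetical, and it assumes the marker text always sits inside a div whose class list includes strapLine. Where the recipe would call it, and how the article then gets skipped, is left open.

```python
from html.parser import HTMLParser


class StrapLineFinder(HTMLParser):
    """Collects the text inside <div> elements whose class includes 'strapLine'."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside a strapLine div (0 = outside)
        self.chunks = []  # text fragments found inside strapLine divs

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside the strapLine div
        elif "strapLine" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def is_advertorial(html):
    """Return True if a strapLine div contains the text 'Advertorial Feature'."""
    finder = StrapLineFinder()
    finder.feed(html)
    finder.close()
    # Collapse runs of whitespace so line breaks in the markup do not matter.
    text = " ".join("".join(finder.chunks).split())
    return "Advertorial Feature" in text


advert = '<div class="article strapLine">Advertorial Feature </div><p>Buy now</p>'
normal = '<div class="strapLine">News</div><p>Advertorial Feature</p>'
```

With a boolean check like this, the decision to drop an article could be made before the page is handed on, rather than relying on the exception.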