Old 11-11-2011, 03:05 PM   #11
NotTaken
Connoisseur
A few more changes

Some changes:
  • Filtered out feed items whose title starts with 'Video:', as most have only one line of text
  • Prevented duplicated content by setting recursions to zero and checking each URL against a list of feeds already processed
  • Removed line breaks and empty paragraphs from the storyTop section, as these cause unsightly white space (the recipe tries to sensibly replace line breaks between text with spaces)
  • Tries to fetch extra images related to the content when labeled with ...Click here for graphic... (this may need improving if the pattern changes wildly) - see this page for an example
  • Added some flags at the top of the recipe to disable image fetching
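
For anyone who wants the idea without opening the attachment, the first two changes could look roughly like the sketch below. This is not the code from the recipe: the function name, the VIDEO_PREFIX constant and the article dict shape ('title' and 'url' keys) are assumptions made for illustration.

```python
VIDEO_PREFIX = 'Video:'  # assumed constant; the post only names the prefix itself


def filter_articles(articles, seen_urls):
    """Drop video stubs and articles whose URL was already processed.

    `articles` is assumed to be a list of dicts with 'title' and 'url' keys.
    `seen_urls` is a set shared across feeds; it is updated in place.
    """
    kept = []
    for article in articles:
        title = article.get('title', '').strip()
        url = article.get('url', '')
        # Skip 'Video:' items, which mostly contain a single line of text.
        if title.startswith(VIDEO_PREFIX):
            continue
        # Skip items with no URL, or with a URL seen in an earlier feed.
        if not url or url in seen_urls:
            continue
        seen_urls.add(url)
        kept.append(article)
    return kept
```

Passing the same set for every feed is what stops an article listed in two feeds from being fetched twice.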

I was thinking about removing the advertorial articles (see here) but could not see a clean way of doing this. As far as I am aware, they are only identifiable by the text 'Advertorial Feature ' in <div class=" ... strapLine">, so I was thinking of returning None in preprocess_soup when that text is found; however, this causes an AttributeError exception to be raised. Can anyone think of a nicer solution?
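
The detection half of that idea can be sketched on its own, separate from the question of how to drop the article. The snippet below uses only the Python 3 standard library; is_advertorial and the helper class are hypothetical names, not part of the recipe or of calibre, and the marker match is an assumption based on the text quoted above.

```python
from html.parser import HTMLParser

ADVERTORIAL_MARKER = 'Advertorial Feature'  # text quoted in the post, minus trailing space


class _StrapLineText(HTMLParser):
    """Collect text inside any <div> whose class list includes 'strapLine'."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # >0 while inside a matched div (counts nested divs)
        self.chunks = []  # text fragments found inside matched divs

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        classes = (dict(attrs).get('class') or '').split()
        if self.depth or 'strapLine' in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def is_advertorial(html):
    """Return True if a strapLine div contains the advertorial marker text."""
    parser = _StrapLineText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so line breaks inside the div do not matter.
    text = ' '.join(''.join(parser.chunks).split())
    return ADVERTORIAL_MARKER in text
```

What to do once this returns True is still the open question; the sketch only isolates the check so it can be called from wherever skipping an article turns out to be safe.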
Attached Files
File Type: zip independent.recipe.zip (4.6 KB, 65 views)

Last edited by NotTaken; 11-11-2011 at 03:09 PM.