12-14-2015, 04:38 AM   #18
paddyrm
Supplemental feeds

The Guardian changed its web format drastically in November 2015. Before that, articles from the extra sections were stored in named folders, e.g. "Cook", "G2" etc., and the old script would scrape all of these in. A member of the Guardian's User Help team sent me a link to a missing article from the Cook section, pointing me to the URL www.theguardian.com/lifeandstyle/2015/nov/14/, and further investigation showed that nearly all articles from the supplements are now stored in date folders.

Following Kovid's recommendation on adding feeds, I added these lines to the bottom of the Guardian recipe:

    def parse_index(self):
        feeds = self.parse_section(self.base_url)
        feeds += self.parse_section('http://www.theguardian.com/politics/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/lifeandstyle/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/uk/commentisfree/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/travel/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/lifeandstyle/food-and-drink/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/tv-and-radio/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/theguardian/theguide/'+strftime('%Y/%b/%d'))
        return feeds

This works well for the Saturday Guardian, which is my main interest. Other sections can be added for other days as needed.
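
The repeated lines above all follow one pattern (base address, section path, date), so the same URLs could also be generated from a list of section paths. What follows is only a minimal sketch in plain Python, outside calibre: the names BASE, SECTIONS and dated_section_urls are made up for illustration and are not part of the recipe, and the standard library's date formatting stands in for calibre's strftime.

```python
from datetime import date

BASE = 'http://www.theguardian.com/'

# Section paths taken from the parse_index lines above.
SECTIONS = [
    'politics',
    'lifeandstyle',
    'uk/commentisfree',
    'travel',
    'lifeandstyle/food-and-drink',
    'tv-and-radio',
    'theguardian/theguide',
]


def dated_section_urls(day, sections=SECTIONS):
    """Return one dated URL per section: <base><section>/<year>/<month>/<day>."""
    stamp = day.strftime('%Y/%b/%d')  # same format string as in the recipe
    return [BASE + section + '/' + stamp for section in sections]


if __name__ == '__main__':
    for url in dated_section_urls(date.today()):
        print(url)
```

Inside the recipe, each of these URLs would still be passed to self.parse_section in the same way as in the lines above; adding a section for another day then means adding one entry to the list.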

For it to work, two lines need to be added near the top of the script:

from calibre import strftime
(I have it at line 11.) This brings in the PC's clock time via calibre, for use in the feed URLs above. I have used the trick of resetting my PC clock to a previous Saturday to scrape an earlier issue!

ignore_duplicate_articles = {'title', 'url'}
(My line 38.) This is needed because there may be several links to the same article in different parts of the newspaper.
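
As a possible alternative to resetting the PC clock, the date of the most recent Saturday can be worked out in Python. This is a minimal sketch using only the standard library, not something taken from the recipe: most_recent_saturday is a made-up helper name, and using its result in the feed URLs in place of the calibre strftime call is an assumption about how it might be applied.

```python
from datetime import date, timedelta


def most_recent_saturday(today=None):
    """Return today's date if it is a Saturday, else the Saturday before it."""
    if today is None:
        today = date.today()
    # date.weekday() counts Monday as 0, so Saturday is 5.
    days_back = (today.weekday() - 5) % 7
    return today - timedelta(days=days_back)


if __name__ == '__main__':
    saturday = most_recent_saturday()
    # Same format string the recipe passes to strftime.
    print(saturday.strftime('%Y/%b/%d'))
```

Passing a specific date, for example most_recent_saturday(date(2015, 12, 14)), gives the Saturday on or before that day, which would cover the case of fetching an earlier issue.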

Hope this may be of some use to other Guardian readers dismayed by the loss of wanted supplements! And thanks to Kovid for his very helpful suggestions.

Paddy