Do recipes use a cache?
I'm working on a recipe using parse_index and soup to read a page at a URL that never changes. That first page has a link to a second page. The second page has a link to a third page, etc.
These pages contain the content (articles) that I want, as well as the links I need to build the article index. I grab the first page, then use the data on page 1 to create the article link for that page and the article link for page 2. Then I read page 2 into BeautifulSoup, find the link for page 3, add that to my index, and so on.
At this point everything is great. I've got my parsed index, and if I let it run, I get the content I want from my parsed index, just as if it had been read from an RSS feed.
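For reference, here is a minimal sketch of the index-building loop described above. It is not my actual recipe: the `class="next"` marker on the link, the `build_index` and `NextLinkFinder` names, the placeholder titles, and the injected `fetch` callable are all assumptions made for illustration, and it uses the standard library parser rather than BeautifulSoup so that it runs on its own. The return value follows the shape parse_index is expected to produce, a list of (feed title, list of article dicts).

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a class="next"> found on a page.

    The class="next" marker is hypothetical; a real site will need its
    own way of identifying the link to the following page.
    """

    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('class') == 'next' and self.next_url is None:
            self.next_url = attrs.get('href')


def build_index(first_url, fetch, max_pages=50):
    """Follow the chain of pages and return a parse_index-style result.

    fetch is any callable that takes a URL and returns the page HTML as
    a string. In a recipe it would wrap whatever the recipe uses to
    download a page; here it is injected so the loop can run offline.
    """
    articles = []
    seen = set()  # guards against a page that links back to an earlier one
    url = first_url
    while url and url not in seen and len(articles) < max_pages:
        seen.add(url)
        # Each page in the chain is itself an article.
        articles.append({'title': 'Page %d' % (len(articles) + 1), 'url': url})
        finder = NextLinkFinder()
        finder.feed(fetch(url))
        url = finder.next_url  # None on the last page, which ends the loop
    return [('Articles', articles)]


if __name__ == '__main__':
    # Three fake pages standing in for the real site.
    pages = {
        'p1': '<p>one</p><a class="next" href="p2">next</a>',
        'p2': '<p>two</p><a class="next" href="p3">next</a>',
        'p3': '<p>three</p>',
    }
    print(build_index('p1', pages.__getitem__))
```

The point of the sketch is that every page is downloaded once while the index is built, and is then downloaded again later as an article, which is where the behaviour below shows up.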
However, trouble starts when I try to modify the pages with preprocess_html or preprocess_regexps. It looks as though the recipe is pulling the pages (which I've already downloaded to build my article/feed index) from a cache, rather than fetching them again and running them through preprocess_html. Has anyone seen this interaction, or does anyone have suggestions for dealing with it?
Thanks.