03-10-2010, 08:15 PM   #1
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Do recipes use a cache?

I'm working on a recipe using parse_index and soup to read a page at a URL that never changes. That first page has a link to a second page. The second page has a link to a third page, etc.

These pages contain the content (articles) that I want, as well as the links I want for the articles. I grab the first page, create the article link for that page and the article link for page 2 from the data on page 1. Then I read page 2 into BeautifulSoup, find the link for page 3 and stick that into my index, etc.
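To make the index-building step concrete, here is a minimal standalone sketch of the page walk. It is not my actual recipe: the `class="next"` marker, the `build_index` and `NextLinkFinder` names, the example.com URLs, and the in-memory `FAKE_SITE` are all made up for illustration, and it uses the standard-library HTML parser rather than BeautifulSoup so it runs outside calibre. The return value is shaped like what parse_index hands back: a list of (feed title, list of article dicts).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Remember the href of the first <a class="next"> (hypothetical marker)."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'a' and self.next_href is None and 'next' in classes:
            self.next_href = attrs.get('href')


def build_index(start_url, fetch, max_pages=10):
    """Follow page 1 -> page 2 -> page 3 ... and return a
    parse_index-style list: [(feed_title, [article_dict, ...])]."""
    articles = []
    seen = set()
    url = start_url
    while url and url not in seen and len(articles) < max_pages:
        seen.add(url)  # guard against a page linking back to itself
        articles.append({'title': 'Page %d' % (len(articles) + 1), 'url': url})
        finder = NextLinkFinder()
        finder.feed(fetch(url))  # fetch(url) must return the page's HTML as str
        url = urljoin(url, finder.next_href) if finder.next_href else None
    return [('Chained pages', articles)]


# Stand-in for the real site so the sketch runs without a network connection.
FAKE_SITE = {
    'http://example.com/latest': '<p>one</p><a class="next" href="/page2">older</a>',
    'http://example.com/page2': '<p>two</p><a class="next" href="/page3">older</a>',
    'http://example.com/page3': '<p>three</p>',
}

index = build_index('http://example.com/latest', FAKE_SITE.__getitem__)
print(index)
```

In a real recipe the `fetch` argument would be replaced by whatever reads the page into soup, and the `max_pages` cap is only there so a broken "next" link cannot loop forever.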

At this point everything is great. I've got my parsed index, and if I let it run, I get the content I want from my parsed index, just as if it had been read from an RSS feed.

However, trouble rears its head when I try to modify the pages with preprocess_html or preprocess_regexps. My changes don't show up in the output. It looks as though the pages I've already downloaded to build my article/feed index are being pulled from a cache, rather than being fetched again and run through preprocess_html. Has anyone seen this interaction, or have suggestions for dealing with it?
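For reference, this is roughly the kind of rule I mean. preprocess_regexps takes a list of (compiled pattern, replacement function) pairs. The `div id="nav"` pattern, the sample HTML, and the `apply_rules` helper below are invented for illustration; the helper is only my assumption about how such rules get applied in order, not calibre's actual code.

```python
import re

# Same shape as a recipe's preprocess_regexps attribute:
# each entry is (compiled pattern, function taking the match object).
preprocess_regexps = [
    # Hypothetical rule: drop a navigation block from the page.
    (re.compile(r'<div id="nav">.*?</div>', re.DOTALL | re.IGNORECASE),
     lambda match: ''),
]


def apply_rules(raw_html, rules):
    """Hypothetical helper: run each (pattern, function) pair over the raw HTML."""
    for pattern, func in rules:
        raw_html = pattern.sub(func, raw_html)
    return raw_html


raw = '<html><body><div id="nav">prev | next</div><p>Article text</p></body></html>'
cleaned = apply_rules(raw, preprocess_regexps)
print(cleaned)
```

Run by hand like this, the rule strips the block as expected, which is why I suspect the downloaded pages are not going through this step at all.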

Thanks.