MobileRead Forums - View Single Post

Starson17 · 03-10-2010, 09:15 PM

I'm working on a recipe using parse_index and soup to read a page at a URL that never changes. That first page has a link to a second page. The second page has a link to a third page, etc.

These pages contain the content (articles) that I want, as well as the links I want for the articles. I grab the first page, create the article link for that page and the article link for page 2 from the data on page 1. Then I read page 2 into BeautifulSoup, find the link for page 3 and stick that into my index, etc.
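A minimal sketch of that page-chaining loop, written outside calibre so it can be run on its own. The names `build_index`, `fetch` and `NEXT_LINK`, and the `class="next"` markup, are assumptions for illustration and are not taken from the actual recipe. The return value follows the shape parse_index is expected to produce: a list of (feed title, list of article dicts) pairs.

```python
import re
from urllib.parse import urljoin

# Hypothetical markup: each page points to the next one with
# <a class="next" href="...">. A real site will need its own pattern.
NEXT_LINK = re.compile(r'<a[^>]*class="next"[^>]*href="([^"]+)"', re.IGNORECASE)


def build_index(start_url, fetch, max_pages=10):
    """Follow the chain of "next" links starting at start_url.

    fetch is any callable that maps a URL to that page's HTML text.
    Returns a parse_index-style list: [(feed_title, [article_dict, ...])].
    """
    articles = []
    seen = set()
    url = start_url
    # Stop on a missing link, a loop back to a visited page, or the page limit.
    while url and url not in seen and len(articles) < max_pages:
        seen.add(url)
        html = fetch(url)
        articles.append({
            'title': 'Page %d' % (len(articles) + 1),
            'url': url,
            'description': '',
            'date': '',
        })
        match = NEXT_LINK.search(html)
        # Resolve relative links against the page they were found on.
        url = urljoin(url, match.group(1)) if match else None
    return [('Chained pages', articles)]
```

Inside a recipe, `fetch` would wrap whatever the recipe already uses to download a page. Passing it in as a parameter keeps the loop testable against canned HTML.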

At this point everything is great. I've got my parsed index, and if I let it run, I get the content I want from my parsed index, just as if it had been read from an RSS feed.

However, trouble rears its head when I try to modify the pages with preprocess_html or preprocess_regexps. It looks as though the pages I already downloaded to build my article/feed index are being pulled from a cache, rather than being run through preprocess_html before they are grabbed. Has anyone seen this interaction, or have suggestions for dealing with it?
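One way to narrow this down is to check the rules by themselves, away from the download step. The sketch below assumes rules in the same shape preprocess_regexps takes, a list of (compiled pattern, replacement function) pairs. The `apply_rules` helper and the ad-stripping pattern are hypothetical. This only shows whether the rules do what is intended on a saved copy of a page. It says nothing about whether a cache is involved.

```python
import re

# Same shape as a recipe's preprocess_regexps: (compiled pattern, function) pairs.
# The pattern below is a made-up example that strips a hypothetical ad block.
rules = [
    (re.compile(r'<div class="ad">.*?</div>', re.DOTALL | re.IGNORECASE),
     lambda match: ''),
]


def apply_rules(html, rules):
    """Apply each (pattern, replacement function) pair to html, in order."""
    for pattern, replace in rules:
        html = pattern.sub(replace, html)
    return html
```

If the output is right here but the downloaded articles are still unmodified, the rules themselves are probably not the problem.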

Thanks.

Post #1 · Thread: Do recipes use a cache?