Quote:
Originally Posted by TonytheBookworm
I was wondering how one would parse a website that does something like this:
Article content, then pagination to go to the rest of the article, then on to the rest of the article, and yet keep it all in one article?...
How would you do that? My first guess would be using parse_index()
I refer to this as a "multipage" article. No, you don't use parse_index. You use parse_index when you don't have an RSS feed and need to build your own feed by scraping. The multipage problem occurs later, when the articles in the feeds are actually being processed. At that point, you already have the feed (you might have gotten it by a normal RSS feed or by scraping and building your own with parse_index - it doesn't matter how).
Briefly, for a multipage article you use BeautifulSoup to fetch each subsequent page by following the "next page" links, and you append the content of each one into the soup for the first page, so you end up with a single large BeautifulSoup object. Search this thread for "multipage." Look at the discussion I had with "rty" to see some examples. Search the builtin recipes for "append_page", or search here for it, and you will find many examples of how to do it.
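To make the idea concrete, here is a rough standalone sketch of the "follow the next link and append" loop. It is not taken from any builtin recipe, and everything specific in it is an assumption made up for illustration: the `<div class="article">` body, the `<a class="next">` pager link, the `stitch_pages` name, and the fake `PAGES` site. It also uses plain regexes instead of BeautifulSoup just so it runs anywhere; in a real recipe you would do the same thing on soup objects, typically from `preprocess_html`, fetching each following page with `self.index_to_soup(url)`.

```python
import re
from urllib.parse import urljoin

# Hypothetical markup: the article text sits in <div class="article">...</div>
# and the pager link looks like <a class="next" href="...">.
BODY_RE = re.compile(r'<div class="article">(.*?)</div>', re.S)
NEXT_RE = re.compile(r'<a class="next" href="([^"]+)"')


def stitch_pages(first_url, fetch, max_pages=20):
    """Follow "next page" links and return all article bodies joined as one.

    fetch is any callable that takes a URL and returns that page's HTML.
    """
    parts = []
    seen = set()  # guards against a pager that links back to itself
    url = first_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        body = BODY_RE.search(html)
        if body:
            parts.append(body.group(1).strip())
        nxt = NEXT_RE.search(html)
        # Pager links are often relative, so resolve them against the current page.
        url = urljoin(url, nxt.group(1)) if nxt else None
    return "\n".join(parts)


# A fake three-page article standing in for the real site.
PAGES = {
    "http://example.com/a/1": '<div class="article">Part one.</div>'
                              '<a class="next" href="/a/2">Next</a>',
    "http://example.com/a/2": '<div class="article">Part two.</div>'
                              '<a class="next" href="/a/3">Next</a>',
    "http://example.com/a/3": '<div class="article">Part three.</div>',
}

if __name__ == "__main__":
    print(stitch_pages("http://example.com/a/1", PAGES.__getitem__))
```

The regexes here would break on nested divs, which is one reason the recipes do this with BeautifulSoup tags instead. The loop itself is the part that carries over: grab the body, look for a next link, stop when there isn't one or when you have already seen it.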