MobileRead Forums - View Single Post

Aimylios · 07-13-2016, 03:42 PM

I was able to reproduce your problem. The reason seems to be that the index page is simply too big for BeautifulSoup (almost 300 kB), so the soup returned by index_to_soup (line 52 of foreignaffairs.recipe) is incomplete.
Looking at the implementation of index_to_soup in news.py, I see that the initial download succeeds (i.e. the variable _raw contains the whole page as a string). It is the final conversion in line 682 that returns a broken data structure.

I suspect that BeautifulSoup's ability to handle such big pages depends on various factors (operating system, 32-bit vs. 64-bit, available memory, etc.), so not all users will experience this problem. Maybe Kovid has an idea for a workaround. I guess one could reimplement parse_index without using BeautifulSoup and instead rely on another library like html5lib.
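
To illustrate the idea of skipping BeautifulSoup, here is a rough sketch that pulls links straight out of the raw page string. It is not the recipe's actual code: I use Python 3's standard-library html.parser as a stand-in for html5lib so the example has no dependencies, and the names LinkCollector and extract_links are made up for this post. A real parse_index would still need to filter the links down to the article entries and group them into sections.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (title, url) pairs for every <a href=...> in a page.

    Hypothetical helper, not part of calibre or the recipe.
    """

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []
        self._href = None   # absolute URL of the link being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            title = " ".join("".join(self._text).split())
            self.links.append((title, self._href))
            self._href = None


def extract_links(raw_html, base_url):
    """Return all (title, url) pairs found in raw_html."""
    parser = LinkCollector(base_url)
    parser.feed(raw_html)
    parser.close()
    return parser.links
```

In a recipe, the raw string could probably be obtained with index_to_soup(url, raw=True), which as far as I know returns the downloaded data without building a soup; please check that against your calibre version before relying on it.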
