FIX: New York Times Recipe
I've updated the New York Times recipe to fix the issue from my previous post, where articles were occasionally missing. The problem was in the postprocess_html function, where a minor formatting problem could cause the whole article to be dropped. I've included the updated procedure, recipe, and explanation below (this may be happening in other recipes as well).
The caption of one of the photos has a paragraph within a paragraph:
<p class="caption"><p><em>“There’s no doubt in my mind that the whole trial will be about did he know right from wrong.”</em><strong> CLARENCE DUPNIK</strong> Pima County sheriff</p> </p>
The postprocess procedure thinks there are two paragraphs, and the second paragraph is empty. As a result, evaluating caption.contents[0] in the following code raises an index out of range error:
for caption in soup.findAll(True, {'class':'caption'}):
    if caption and caption.contents[0]:
The first fix was simply to replace the caption.contents[0] test with len(caption) > 0. The second fix wraps every minor change this procedure makes in a try/except block, so that a small inconsistency no longer causes an article to be dropped: the error is logged and the article is still included.
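For anyone who wants to apply the same pattern to another recipe, here is a rough sketch of the two fixes in isolation. It is not the actual recipe code: FakeTag is a hypothetical stand-in that models only the .contents list and len() of a BeautifulSoup tag so the example runs on its own, and first_child_or_none, apply_tweaks, and the logger name are made-up names for illustration.

```python
import logging

log = logging.getLogger("nytimes_recipe_sketch")  # hypothetical logger name


class FakeTag:
    """Hypothetical stand-in for a tag: only .contents and len() are modeled."""

    def __init__(self, contents):
        self.contents = list(contents)

    def __len__(self):
        return len(self.contents)


def first_child_or_none(tag):
    """Return the first child of tag, or None when the tag has no children."""
    # Fix 1: test the length first, so contents[0] is never
    # evaluated on an empty tag.
    if tag is not None and len(tag) > 0:
        return tag.contents[0]
    return None


def apply_tweaks(article, tweaks):
    """Run each cosmetic tweak on its own; return the names of those that failed."""
    failed = []
    for tweak in tweaks:
        # Fix 2: one try/except per tweak, so a failure is logged
        # and the remaining tweaks (and the article) survive.
        try:
            tweak(article)
        except Exception:
            log.exception("tweak %s failed; keeping the article", tweak.__name__)
            failed.append(tweak.__name__)
    return failed
```

The point of giving each tweak its own try/except, rather than wrapping the whole procedure in one, is that a single malformed caption only costs that one cosmetic change instead of every change that would have run after it.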
Last edited by bcollier; 01-17-2011 at 01:52 PM.